This section gives an overview of the thematic classification steps used to produce the final sample of relevant documents.
The following sub-sections describe the online coding tool that was implemented to facilitate the manual coding process, the selection protocol on which the manual coding decisions were based, the reliabilities of the manual coding, and an evaluation of the classifier's performance.
The first step in this process was to load and join the data: the data dump containing all documents, the source metadata, and the relevance scores generated in the Topic Modeling and Relevance Scoring step.
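The join can be illustrated with a minimal sketch. It assumes that documents and sources share a `source_id` key and that scores are keyed by `doc_id`; these field names, like the function `join_records`, are hypothetical and not taken from the actual pipeline.

```python
def join_records(documents, sources, scores):
    """Attach source metadata and, where present, a relevance score to each document."""
    # Hypothetical field names: "source_id", "doc_id", "name".
    sources_by_id = {s["source_id"]: s for s in sources}
    joined = []
    for doc in documents:
        record = dict(doc)  # copy, so the input is left untouched
        source = sources_by_id.get(doc["source_id"], {})
        record["source_name"] = source.get("name")
        # Not every document received a score; None marks a missing one.
        record["relevance_score"] = scores.get(doc["doc_id"])
        joined.append(record)
    return joined


documents = [
    {"doc_id": 1, "source_id": "a", "text": "first"},
    {"doc_id": 2, "source_id": "b", "text": "second"},
]
sources = [
    {"source_id": "a", "name": "Source A"},
    {"source_id": "b", "name": "Source B"},
]
scores = {1: 0.83}  # document 2 has no score

joined = join_records(documents, sources, scores)
```

Keeping unscored documents with an explicit `None` rather than dropping them mirrors the fact that a score was not available for every document.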
Secondly, all documents were shuffled so that a random sample of documents could be drawn later.
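A reproducible shuffle of this kind could look as follows. The fixed seed and the function name `shuffle_documents` are assumptions made for illustration; the text does not state whether a seed was used.

```python
import random


def shuffle_documents(doc_ids, seed=42):
    """Return a shuffled copy of doc_ids; the seed (an assumption) makes it repeatable."""
    rng = random.Random(seed)   # private generator, global state is not touched
    shuffled = list(doc_ids)    # copy, so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled


doc_ids = list(range(10))
shuffled = shuffle_documents(doc_ids)
```

Because the order is fixed once, any later prefix or slice of `shuffled` is itself a random sample of the documents.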
Thirdly, the documents were binned by relevance score, provided that a topic model, and therefore a score, had been calculated for the document in the Topic Modeling and Relevance Scoring step. The documents were then sampled into slices of 2000 articles, keeping the random order generated in the second step. The relevance score was thus used to identify news items with a high probability of being relevant, while still ensuring that items were randomly selected from a larger pool. This is intended to keep the classification model from overfitting and to increase the recall of the classifier. A further description is given in section Online Coding.
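One possible reading of this binning and slicing step is sketched below. The bin edges, the choice to serve higher bins first, and the placement of unscored documents last are all assumptions; only the slice size of 2000 and the preservation of the shuffled order come from the description above.

```python
def assign_bin(score, edges=(0.25, 0.5, 0.75)):
    """Return a bin index (0 = lowest); None for documents without a score.

    The edges are hypothetical, not the thresholds used in the study.
    """
    if score is None:
        return None
    return sum(score >= edge for edge in edges)


def slice_by_bin(records, slice_size=2000):
    """Order records by bin (highest first, unscored last) and cut into slices."""
    def sort_key(record):
        bin_index = assign_bin(record["relevance_score"])
        return 1 if bin_index is None else -bin_index

    # sorted() is stable, so records in the same bin keep their shuffled order.
    ordered = sorted(records, key=sort_key)
    return [ordered[i:i + slice_size] for i in range(0, len(ordered), slice_size)]


# Records are assumed to be in shuffled order already.
records = [
    {"doc_id": 0, "relevance_score": 0.10},
    {"doc_id": 1, "relevance_score": 0.90},
    {"doc_id": 2, "relevance_score": None},
    {"doc_id": 3, "relevance_score": 0.60},
    {"doc_id": 4, "relevance_score": 0.95},
]
slices = slice_by_bin(records, slice_size=2)
```

The stable sort is what lets the score steer which items surface first without disturbing the random order inside each bin.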
After that, the items were packaged for online delivery in JSON format. Each JSON file contained 20 news items from a given source, in the order generated by the random shuffling and, if available, the binning results.
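The packaging could be sketched roughly as follows. The JSON layout, the file naming pattern, and the field `source_id` are hypothetical; the sketch returns JSON strings keyed by file name rather than writing files, and only the figure of 20 items per file comes from the text.

```python
import json
from collections import defaultdict


def package_for_delivery(records, items_per_file=20):
    """Group records by source and serialise them in chunks of items_per_file."""
    by_source = defaultdict(list)
    for record in records:  # iteration order (shuffled and binned) is kept
        by_source[record["source_id"]].append(record)

    packages = {}
    for source_id, items in by_source.items():
        for n, start in enumerate(range(0, len(items), items_per_file)):
            name = f"{source_id}_{n:04d}.json"  # hypothetical naming scheme
            payload = {"source": source_id, "items": items[start:start + items_per_file]}
            packages[name] = json.dumps(payload, ensure_ascii=False)
    return packages


records = [
    {"doc_id": i, "source_id": "a" if i % 2 == 0 else "b", "text": f"item {i}"}
    for i in range(50)
]
packages = package_for_delivery(records)
```

With 25 items per source in this toy input, each source yields one full file of 20 items and one partial file with the remaining 5.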
The JSON files were then loaded by the MedCon Relevance Coder (MRC) and presented to human coders, who judged the relevance of each document. The coding was done in a two-step process and followed the two-man rule. It is further described in section Online Coding.