Thematic Classification

This section gives an overview of the thematic classification steps used to produce the final sample of relevant documents.

The following sub-sections describe the online coding tool that was implemented to facilitate the manual coding process, the selection protocol on which the manual coding decisions were based, the reliabilities of the manual coding, and an evaluation of the classifier's performance.

Process Overview

Figure: An overview of the steps in the classification process.

Process Steps

  1. Load Data

    The first step was to load and join three inputs: the data dump containing all documents, the source metadata, and the relevance scores generated in the Topic Modeling and Relevance Scoring step.

  2. Shuffle

    Secondly, all documents were shuffled so that a random sample of documents could be drawn later.

  3. Binning

    Thirdly, the documents were binned by their relevance score, provided that a topic model, and therefore a score, had been calculated for the document in the Topic Modeling and Relevance Scoring step. The documents were then divided into slices of 2000 articles, keeping the random order generated in step 2. The relevance score was thus used to identify news items with a high probability of being relevant, while still ensuring that items were randomly selected from a larger pool. This is intended to guard the classification model against overfitting and to increase the recall of the classifier. A further description is given in section Online Coding.

  4. Package for Online Delivery

    After that, the items were packaged for online delivery in JSON format. Each JSON file contained 20 news items from a single source, in the order generated by the random shuffling and, if available, the binning results.

  5. Code in Online Interface

    The JSON files were then loaded by the MedCon Relevance Coder (MRC) and presented to human coders, who judged the relevance of each document. The coding was done in a two-step process and followed the two-man rule. It is further described in section Online Coding.
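
Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the project's actual implementation. The field names (`id`, `source`, `score`), the bin edges, the seed, and the function names are all assumptions made for illustration. Only the slice size of 2000 and the package size of 20 are taken from the description above.

```python
import json
import random
from collections import defaultdict


def assign_bin(score, edges=(0.33, 0.66)):
    """Map a relevance score to a bin label (the edges are hypothetical)."""
    if score is None:
        # No topic model, and therefore no score, exists for this document.
        return "unscored"
    return "bin%d" % sum(score >= edge for edge in edges)


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def package_documents(docs, slice_size=2000, package_size=20, seed=42):
    """Shuffle, bin, slice, and package documents as JSON strings."""
    # Step 2: shuffle a copy with a fixed seed so the order is reproducible.
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)

    # Step 3: group by source and bin, preserving the shuffled order.
    groups = defaultdict(list)
    for doc in shuffled:
        groups[(doc["source"], assign_bin(doc.get("score")))].append(doc)

    # Steps 3 and 4: cut each group into slices, then into small packages.
    packages = []
    for (source, bin_label), members in groups.items():
        for slice_no, slice_docs in enumerate(chunk(members, slice_size)):
            for pkg_no, items in enumerate(chunk(slice_docs, package_size)):
                packages.append(json.dumps({
                    "source": source,
                    "bin": bin_label,
                    "slice": slice_no,
                    "package": pkg_no,
                    "items": items,
                }))
    return packages
```

Calling `package_documents(docs)` on a list of dictionaries returns one JSON string per package, each holding at most 20 items from a single source. Grouping after the shuffle, rather than sorting by score, keeps the random order within each bin, which is the property the binning step relies on.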