A large team has examined millions of biomedical documents to see how various text similarity methods cluster the articles. These techniques, grouped under the loose banner of machine learning, look at which words appear together in an article, how frequently each word occurs, and more, to build a rich picture of how documents relate to each other. After downloading more than two million documents from MEDLINE, the team tested how PubMed's built-in related-article methodology compares with a number of other machine learning techniques. The analysis, titled "Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches," was published this month in PLoS ONE.
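To make the idea of word-based similarity concrete, here is a minimal sketch of one common approach: TF-IDF weighting with cosine similarity. It is not the paper's pipeline and not PubMed's related-article algorithm. The three "abstracts" and the function names `tfidf_vectors` and `cosine` are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency; a word found in
        # every document gets weight log(1) = 0 and so carries no signal.
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for abstracts (hypothetical, not taken from MEDLINE).
docs = [
    "gene expression in breast cancer tumor cells",
    "tumor suppressor gene mutation in cancer",
    "cardiac surgery outcomes in elderly patients",
]
vecs = tfidf_vectors(docs)
sim_cancer = cosine(vecs[0], vecs[1])     # two cancer-genetics texts
sim_unrelated = cosine(vecs[0], vecs[2])  # cancer text vs. cardiac text
print(sim_cancer > sim_unrelated)
```

The two cancer-related texts share several informative words, so they score higher than the unrelated pair, whose only common word ("in") appears in every document and is weighted to zero. Clustering millions of articles builds on pairwise scores like these, though real systems use far more careful tokenization and weighting.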
While the result, that PubMed is the best, is gratifying if not entirely earth-shattering, the article includes a fun figure showing the natural document clusters that jump out from the analysis:
These groupings were made by inspection of the 29,000 clusters that the automated methodology found. It’s nice when machine learning yields clear meaning.