Tuesday, April 11, 2023

Recommendation systems for papers

The volume of papers being published and preprints being submitted to the arXiv has grown enormously since I started my PhD. Moreover, competing preprint platforms such as Optica Open have been launched. This deluge of papers is impossible to keep up with.

Therefore, there is growing interest in developing advanced bibliometric tools in order to stay abreast of the most important developments in one's own research field. There are two distinct approaches to solving this issue. 

The first is crowdsourcing. Websites such as scirate (founded by, and most widely used by, quantum physicists) and PubPeer (which seems popular in the life sciences) allow users to upvote and comment on preprints they find interesting. Preprints with more upvotes appear higher on the page and therefore receive more attention and more views. The idea is that the most important works will be upvoted more and will be seen by more people.
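
The ranking mechanism itself is very simple. Here is a minimal sketch in Python (the preprint titles and vote counts are invented purely for illustration):

```python
# Minimal sketch of upvote-based ranking (as on scirate): preprints are
# ordered by vote count, so highly upvoted items get the most visibility.
# The identifiers, titles, and counts below are made up for illustration.
preprints = [
    {"id": "2304.00001", "title": "Preprint A", "votes": 3},
    {"id": "2304.00002", "title": "Preprint B", "votes": 17},
    {"id": "2304.00003", "title": "Preprint C", "votes": 8},
]

front_page = sorted(preprints, key=lambda p: p["votes"], reverse=True)
for rank, p in enumerate(front_page, start=1):
    print(rank, p["votes"], p["title"])
```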

Other approaches are based on machine learning or bibliometric analysis, where a new paper is analyzed by a model that takes as its input various attributes of the paper, such as the authors, the topic, keywords, and its reference list. Models can be trained to pick out papers that are likely to be relevant or important and show them to the end user.
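
As a rough illustration of what such a model might look like, here is a toy sketch that scores papers by their title and abstract text, trained on papers a hypothetical user has previously marked as relevant or not. The training examples are made up; a real system would also exploit authors, keywords, and the reference list:

```python
# Hypothetical sketch of a feature-based relevance model: represent each paper
# by its title/abstract text and train a classifier on papers the user has
# previously labelled as relevant or not. Training data here is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Topological edge states in photonic lattices ...",
    "Deep learning for protein folding ...",
    "Nonlinear Thouless pumping of solitons ...",
    "A survey of blockchain consensus protocols ...",
]
labels = [1, 0, 1, 0]  # 1 = relevant to this user, 0 = not relevant

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Score a new preprint by its predicted probability of being relevant.
new_paper = "Edge solitons in nonlinear photonic topological insulators ..."
print(model.predict_proba([new_paper])[0, 1])
```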

One example now supported by arXiv is Litmaps, which constructs a graph that places an article in the context of previous and subsequent works. The visualisation can be customised to highlight different features, such as the publication date and total number of citations in the plot below. At least for this example, the "Seed" map seems biased towards highlighting review articles (missing hot recent results such as those on quantized Thouless pumping of solitons). The other visualisations offered are "Discover" (for finding overlooked papers) and "Map" (for telling a "research story"), but they require an account, and presumably a subscription for serious use.

Litmap of "Edge solitons in nonlinear photonic topological insulators"
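
Roughly speaking, the underlying data structure for this kind of tool is a directed citation graph around a seed article. The sketch below (using made-up placeholder identifiers, not Litmaps' actual implementation) shows the basic idea:

```python
# Sketch of a citation graph around a seed article: edges point from a paper
# to the works it cites, linking the seed to "previous" works (its references)
# and "subsequent" works (papers that cite it). Identifiers are placeholders.
import networkx as nx

seed = "Edge solitons in nonlinear photonic topological insulators"
references = ["Ref A", "Ref B", "Ref C"]            # works the seed cites
citing_papers = ["Later paper X", "Later paper Y"]  # works citing the seed

G = nx.DiGraph()
for ref in references:
    G.add_edge(seed, ref)      # seed -> earlier work it cites
for paper in citing_papers:
    G.add_edge(paper, seed)    # later work -> seed

# Node attributes such as publication year or citation count could then be
# used to size or colour nodes in a visualisation.
print(G.number_of_nodes(), G.number_of_edges())
```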

Each approach has its own pros and cons. One problem with the crowdsourcing approach is that it is sensitive to initial perturbations and tends to amplify already well-known authors while holding back less well-known ones. For example, people may upvote a paper just because it is written by familiar names and the title and abstract look interesting, leading to a winner-takes-all effect.
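
A toy simulation makes this feedback loop concrete: if each new reader is more likely to notice (and upvote) papers that already have many votes, then small early differences get amplified. The numbers below are purely illustrative:

```python
# Toy "rich get richer" simulation: the probability of a paper receiving the
# next upvote is proportional to its current vote count, mimicking the extra
# visibility that highly ranked papers enjoy. Purely illustrative numbers.
import random

random.seed(0)
votes = [1] * 10           # ten equally good papers, one initial vote each
for _ in range(1000):      # 1000 readers arrive one by one
    total = sum(votes)
    r = random.uniform(0, total)
    cumulative = 0
    for i, v in enumerate(votes):
        cumulative += v
        if r <= cumulative:
            votes[i] += 1
            break

# The final vote distribution is typically highly skewed, even though all
# papers started out identical.
print(sorted(votes, reverse=True))
```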

This, I think, is a particularly important problem in research. My own preference, at least, is not to work on ideas that are already very popular. To really make a deep breakthrough we need to see something that nobody else has noticed before, and this often involves drawing insights from papers that have been forgotten or overlooked by the wider community. In this context crowdsourcing risks a kind of groupthink in which the popularity of certain topics exceeds their promise, with people working on them simply because many other people are also working on them. Thus, the choices of a few early adopters or "academic influencers" end up getting amplified more and more.

The machine learning approach has the potential to give more thorough and systematic coverage of the literature, since it can analyze all new papers and surface the most interesting ones without this reliance on, and sensitivity to, initial fluctuations and early upvotes. While there is promise, of course, the big question is how one can develop a model that ranks individual papers without deeply studying the science they contain. And how much can you trust a proprietary, closed-source model whose inner workings and potential biases are unknown?

For example, a crude first approximation might involve analyzing the references of a new paper to see what is cited. If a new paper cites important previous results (importance being estimated by how often they have been cited), then hopefully the paper will be worth reading. However, simply counting raw citations or references is prone to bias. Different fields have different standards as to what is and should be cited. In some fields you will now see paper introductions that cite dozens or even hundreds of papers. In this case the value of an individual citation is relatively low, so looking at the citations alone won't give much information about what the paper is roughly about or whether it is worth reading.
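
In code, this crude first pass amounts to little more than summing the citation counts of a paper's references; the counts below are invented, and in practice they would come from a bibliographic database:

```python
# Crude first-pass scoring sketch: rate a new paper by how heavily cited its
# references are. Citation counts here are invented; as discussed above, this
# is easily biased by field-dependent citation habits and long reference lists.
reference_citation_counts = [1200, 45, 3, 310, 8]  # citations of each cited work

raw_score = sum(reference_citation_counts)
# Normalising by the number of references reduces the advantage of papers
# that simply cite a very long list of works.
normalised_score = raw_score / len(reference_citation_counts)

print(raw_score, normalised_score)
```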

It seems therefore that more sophisticated approaches are necessary. The context of a citation is important: papers cited in bulk carry little signal. For example, an introductory phrase like "x has received a lot of interest lately [1-103]" only reveals that x is a hot topic. On the other hand, a sentence like "We apply the method of Ref. [16]..." tells us that there is probably a very close connection to whatever Ref. [16] is about. Thus, the integration of large language models such as BioGPT with paper recommendation systems is likely to improve their performance, thereby greatly improving the productivity of researchers who use them.
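
A very crude way to encode this intuition, without an LLM, is to weight each citation by its context: dilute citations made in bulk and boost those whose surrounding sentence signals a methodological dependence. The contexts and keyword heuristic below are made up, and a language model could replace this rule with something far more nuanced:

```python
# Sketch of context-aware citation weighting: bulk citations (many references
# in one bracket) get a low weight, while citations whose sentence suggests a
# close methodological link get a high weight. Contexts and keywords invented.
citation_contexts = [
    ("x has received a lot of interest lately", 103),  # cited with 102 others
    ("We apply the method of Ref.", 1),                # cited on its own
    ("following the approach introduced in", 2),
]

def citation_weight(sentence, bulk_size):
    weight = 1.0 / bulk_size          # bulk citations are diluted
    if any(k in sentence for k in ("method", "approach", "following")):
        weight *= 5.0                 # strong methodological link
    return weight

for sentence, bulk_size in citation_contexts:
    print(round(citation_weight(sentence, bulk_size), 3), sentence)
```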

What do you think? Are there any other tricks for keeping up with the literature in your field?
