PageRank in academic publishing

The standard measure scientists use to judge the importance of a scientific paper is a simple citation count. That is, how many other papers cite the paper in question? While this measure has its merits, it has one fundamental flaw: not all citations are equal. For example, if a paper I wrote receives a citation from a highly influential Nature paper, that should carry more weight than a citation from the New England Journal of Who Gives A Crap. So the scientific community needs a better measure, one that takes into account the importance of each citation. Numerous authors and bloggers have advocated using a PageRank-like index for quantifying the importance of papers or journals. In this article I’d like to throw my support behind this suggestion. I’ll begin by explaining the PageRank index and how it is calculated, and then discuss why this approach is superior to a simple citation count.

The PageRank index was developed by Larry Page and Sergey Brin as a way of ranking the relevance of web-pages for their Google search engine (incidentally, PageRank is named after Larry Page, not web-page). Here I’ll describe the PageRank algorithm in simple terms, ignoring many of the details. For a more detailed explanation I recommend the Wikipedia entry. The calculation begins by representing the web as a graph. A graph is a mathematical object consisting of a bunch of vertices (dots), arbitrarily connected by a bunch of edges (lines). So a graph can be thought of as a completed dot-to-dot game. Here’s an example of a graph that I pulled from Wikipedia.

To represent the web we use a directed graph, where the edges carry a direction. In the graph shown above the circles (i.e. vertices) represent web-sites, and the edges represent links between them. So, for example, ‘A’ might represent www.cnn.com, and ‘D’ might represent news.bbc.co.uk. In this instance, the directed edge from D to A means that there is a link on news.bbc.co.uk to www.cnn.com.

The goal of the PageRank algorithm is two-fold. We wish to construct a measure of relevance that reflects, first, how many incoming links a site has and, second, how important the sources of those links are. So each edge can be regarded as a ‘vote’ for a site, but the impact of the vote should be proportional to the importance of the site casting it.

The way PageRank calculates this is by iteratively applying ‘vote casting’ between sites. We begin by initializing every vertex with the same score, say 1. We then iterate through the vertices, and each site passes PageRank points to the sites it links to. The number of points cast along each link is the originating site’s PageRank divided by its total number of outgoing links. So the weight of the votes cast is proportional to the PageRank of the site casting them. For example, in the graph shown above, F would cast 3.9/2 votes to B and 3.9/2 votes to E, while C would cast its entire 34.3 votes to B. By repeating many such iterations, and renormalizing the scores after each one so that they don’t blow up, the PageRanks of the sites converge to constant values. What we are left with is a set of scores that reflects not just how many sites link to a given site, but also how important those linking sites are. [An alternate way of thinking about the PageRank index is that it can be represented as a coupled system of flow equations, where the ‘flow’ represents the flow of votes. The PageRanks are given by the steady state solution to this system of equations.]
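The vote-casting loop described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Google's implementation: the function name `pagerank`, the `links` dictionary, and the fixed `iterations` count are all made up for the example. The `damping` parameter is one of the details skipped above; it is part of the standard algorithm, gives every vertex a small baseline score, and helps the scores settle down.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Sketch of iterative vote casting.

    links: dict mapping each vertex to the list of vertices it links to
           (a hypothetical input format chosen for this example).
    """
    # Collect every vertex, including ones that only receive links.
    vertices = set(links)
    for targets in links.values():
        vertices.update(targets)
    n = len(vertices)

    # Every vertex starts with the same score.
    scores = {v: 1.0 for v in vertices}

    for _ in range(iterations):
        # Baseline score from the damping term (an assumption, see above).
        new_scores = {v: 1.0 - damping for v in vertices}
        for source in vertices:
            targets = links.get(source, [])
            if not targets:
                continue  # no outgoing links: this vertex casts no votes
            # Split the source's score equally among its outgoing links.
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        # Renormalize so the average score stays at 1.
        total = sum(new_scores.values())
        scores = {v: s * n / total for v, s in new_scores.items()}
    return scores


example_links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
example_scores = pagerank(example_links)
```

In this toy graph A receives votes from both C and D, so it should end up with the highest score, while D, which nobody links to, keeps only the baseline.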

So this is how PageRank evaluates the relative importance of web-sites. What about scientific papers? Well, scientific papers can be mapped to a graph in a similar way to web-sites. Specifically, vertices in the graph would represent papers, and a directed edge from one paper to another would represent a citation of the second paper by the first. The PageRank algorithm can then be applied out-of-the-box.

There are a few tweaks and variations that one could apply to a paper-based PageRank index. First of all, one could discount self-citations from the index (i.e. an anti-‘bombing’ mechanism). This isn’t possible with web-pages because the authorship of web-sites is typically not public. But with scientific papers there is always an authorship list attached to every paper. I think this tweak is an important one, since self-citations inevitably bias citation counts. For example, when I look back at the citation counts for my own papers I see that my very first paper has by far the most citations. 90% of them are self-citations 🙁 Should these citations really influence a credible measure of the importance of that paper? Probably not, since they are clearly biased. A second variation that one might try is to add a time bias when calculating the index, such that citations from more recent papers carry more weight than those from older papers. This might make a valuable secondary index, as it would reflect how important a paper is presently rather than historically. Again, this is something that is not possible with web-sites, since they are not time-stamped, but it is possible with papers.
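Both tweaks amount to attaching a weight to each citation before any votes are cast. Here is a rough sketch of how that might look. Everything in it is an assumption for illustration: the function name `citation_weights`, the `papers` and `citations` input formats, the choice to drop a citation entirely whenever the two papers share an author, and the use of a half-life to express the time bias.

```python
def citation_weights(papers, citations, current_year, half_life=None):
    """Return {(citing_id, cited_id): weight} with self-citations removed.

    papers:    {paper_id: {"authors": set of names, "year": int}}
    citations: iterable of (citing_id, cited_id) pairs
    half_life: if given, a citation's weight halves for every `half_life`
               years of age of the citing paper (hypothetical time bias).
    """
    weights = {}
    for citing, cited in citations:
        # Any shared author counts as a self-citation here (an assumption).
        if papers[citing]["authors"] & papers[cited]["authors"]:
            continue
        weight = 1.0
        if half_life is not None:
            age = current_year - papers[citing]["year"]
            weight = 0.5 ** (age / half_life)
        weights[(citing, cited)] = weight
    return weights


example_papers = {
    "p1": {"authors": {"Ann"}, "year": 2000},
    "p2": {"authors": {"Ann", "Bob"}, "year": 2005},
    "p3": {"authors": {"Cy"}, "year": 2010},
    "p4": {"authors": {"Dee"}, "year": 2020},
}
example_citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1")]
example_weights = citation_weights(
    example_papers, example_citations, current_year=2020, half_life=10
)
```

A weighted version of the vote casting could then split each paper's votes in proportion to these weights rather than equally. Whether a self-citation should be dropped outright or merely down-weighted is a design choice; the sketch takes the blunter option.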

In summary, a PageRank-based measure for the impact of scientific papers would address the problem that some citations are more valuable than others. It would more closely reflect the impact that a given paper has had on a field than a simple citation count. What’s left is for someone to set up a site calculating the PageRanks of papers and making them publicly available. Google?
