Rewarding replication in science

I recently attended a workshop on open science. Open science is about making scientific processes more transparent, data freely available, and papers viewable by all. Among the many potential benefits of an open science model is increased confidence in scientific findings. This is particularly relevant in the midst of the ongoing replication crisis in many fields. The crux of the replication crisis is that many famous findings turn out to be very difficult to replicate. This raises the worrying implication that the original findings were just noise or were methodologically unsound.

Researchers are adopting at least two complementary strategies to deal with the problem. First, methodologies are being tightened up. Second, there is an active effort to increase the number of replication studies.

Ahead of the workshop, I spent some time with David Hull’s Science as a Process. I think Hull’s model of incentives in science can shed some light on how we got here, and it also suggests some possible paths out. The rest of this post lays out Hull’s model, speculates on how it can explain the emergence of the replication crisis, and offers two suggestions (one modest, one ambitious) for increasing the supply of replication studies.

Hull’s model of scientific incentives

Hull, a philosopher of science, developed his ideas after close observation of a small scientific community (taxonomists and cladists). These scientists did not behave like disinterested rational truth-seekers:

As it turns out, the least productive scientists tend to behave the most admirably, while those that make the greatest contributions just as frequently behave the most deplorably. [p.32]

For Hull, humans everywhere desire status, and scientists are no exception. He sees the genius of science in redirecting this selfish desire for status into the creation of new knowledge. It does so by creating a “market” for ideas.

The currency of this market is citation (or more broadly, use and influence). To get “rich” you need to generate knowledge that is used (i.e., cited) by your peers. In doing so, you gain status in the community. But the only way to generate new knowledge is to “buy” support from other knowledge by citing it.

For this to work, there has to be some kind of property right in the creation of knowledge. This is achieved by the priority system: whoever publishes first owns the knowledge. This doesn’t mean they can prevent others from accessing it. Indeed, the priority system enforces prompt disclosure. But it does mean subsequent citations of the knowledge accrue to its original discoverer.

As Dasgupta and David (1994) point out, the priority system has some attractive features. Knowledge can’t be “unveiled” twice. Once you have it, you have it. The priority system nudges scientists to work on unsolved problems, because you get no credit for re-proving known results. There’s also a practical reason for the priority system. In any system where credit is assigned to people who are not first, it would be possible to get some credit by reading the first publication, pretending you had the same idea, and writing a paper of your own.

For example, suppose we split credit for discovery among all scientists working on a problem. After all, many scientists are frequently working on the same problem, and assigning all the credit to whoever finishes first gives a lot of weight to mere luck. However, such a system could be gamed by fast-follower scientists. Once an idea has been published, anyone can read it and have the idea. What’s to stop them from claiming they were working on it at the same time? (Remember, Hull does not assume scientists are paragons of virtue.)

There’s a second attractive feature of Hull’s model. People are not very good at seeking out information that challenges their own beliefs, but they are pretty good at pointing out flaws in others’ beliefs. Scientists are no different: they are poorly suited to looking for holes and mistakes in their own work. But the citation market in science separates the tasks of generating and testing knowledge.

Why would a scientist verify someone else’s ideas? First, scientists can get their own knowledge more broadly used by eliminating the competition, which requires tearing down rival theories and findings. Even human foibles like grudges and personal animosity can be leveraged into doing useful work in this model:

“…I think that the function of personal animosity in science is still an open question. It might after all serve a variety of purposes…  Scientists acknowledge that among their motivations are natural curiosity, the love of truth, and the desire to help humanity, but other inducements exist as well, and one of them is to “get that son of a bitch.”  [p. 160]

Second, if you want to generate knowledge that is used, you can’t build on foundations of sand. There is therefore an incentive to check the work you build on.

Bubbles in Science

This system has, on the whole, worked really well. We know more about the world than ever before, and we learned most of it during the few hundred years that we have had science. But the system is not perfect, and the weak incentives for replication are one shortcoming.

Replication work isn’t free. It takes a lot of time and effort to replicate a study, in some cases nearly as much work as the original study (there’s a reason replication is not part of the peer review process). I suspect the cost of research in general has gone up over the past century (a topic for another post), and the costs of replication have probably gone up alongside it.

Meanwhile, what is the benefit of a replication? Or, in Hull’s parlance, how does replication work enhance your status in the community? The benefits of replication are a bit more complicated than the costs, because they can be direct or indirect. Let’s consider each in turn.

First, replication studies might be directly used and cited. However, because of the priority system, little credit is assigned to successful replications of earlier work. While some might cite the original finding and note it was successfully replicated, not all will. And very few will cite the replication without the original finding.

A replication that challenges the original finding will probably get more attention than a successful replication, if only because it’s a more surprising result. But I worry failed replications won’t be cited as much as they should be. To the extent they are evidence against particular theories, they (and the now-discredited original finding) are less likely to be cited by work within those theories. They may be cited by proponents of rival theories seeking to discredit the competition, but I suspect this only occurs while a field is contested. Once one theory is ascendant, harping on the null results of the defeated theory seems at best a waste of time and at worst punching down. In contrast, a positive original finding might be cited for decades or more. (As an aside, the use of null results more generally seems likely to be influenced by similar issues.)

Second, replication work indirectly affects citations through its impact on what knowledge gets used. A failed replication of finding x redirects subsequent citations away from x and potentially towards rival ideas. A successful replication of finding x indicates x is sturdy enough to build on and encourages citation of x. In either case, the benefit is only realized by the replicator if they “own” rival ideas to x (in the case of a failed replication) or if they own x or related ideas (in the case of a successful replication).

Compared to original research, Hull’s model seems to undervalue replication work. A replication that takes nearly as much work as an original finding, for example, is unlikely to get as many citations as that original finding. For any activity, when costs are high relative to benefits, people gravitate towards substitutes. What are the substitutes for replication work? One is the reputation of the scientist: those who have done good work in the past will be trusted without as much verification in the future. Another is to free-ride on the judgment of the community: if at least some people are still doing independent verification, then highly cited work is more trustworthy. Both substitutes for replication contribute to a Matthew effect. The rich get richer.

These kinds of dynamics seem prone to bubbles. Suppose a paper by a famous author is mistaken, but this is never discovered because replication would be too much work. Instead, the paper is cited based on the strength of the author’s reputation. As it accrues more citations, other scientists interpret this as an endorsement by the community. Its citations are further increased. This in turn raises the profile of the original author. Their subsequent work (likely in the same area) is now more likely to be cited without verification. In this way, an entire research edifice may be built on sand.
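This rich-get-richer dynamic can be illustrated with a toy preferential-attachment simulation. The setup below (ten papers, a thousand citations, a linear attachment rule) is entirely my own illustrative assumption, not something from Hull; it just shows how unequal citation counts become when new citations follow old ones without independent verification.

```python
import random

random.seed(0)  # fixed seed for reproducibility
papers = [1] * 10  # ten papers, each starting with one citation
for _ in range(1000):
    # each new citation goes to paper i with probability proportional
    # to its current citation count -- no verification step anywhere
    winner = random.choices(range(len(papers)), weights=papers)[0]
    papers[winner] += 1

print(sorted(papers, reverse=True))
```

Running this typically produces a very skewed distribution: a couple of papers absorb most of the citations, regardless of their underlying quality.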

When the bubble pops, you get things like the replication crisis.

Existing proposals to increase replication

So what to do? There is currently a big push underway to increase the supply of replications in science. The open science movement helps by reducing the cost of replication: it makes the research process more transparent and makes the original finding’s data freely available.

Other proposals I’ve heard try to make replications easier to publish. One proposal is for journals to accept or reject papers based on pre-registered research plans rather than results. Journals favor “surprising” results, which makes them loath to accept both null results and successful replications. The proposal would force journals to accept or reject based on a methodological plan, before the outcome is known. Another idea, recently implemented, is a journal that accepts replications that were submitted to good journals but rejected because the results were not surprising. To make the process painless, the journal accepts the peer review reports from the rejecting journal. A third proposal would commit journals to paying for replications of a random sample of accepted papers. Other proposals would integrate replications into graduate school training.

A modest proposal

I see no problem with trying these. But Hull’s model suggests the root of the problem is that replication work does not confer status. This is because it is not likely to be cited. My modest proposal is to increase the number of citations to replications by empowering journal editors to add them.

There is already precedent for this in an adjacent knowledge-creation field. Patents also have to cite relevant “prior art,” whether in the form of other patents or not. Patent examiners can and do add citations to patents (omitted by the applicant) if they feel the citation is relevant. Today, these examiner-added citations are indicated on US patents with an asterisk, but they are in every other way treated as “normal” citations.

The exact format is not important, but I imagine it could look something like Figure 1: replication studies are listed below the original finding, indented and perhaps accompanied by an asterisk (to indicate they were added by the editor).

Figure 1. Possible editor-added replication citation format

Would these citations confer status? I think so. Just as citations to original findings bolster a paper’s support, so too do citations to replications of those findings. Moreover, just like every other citation in the paper, they would credit the replication author by name. Furthermore, to the extent that a scientist’s career is summarized by various citation metrics (h-index, total citations, Euclidean norm), these citations would “count” just as much as the rest. And finally, to the extent that journals chase citations too, an increase in citations to replications would increase their willingness to accept replications for publication.

I think this would help. Over time, if enough replication happens, a new norm may emerge in science wherein replications are cited without input from journal editors (encouraging such a norm has been suggested by Coffman, Niederle and Wilson). But it’s not a perfect system either. Which brings me to a second proposal.

A not-so-modest proposal

One problem with the above is that it creates scope for replicators to “free-ride” on highly cited papers. Think, for example, of hordes of graduate students running the same code on the same data, both made available by the original authors of the finding. You could end up with dozens of “replications” that add little value. This issue could be mitigated by editor discretion and by the difficulty of judging in advance which papers will be highly cited. But a bigger problem is that the above solution does little to address replications that challenge the original finding. These failed replications often push scientists into completely different research areas, leaving no papers to add citations to.

A more ambitious proposal is to try to estimate the marginal impact of a replication study on the original finding’s citations. For clarity, suppose a paper’s citations can be expressed by a function c(f,t), where f is the set of paper features that impact citation and t is the probability that the paper is “true.” This function is estimated empirically, with t possibly corresponding to the probability a replication effort matches the original finding. I assume c(f,t) is increasing in t.

The value of a replication study is given by:

replication value = | c(f,t’) – c(f,t) |

where t’ is the updated probability of truth after a replication (updated using Bayes’ rule). The replication value is the absolute value of the change in the original finding’s citations induced by the replication.

This formula has a number of nice properties. Successful replications raise t, and their value is equal to the increase in citations associated with the rise in t. Conversely, when the replication challenges the original finding, the original finding receives fewer citations and the replication is rewarded for directing research away from the area. In either case, the value is larger for original findings with feature sets f such that they would be highly cited if true.
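To make the mechanics concrete, here is a minimal sketch of the calculation. Everything numeric here is an illustrative assumption: the toy citation function c(f, t) = f · t, a prior t of 0.5, and likelihoods of 0.9 (a true finding replicates) and 0.2 (a false one does).

```python
def bayes_update(t, replicated, p_rep_if_true=0.9, p_rep_if_false=0.2):
    """Posterior probability the finding is true after one replication attempt.
    The likelihood parameters are illustrative assumptions."""
    if replicated:
        num = p_rep_if_true * t
        den = p_rep_if_true * t + p_rep_if_false * (1 - t)
    else:
        num = (1 - p_rep_if_true) * t
        den = (1 - p_rep_if_true) * t + (1 - p_rep_if_false) * (1 - t)
    return num / den

def replication_value(c, f, t, t_prime):
    """Absolute change in the original finding's citations: |c(f,t') - c(f,t)|."""
    return abs(c(f, t_prime) - c(f, t))

# toy citation function: expected citations scale linearly with truth probability
c = lambda f, t: f * t

t = 0.5
t_success = bayes_update(t, True)   # ≈ 0.818
t_failure = bayes_update(t, False)  # ≈ 0.111
print(replication_value(c, 100, t, t_success))  # ≈ 31.8
print(replication_value(c, 100, t, t_failure))  # ≈ 38.9
```

Note that in this toy setup the failed replication earns slightly more than the successful one, because it moves t further from the prior, which matches the intuition that surprising results carry more information.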

Second, if we use Bayes’ rule to update our estimate of truth, early replications move t a lot, and are rewarded accordingly (a sort of generalized priority system). It may also be possible to design ways of incorporating the quality of the replication (for example, bigger samples might count as better evidence), so that high quality ones have a bigger impact on t’. Along the same lines, if we can measure the correlation between different replication efforts, this could also be incorporated. Replications using the same dataset will likely have outcomes highly correlated with the original finding, and therefore provide less evidence (they move t’ less) than those that gather new data. All of this would serve to nudge scientists towards the most valuable replication work.
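The “early replications move t a lot” property falls directly out of Bayesian updating in odds form. A small sketch, where the likelihood ratio of 4.5 for a successful replication is an arbitrary assumption chosen for illustration:

```python
def update(t, likelihood_ratio):
    """Posterior probability of truth after one successful replication,
    via posterior odds = prior odds * likelihood ratio."""
    odds = t / (1 - t) * likelihood_ratio
    return odds / (1 + odds)

t = 0.5
for n in range(1, 5):
    t_new = update(t, 4.5)
    print(f"replication {n}: t moves by {t_new - t:.3f}")
    t = t_new
# successive moves shrink: ≈ 0.318, 0.135, 0.036, 0.008
```

Each additional successful replication moves t less than the last, so under the replication-value formula the first replicator captures most of the reward.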

Would these replication value scores have any impact on scientist status? I admit this is less clear. When c(f,t’) – c(f,t) > 0, the replication value could be used in conjunction with the first proposal to decide which replications to cite on any paper. In this way, it would really “get names on papers.” A more challenging case is when c(f,t’) – c(f,t) < 0. In this case, authors may be hostile to having negative evidence appended to their citations. Moreover, there may not be any papers citing the original finding to which the replication could be attached!

At a minimum, however, the replication value could be reported alongside other citation-based indicators. Moreover, although I have not emphasized it in this post, citations are not the only way to obtain status in science. Professional recognition can take many forms: promotions, prizes, fellowships, appointment to the leadership of professional societies, etc. Replication value could be used to decide who gets recognized.

However, the main advantage of this approach is that it can be unilaterally implemented. There is nothing to stop a motivated individual with access to the right citation data from estimating their own c(f,t) and posting the replication values of different studies on the web. The main difficulty is probably in coming up with an estimate of c(f,t) that is convincing to the research community.

But that’s why I called this proposal not-so-modest.
