Funding: HAP was supported by NLM Training Grant Number 5T15-LM007059-19. The NIH had no role in study design, data collection or analysis, writing the paper, or the decision to submit it for publication.

The publication contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Sharing information facilitates science. In addition to being used to confirm original results, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets.

Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection. Believing that these benefits outweigh the costs of sharing research data, many initiatives actively encourage investigators to make their data available.

Since 2003, the NIH has required a data sharing plan for all large funding grants. The growing open-access publishing movement will perhaps increase peer pressure to share data. However, while the larger research community benefits from shared data, much of the burden for sharing the data falls to the study investigator. Are there benefits for the investigators themselves? A currency of value to many investigators is the number of times their publications are cited.

Boosting citation rate is thus a potentially important motivator for publication authors. In this study, we explored the relationship between the citation rate of a publication and whether its data was made publicly available.

Using cancer microarray clinical trials, we investigated the following questions: Do trials which share their microarray data receive more citations? Is this true even within lower-profile trials? What other data-sharing variables are associated with an increased citation rate? While this study is not able to investigate causation, quantifying associations is a valuable first step in understanding these relationships.

Clinical microarray data provides a useful environment for the investigation: despite being valuable for reuse and extremely costly to collect, it is not yet universally shared.

The internet locations of the datasets are listed in Supplementary Text S2. The majority of datasets were made available concurrently with the trial publication, as illustrated within the WayBackMachine internet archives (www.archive.org). As seen in Table 1, trials published in high impact journals, prior to 2001, or with US authors were more likely to share their data. The 41 clinical trial publications which publicly shared their microarray data received more citations, in general, than the 44 publications which did not share their microarray data.

In this plot of the distribution of citation counts received by each publication, the extent of the box encompasses the interquartile range of the citation counts, whiskers extend to 1.

Detailed results of this multivariate linear regression are given in Table 2. We define papers published after the year 2000 in journals with an impact factor less than 25 as lower-profile publications. The distribution of the citations by data sharing in this subset is shown in Figure 2. For trials which were published after 2000 and in journals with an impact factor less than 25, the clinical trial publications which publicly shared their microarray data received more citations, in general, than the 43 publications which did not share their microarray data.
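The lower-profile definition above can be made concrete with a minimal sketch. The records, field order, and function names below are hypothetical and invented purely for illustration; they are not the study's data or its analysis code, and the median is used here only as a simple summary of each group.

```python
from statistics import median

# Hypothetical records: (publication year, journal impact factor,
# shared data?, citation count). Values are invented for illustration
# and are NOT taken from the study.
trials = [
    (1999, 30.0, True, 120),
    (2002, 10.5, True, 40),
    (2003, 8.2, False, 15),
    (2004, 26.1, True, 90),
    (2004, 5.0, False, 22),
    (2003, 12.0, True, 33),
]


def is_lower_profile(year, impact_factor):
    """Published after 2000 in a journal with an impact factor below 25."""
    return year > 2000 and impact_factor < 25


def median_citations_by_sharing(records):
    """Return {shared_flag: median citations} for the lower-profile subset."""
    groups = {True: [], False: []}
    for year, impact_factor, shared, citations in records:
        if is_lower_profile(year, impact_factor):
            groups[shared].append(citations)
    # Skip any group left empty after filtering.
    return {flag: median(counts) for flag, counts in groups.items() if counts}


result = median_citations_by_sharing(trials)
print(result)
```

A comparison of this kind describes only an association within the subset; it does not adjust for covariates the way the multivariate regression does.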

The number of patients in a trial and a clinical endpoint correlated with increased citation rate. However, the choice of platform was insignificant and only those trials located in SMD showed a weak trend of increased citations. In fact, the 6 trials with data in GEO (in addition to other locations for 4 of the 6) actually showed an inverse relationship to citation rate, though we hesitate to read much into this due to the small number of trials in this set. The few trials in this cohort which, in addition to gene expression fold-change or other preprocessed information, shared their raw probe data or actual microarray images did not receive additional citations.

Finally, although finding diverse microarray datasets online is non-trivial, an increase in citations was not noted for trials which mentioned their Supplementary Material within the abstract, nor for those trials with datasets identified by a centralized, established data mining website.

Perhaps with a larger and more balanced sample of trials with shared data these trends would be clearer. This result held even for lower-profile publications and thus is relevant to authors of all trials.

A parallel can be drawn between making study data publicly available and publishing a paper itself in an open-access journal. We note an important limitation of this study: the demonstrated association does not imply causation. Receiving many citations and sharing data may stem from a common cause rather than being directly causally related. Nonetheless, if we speculate for a moment that some or all of the association is indeed causal, we can hypothesize several mechanisms by which making data available may increase citations.

The simplest mechanism is due to increased exposure: listing the dataset in databases and on websites will increase the number of people who encounter the publication. Finally, these re-analyses may spur enthusiasm and synergy around a specific research question, indirectly focusing publications and increasing the citation rate of all participants.



