As the COVID-19 pandemic grinds on, vaccinations are top of mind. A recent article published in JAMA Network Open examined whether US vaccination clinical trials over the last decade adequately represented various demographic groups. According to the authors, the results suggested they did not: “among US-based vaccine clinical trials, members of racial/ethnic minority groups and older adults were underrepresented, whereas female adults were overrepresented.” The authors concluded that “diversity enrollment targets should be included for all vaccine trials targeting epidemiologically important infections.”
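The kind of representation gap the JAMA study describes can be made concrete with a simple calculation: compare each group’s share of trial enrollment against its share of the affected population. The sketch below is illustrative only; the function name, thresholds, and all numbers are hypothetical and are not drawn from the study.

```python
# A minimal sketch of a representation check: the ratio of a group's
# share among trial participants to its share of the affected
# population. All figures below are hypothetical, for illustration.

def representation_ratio(enrolled_share, population_share):
    """Return the ratio of a group's enrollment share to its
    population share; values well below 1.0 suggest
    underrepresentation, values well above 1.0 overrepresentation."""
    if population_share <= 0:
        raise ValueError("population share must be positive")
    return enrolled_share / population_share

# Hypothetical (enrolled share, population share) pairs.
groups = {
    "Group A": (0.10, 0.18),
    "Group B": (0.62, 0.50),
    "Group C": (0.08, 0.15),
}

for name, (enrolled, population) in groups.items():
    r = representation_ratio(enrolled, population)
    label = ("underrepresented" if r < 0.8
             else "overrepresented" if r > 1.2
             else "roughly proportional")
    print(f"{name}: ratio {r:.2f} ({label})")
```

A ratio-based check like this is only a starting point; real enrollment benchmarks depend on the epidemiology of the condition being studied, not just general population shares.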
My colleague Rebecca Bryant and I recently enjoyed an interesting and thought-provoking conversation with Dr. Tiffany Grant, Assistant Director for Research and Informatics with the University of Cincinnati Libraries (an OCLC Research Library Partnership member) on the topic of bias in research data. Dr. Grant neatly summed up the issue by observing that data collected should be inclusive of all the groups who are impacted by outcomes. As the JAMA article illustrates, that is clearly not always the case – and the consequences can be significant for decision- and policy-making in critical areas like health care.
The issue of bias in research data has been acknowledged for some time; for example, the Human Genome Project, launched in 1990 and completed in the early 2000s, helped raise awareness of the problem, as did observed differences in health care outcomes across demographic groups. And efforts are underway to help remedy some of the gaps. One initiative, the US National Institutes of Health’s All of Us Research Program, aims to build a database of health data collected from a diverse cohort of at least one million participants. The rationale for the project is clearly laid out: “To develop individualized plans for disease prevention and treatment, researchers need more data about the differences that make each of us unique. Having a diverse group of participants can lead to important breakthroughs. These discoveries may help make health care better for everyone.”
Extrapolation of findings observed in one group to all other groups often leads to poor inferences, and researchers should take this into account when designing data collection strategies. The peer review process should act as a filter for identifying research studies that overlook this point in their design – but how well is it working? As in many other aspects of our work and social lives, unconscious bias may play a role here: lack of awareness of the problem on the part of reviewers means that studies with flawed research designs may slip through.
And that leads us to what Dr. Grant believes is the principal remedy for the problem of bias in research data: education. Researchers need training that helps them recognize potential sources of bias in data collection, as well as understand the implications of bias for interpretation and generalization of their findings. The first step in solving a problem is to recognize that there is a problem. Some disciplines are further along than others in addressing bias in research data, but in Dr. Grant’s view, there is still ample scope for raising awareness across campus about this topic.
Academic libraries can help with this, by providing workshops and training programs, and gathering relevant information resources. At the University of Cincinnati, librarians are often embedded in research teams, providing an excellent opportunity to share their expertise on this issue. Raising awareness about bias in research data is also an opportunity to partner with other campus units, such as the office of research, colleges/schools, and research institutes (for more information on how to develop and sustain cross-campus partnerships around research support services see our recent OCLC Research report on social interoperability).
Many institutions are currently implementing equity, diversity, and inclusion (EDI) training, and modules addressing bias in research data might be introduced as part of EDI curricula for researchers. This could also be an area of focus for professional development programs supporting doctoral, postdoctoral, and other early-career researchers. Many EDI initiatives focus on issues related to personal interactions or on recruiting more members of underrepresented groups into the field. For researchers, it may be useful to supplement this training with additional programs that focus on EDI issues as they specifically relate to the responsible conduct of research. In other words, how do EDI-related issues manifest in the research process, and how can researchers effectively address them? A great example is the training offered by We All Count, a project aimed at increasing equity in data science.
Funders can also contribute toward mitigating bias in research data, by issuing research design guidelines on inclusion of underrepresented groups, and by establishing criteria for scoring grant proposals on the basis of how well these guidelines are addressed. The big “carrots and sticks” wielded by funders are a powerful tool for both raising awareness and shifting behaviors.
Bias in research data extends to bias in research data management (RDM). Inequitable access to, and ability to use, archived data sets is another form of bias. While it is good to mandate that data sets be archived under “open” conditions, as many funders already do, the spirit of the mandate is compromised if the data sets are deposited in systems that are not accessible to and usable by everyone. It is important to recognize that the risk of introducing bias into research data exists throughout the research lifecycle, including curation activities such as data storage, description, and preservation.
Our conversation focused on bias in research data in STEM fields – particularly medicine – but the issue also deserves attention in the context of the social sciences, as well as the arts and humanities. Our summary here highlights just a sample of the topics worthy of discussion in this area, with much to unpack in each one. We are grateful to Dr. Grant for starting a conversation with us on this important issue and look forward to continuing it in the future as part of our ongoing work on RDM and other forms of research support services.
Like so many other organizations, OCLC is reflecting on equity, diversity, and inclusion, as well as taking action. Check out an overview of that work, and explore efforts being undertaken in OCLC’s Membership and Research Division. Thanks to Tiffany Grant, Rebecca Bryant, and Merrilee Proffitt for providing helpful suggestions that improved this post!
Brian Lavoie is a Research Scientist in OCLC Research. He has worked on projects in many areas, such as digital preservation, cooperative print management, and data-mining of bibliographic resources. He was a co-founder of the working group that developed the PREMIS Data Dictionary for preservation metadata, and served as co-chair of a US National Science Foundation blue-ribbon task force on economically sustainable digital preservation. Brian’s academic background is in economics; he has a Ph.D. in agricultural economics. Brian’s current research interests include stewardship of the evolving scholarly record, analysis of collective collections, and the system-wide organization of library resources.