That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by John Riemer of UCLA. With increasing expectations that research data creation made possible through grant funding will be archived and made available to others, many institutions are becoming aware of the need to collect and curate this new scholarly resource. To maximize the chances that metadata for research data are shareable (that is, sufficiently comparable) and helpful to those considering re-using the data, our communities would benefit from sharing ideas and discussing plans to meet emerging discovery needs. OCLC Research Scientist Ixchel Faniel’s two-part blog entry “Data Management and Curation in 21st Century Archives” (Sept 2015) provided useful background to this discussion.
The discussions revealed a wide range of experiences, from those just encountering researchers who come to them with requests to archive and preserve their research data to those who have been handling research data for some years. National contexts differ. For example, our Australian colleagues can take advantage of Australia’s National Computational Infrastructure for big data and the Australian Data Archive for the social sciences. Canada is developing a national network called Portage for the “shared stewardship of research data”.
The US-based metadata managers were split about whether to have a single repository for all data or a separate repository for research data, although there seems to be a movement to separate data that is to be re-used (providing some capacity for computing on it) from data that is only to be stored. A number of fields have a discipline-based repository, or researchers take advantage of a third-party service such as DataCite, also used for discovery. The library can fill the gap for research data that lacks a better home.
The recently published Building Blocks: Laying the Foundation for a Research Data Management Program includes a section on metadata:
Datasets are useful only when they can be understood. Encourage researchers to provide structured information about their data, providing context and meaning and allowing others to find, use and properly cite the data. At minimum, advise researchers to clearly tell the story of how they gathered and used the data and for what purpose. This information is best placed in a readme.txt file that includes project information and project-level metadata, as well as metadata about the data itself (e.g., file names, file formats and software used, title, author, date, funder, copyright holder, description, keywords, observation unit, kind of data, type of data and language).
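The project-level fields recommended above can be captured in a structured way even within a plain readme.txt. Here is a minimal Python sketch of that idea; the field names follow the list above, but the values (dataset title, author, and so on) are purely illustrative, and this is not a prescribed template:

```python
# Sketch: assemble the suggested project-level metadata fields into a
# simple "Field: value" readme.txt. All values below are illustrative.
project_metadata = {
    "Title": "Example Survey of Wetland Bird Counts",   # hypothetical dataset
    "Author": "A. Researcher",
    "Date": "2015-09-01",
    "Funder": "Example Funding Agency",
    "Copyright holder": "Example University",
    "Description": "Weekly bird counts at three wetland sites.",
    "Keywords": "ornithology; wetlands; survey",
    "File formats and software used": "CSV; R 3.2",
    "Kind of data": "observational",
    "Language": "English",
}

def build_readme(fields):
    """Render field/value pairs as one 'Field: value' line each."""
    return "\n".join(f"{name}: {value}" for name, value in fields.items()) + "\n"

with open("readme.txt", "w", encoding="utf-8") as fh:
    fh.write(build_readme(project_metadata))
```

Keeping the format this simple mirrors the advice in the excerpt: the goal is a human-readable story of the data, not a formal schema.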
A number of institutions have developed templates to capture metadata in a structured form. Some metadata managers noted the need to keep such forms as simple as possible as it can be difficult to get researchers to fill them in. All agreed data creators needed to be the main source of metadata. But how to inspire data creators to produce quality metadata? New ways of training and outreach are needed.
We also had general agreement on the data elements required to support re-use by others: licenses, processing steps, tools, data documentation, data definitions, data steward, grant numbers and geospatial and temporal data (where relevant). Metadata schemas used include Dublin Core, MODS (Metadata Object Description Schema) and DDI (Data Documentation Initiative's metadata standard). The Digital Curation Centre in the UK provides a linked catalog of metadata standards. The Research Data Alliance's Metadata Standards Directory Working Group has set up a community-maintained directory of metadata standards for different disciplines.
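To make the schema discussion concrete, a dataset description in Dublin Core can be quite small. The sketch below builds one with Python's standard library, using elements from the Dublin Core Metadata Element Set (title, creator, date, description, identifier, rights); the values, including the DOI, are placeholders rather than a real record:

```python
# Sketch: a minimal Dublin Core description of a dataset, built with
# Python's standard library. All values are illustrative placeholders.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

record = ET.Element("record")
for element, value in [
    ("title", "Example Survey of Wetland Bird Counts"),   # hypothetical
    ("creator", "A. Researcher"),
    ("date", "2015-09-01"),
    ("description", "Weekly bird counts at three wetland sites."),
    ("identifier", "doi:10.0000/example"),                # placeholder DOI
    ("rights", "CC BY 4.0"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

dc_xml = ET.tostring(record, encoding="unicode")
print(dc_xml)
```

Richer standards such as MODS or DDI add discipline- and lifecycle-specific detail, but even a record this small supports the basic find-and-cite functions discussed above.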
The importance of identifiers for both the research data and the creator has become more widely acknowledged. DOIs, Handles and ARKs (Archival Resource Key) have been used to provide persistent access. Identifiers are available at the full data set level and for component parts, and they can be used to track downloads and potentially help measure impact. Both ORCID (Open Researcher and Contributor ID) and ISNI (International Standard Name Identifier) are in use to identify data creators uniquely.
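One practical benefit of standard identifiers is that some carry a built-in check digit. ORCID iDs, for example, end in a check character computed with the ISO 7064 MOD 11-2 algorithm that ORCID documents, so a repository intake form can catch mistyped iDs before they enter the metadata. A Python sketch, tested against a frequently cited example iD:

```python
# Sketch: validate an ORCID iD's check digit using the ISO 7064
# MOD 11-2 algorithm that ORCID documents for its identifiers.
def orcid_check_digit(base_digits: str) -> str:
    """Compute the check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check a hyphenated 16-character ORCID iD, e.g. 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    return orcid_check_digit(compact[:15]) == compact[15]

print(is_valid_orcid("0000-0002-1825-0097"))  # → True
print(is_valid_orcid("0000-0002-1825-0096"))  # → False (wrong check digit)
```

Checks like this cannot confirm that an iD belongs to the right person, of course, but they cheaply filter out transcription errors at the point of metadata capture.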
Some have started to analyze the metadata requirements for the research data life cycle, not just the final product. Who are the collaborators? How do various projects use different data files? What kind of analysis tools do they use? What are the relationships of data files across a project, between related projects, and to other scholarly output such as related journal articles? The University of Michigan’s Research Data Services is designed to assist researchers during all phases of the research data life cycle.
Curation of research data as part of the evolving scholarly record requires new skill sets, including deeper domain knowledge, data modeling, and ontology development. Libraries are investing more effort in becoming part of their faculty’s research process and offering services that help ensure that their research data will be accessible if not also preserved. Good metadata will help guide other researchers to the research data they need for their own projects—and the data creators will have the satisfaction of knowing that their data has benefitted others.
Karen Smith-Yoshimura, senior program officer, works on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements.