That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Steven Folsom of Cornell University and Stephen Hearn of the University of Minnesota. Metadata practitioners can cite many examples of established vocabularies or datasets that have become outdated or do not provide for local needs and sensibilities. Slow or unresponsive maintenance models for established vocabularies have tempted some of us to consider distributed models, and the high training thresholds for participating in current models have added to the desire for alternatives. The new PCC Strategic Directions 2018-2021 document points to a more diverse vocabulary landscape.
In theory, linked data would provide the means for local communities to prefer a different label for an established vocabulary’s preferred term for a concept or entity. One might want to reference a local description of a concept or entity not represented (or not represented satisfactorily) in established vocabularies or linked data sources. If these kinds of amendments and additions are made possible in a linked data environment, then others can agree (or not) with the point of view by linking to the new resource. Such a distributed model for managing both terminology and entity description raises issues around metadata stability expectations, metadata interoperability, and metadata maintenance.
We noted that there are social aspects to this issue, not just technical. Numerous vocabularies were created for specific projects that, once funding stopped, remained frozen in time. Observed one discussant: “Nothing is sadder than a vocabulary that someone invented that was left to go stale.”
OCLC Research Library Partners metadata managers discussed whether the impediments to a distributed model were low enough for it to be viable, and whether the benefits would outweigh the challenges. The most common concerns raised about a distributed model were:
- Stability and versioning
- Notifications of changes
- Semantics and their alignments
- Redundancy: how to prevent people from duplicating work on the same entity?
- How to feed local vocabularies into the general environment?
The requirements for distributed vocabulary maintenance converged around:
- Communities of practice need a hub to aggregate and reconcile terms within their own domains. It was noted that different communities of practice might use terms that conflict with others’ terminologies, or mean different things. The Cornell-led IMLS grant on shared vocabularies is producing a white paper and models for reconciliation and aggregation, describing the different pieces that need to be put in place for linked data to proceed. However, this work focused on names, which we all agreed are much easier to reconcile than concepts. Similar ground was covered by the PCC Linked Data Advisory Committee’s Linked Data Infrastructure Models: Areas of Focus for PCC Strategies.
- The model must support syndetic relationships among different vocabularies.
- Replacing text strings with stable, persistent identifiers would facilitate using different labels depending on context. This would accommodate both different languages and scripts (including different spellings within a language, such as American vs. British English) and terms that are more respectful to marginalized communities. We referred to the OCLC Research Works in Progress webinar on Decolonizing Descriptions: Finding, Naming and Changing the Relationship between Indigenous People, Libraries, and Archives, which described the process launched by the Association for Manitoba Archives and the University of Alberta Libraries to examine subject headings and classification schemes and consider how they might be more respectful and inclusive of the experiences of indigenous people.
- Communicate the history of changes and the provenance of each new or modified term. Such transparency would contribute to the trustworthiness of the source. The edit history and discussion pages included with each Wikidata item are a possible model to follow.
- The model must be both scalable and extensible. The model needs to accommodate the proliferation of new topics and terms symptomatic of the humanities and sciences, and facilitate contributions by the researchers themselves. It needs to be flexible enough to co-exist with other vocabularies.
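To make the identifier-based labeling requirement above concrete, here is a minimal sketch in Python. The IRI, labels, and fallback rule are illustrative assumptions for this post, not drawn from any actual vocabulary:

```python
# Sketch: a concept is keyed by a stable, persistent identifier (an IRI),
# and the display label varies by context (language, script, community).
# The IRI and labels below are illustrative placeholders.

CONCEPT_LABELS = {
    "http://example.org/concept/0001": {
        "en-US": "airplanes",
        "en-GB": "aeroplanes",
        "fr": "avions",
    },
}

def preferred_label(iri: str, context: str, fallback: str = "en-US") -> str:
    """Return the label for a concept in the requested context,
    falling back to a default when no context-specific label exists."""
    labels = CONCEPT_LABELS[iri]
    return labels.get(context, labels[fallback])

print(preferred_label("http://example.org/concept/0001", "en-GB"))  # aeroplanes
print(preferred_label("http://example.org/concept/0001", "de"))     # airplanes (fallback)
```

Because applications link to the identifier rather than the string, a community can swap in its own label (or a more respectful one) without breaking anyone else's links.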
Wikidata is an example of a successful model of contributions from a wide variety of different communities. Wikidata is derived from data in Wikipedias, and currently lacks conceptual models for creative works, organizations, and concepts. But it handles person entities well, providing a variety of labels in different languages and scripts and aggregating multiple identifiers referring to the same entity. See, for example, its entry for Jane Austen. However, some Wikidata attributes, such as gender, date of birth, and contact information, might need to be kept local rather than shared to protect the privacy of living persons.
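As an illustration of how Wikidata aggregates multilingual labels behind one stable identifier, the sketch below parses a hand-trimmed sample shaped like the JSON returned by Wikidata's wbgetentities API (Q36322 is the Wikidata item for Jane Austen). This is a sketch against canned data, not a live API call:

```python
import json

# Hand-trimmed sample following the shape of a wbgetentities response:
# https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q36322&props=labels&format=json
SAMPLE_RESPONSE = json.dumps({
    "entities": {
        "Q36322": {  # Wikidata item for Jane Austen
            "labels": {
                "en": {"language": "en", "value": "Jane Austen"},
                "ru": {"language": "ru", "value": "Джейн Остин"},
            }
        }
    }
})

def labels_for(entity_id: str, raw_json: str) -> dict:
    """Extract {language: label} pairs for one entity from a
    wbgetentities-style response."""
    data = json.loads(raw_json)
    labels = data["entities"][entity_id]["labels"]
    return {lang: entry["value"] for lang, entry in labels.items()}

print(labels_for("Q36322", SAMPLE_RESPONSE))
```

Every label, in whatever language or script, hangs off the same Q-identifier, which is what lets communities agree on the entity while disagreeing on what to call it.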
Open questions around distributed vocabularies included:
- Who would take ownership and responsibility to provide stability?
- How could you verify the provenance?
- If we no longer rely on governmental bodies for vocabulary management, what alternatives would there be to measure stability?
Expanding vocabularies to include those used in other communities requires building trust relationships. Our discussions converged around the need for a model of “community contribution” for new terms and community voting. If a concept or term becomes controversial, an authorized editorial group would need to step in and mediate. We also need to acknowledge that our current “consensus environment” excludes a lot of people. Requiring provenance as part of a distributed vocabulary model may help us create an alternative environment.
Karen Smith-Yoshimura, senior program officer, works on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements.