Linked Data Survey results 4–Why and what institutions are publishing (Updated)



OCLC Research conducted an international linked data survey for implementers between 7 July and 15 August 2014. This is the fourth post in the series reporting the results.

By far the two main reasons that the 51 linked data projects/services described in the survey publish linked data are to expose their data to a larger audience on the Web (45) and to demonstrate what could be done with their datasets as linked data (41). Other reasons, in descending order: heard about linked data and wanted to try it out by exposing their data as linked data; to improve Search Engine Optimization (SEO); the institution’s administration requested that the data be exposed as linked data; a national initiative; needing to consume their own linked data; making data available to suppliers; creating a research environment; wanting to make metadata open for future European legislative compliance; wanting to support the initiative and represent “things not strings”.

The types of data published as linked data by the described projects (in descending order):

  • Descriptive metadata
  • Bibliographic data (e.g., MARC records)
  • Authority files
  • Digital collections
  • Encoded Archival Descriptions/Archival finding aids
  • Ontologies/controlled vocabularies
  • Statistical data
  • Data about museum objects
  • Geographic data
  • Holdings data

The largest linked data datasets reported in descending order:

Project/Service | Source | Size
 | OCLC | 15 billion triples
Works | OCLC | 5 billion triples
Europeana | Europeana Foundation | 3,798,446,742 triples for the pilot LOD service
The European Library | The European Library | 2,131,947,229 triples
Linked Open Data | Research Libraries UK | 936,054,853 triples
Opendata | Charles University in Prague | 750 million triples
Semantic Web Collection | British Museum | 100-500 million triples
Drug Encyclopedia | Charles University in Prague | 100-500 million triples
Cedar project | DANS | 100-500 million triples
 | Library of Congress | 100-500 million triples
British National Bibliography | British Library | 50-100 million triples
British Art collection | Yale Center for British Art | 57 million triples

In addition, the Digital Public Library of America reported its dataset as 50-100 million triples, but it’s not in production yet.

Licenses: 16 projects/services do not announce any explicit license. Of the ones that do, CC0 1.0 Universal is the most common (15), followed by Public Domain Dedication and License or PDDL (7). A few apply Open Data Commons Attribution (ODC-BY), Open Data Commons Open Database License (ODC-ODbL) or ODC-ShareAlike Community Norms. Other licenses mentioned: ODC-BY-SA (Attribution-ShareAlike), but considering ODC-BY; Creative Commons Attribution-NonCommercial-NoDerivatives (BY-NC-ND); French State Open License-based; and a link to a specific set of data services terms.

RDF Vocabularies and Ontologies: The ones used by the most projects:

  • SKOS – 38
  • FOAF – 30
  • Dublin Core Terms – 29
  • Dublin Core – 27
  • –  22

Here’s the alphabetical list of the RDF Vocabularies and Ontologies used; those that include uses by Dewey, FAST, ISNI, VIAF, and Works are asterisked.

RDF Vocabulary & Ontology # of Projects Dewey FAST ISNI VIAF Works
Archival ontology 1
Bibframe 6
Bibliographic Ontology 14
Biographical Ontology 7
British Library Terms 4
Dublin Core 27   *   *   *
Dublin Core Terms 29   *   *       *
European Data Model Vocabulary 8
Event Ontology 4
FOAF 30    *    *   *  *       *
Organization Ontology 10
RDA 10   *   *
22   *   *       *       *
SKOS 38    *   *   *   *       *
WGS84 Geo Positioning 11    *   *
Other 34   *       *       *

Other vocabularies or ontologies cited:

  • CERIF semantic vocabularies
  • CIDOC-CRM (8 projects mentioned this)
  • Data Catalog Vocabulary
  • DPLA metadata application profile
  • Ecological Metadata Language (EML)
  • Fabio (FRBR-aligned bibliographic ontology)
  • FRBR
  • FRBRer model
  • ISNI
  • Local vocabulary
  • Metadata Object Description Schema
  • Music Ontology
  • ontology
  • OAI ORE Terms
  • OWL
  • OWL 2 Web Ontology Language
  • Radatana
  • rdaGr2
  • RDF Schema
  • Reviews Ontology
  • Semantic Web for Research Communities
  • SIOC (Semantically-Interlinked Online Communities)
  • VIVO Core

Barriers or challenges encountered in publishing linked data

Learning enough:

  • Little documentation or advice on how to build the systems, so had to wing it.
  • Steep learning curve for staff to get to grips with Semantic Web technologies.
  • So many serializations and understanding how/when to use them.
  • Developing technology made it challenging to select options or identify best practice.
  • Determining the workflow – lots of planning, discussion, trial and error, looked at what was done for other projects.
  • The intended audience (social historians) is usually not trained to query SPARQL endpoints.
  • Learning the various RDF serializations and writing XSLT to convert data into those serializations took some time. One of the largest challenges was not having examples of similar small-scale projects from other libraries to use as a reference.
  • Selecting appropriate ontologies to represent our data. We worked closely with our Metadata department on choosing ones that we agreed were the most appropriate, or widely used.
  • This particular project hinges on the curators describing the relationships, and it can be difficult to explain the Linked Data framework to non IT/LD experts.
  • Dealing with multiple agencies which were exporting and then importing the metadata; the former had no experience working with bibliographic linked data, and the latter were run by non-librarians. This led on occasion to misunderstandings of intention and purpose.
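
Several of the comments above mention the number of RDF serializations. As a rough illustration of why that can be confusing, here is a minimal sketch that writes the same statement in two of them, N-Triples and Turtle, using only the Python standard library. The example.org URI, the label and all function names are placeholders made up for this sketch, not anything reported by a survey respondent; a real project would more likely rely on an RDF library than on hand-built strings.

```python
import re

SKOS = "http://www.w3.org/2004/02/skos/core#"
PREFIXES = {"skos": SKOS}

# Hypothetical record: the example.org URI and the label are placeholders.
triple = ("http://example.org/subject/42", SKOS + "prefLabel", ("Linked data", "en"))


def escape_literal(text):
    """Escape characters that may not appear raw inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(s, p, o):
    """N-Triples: one triple per line, every IRI written out in full."""
    label, lang = o
    return f'<{s}> <{p}> "{escape_literal(label)}"@{lang} .'


def shorten(uri, prefixes):
    """Return a prefixed name when the local part is a simple identifier."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            local = uri[len(namespace):]
            if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", local):
                return f"{prefix}:{local}"
    return f"<{uri}>"


def to_turtle(s, p, o, prefixes):
    """Turtle: same triple, with namespace prefixes declared once at the top."""
    label, lang = o
    header = "\n".join(f"@prefix {k}: <{v}> ." for k, v in prefixes.items())
    body = f'{shorten(s, prefixes)} {shorten(p, prefixes)} "{escape_literal(label)}"@{lang} .'
    return f"{header}\n\n{body}"


print(to_ntriples(*triple))
print(to_turtle(*triple, PREFIXES))
```

Both outputs carry exactly the same information. N-Triples is verbose but easy to process line by line, while Turtle is easier for people to read, which is one plausible reason projects end up offering more than one serialization.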

Data cleanup:

  • Inconsistency in legacy data – the project highlighted some issues with legacy data. This provided us with an opportunity to enhance our legacy data in order to improve conversion.
  • It was very difficult to disaggregate data correctly until we switched to OpenRefine.
  • How to handle version control
  • Cleaning the data. Establishing the links; we did this partially automatically (by string matching) with manual correction.
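
The last comment describes linking partly by string matching, with manual correction afterwards. The respondent did not say how this was done, so the following is only a sketch of one way such a workflow might look: exact matches on normalized labels are accepted automatically, near matches go to a review queue for a person to check, and everything else is left unlinked. The labels, the example.org URIs, the function names and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in label.lower())
    return " ".join(cleaned.split())


def propose_links(local_labels, authority, review_threshold=0.85):
    """Split labels into auto-accepted links, links needing review, and unmatched.

    authority maps an authority label to its URI.
    """
    index = {normalize(label): uri for label, uri in authority.items()}
    accepted, needs_review, unmatched = {}, [], []
    for label in local_labels:
        key = normalize(label)
        if key in index:                      # exact match after normalization
            accepted[label] = index[key]
            continue
        best_key, best_score = None, 0.0
        for candidate in index:               # otherwise look for the closest label
            score = SequenceMatcher(None, key, candidate).ratio()
            if score > best_score:
                best_key, best_score = candidate, score
        if best_key is not None and best_score >= review_threshold:
            needs_review.append((label, index[best_key], round(best_score, 2)))
        else:
            unmatched.append(label)
    return accepted, needs_review, unmatched


# Hypothetical authority file and local headings.
authority = {
    "Dickens, Charles": "http://example.org/auth/1",
    "Austen, Jane": "http://example.org/auth/2",
}
local = ["dickens  charles", "Austin, Jane", "Unknown Person"]

accepted, needs_review, unmatched = propose_links(local, authority)
print(accepted)
print(needs_review)
print(unmatched)
```

Keeping near matches out of the accepted set is the point of the design: a misspelled heading such as "Austin, Jane" is probably the same person, but a human should confirm it before a link is published.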

Technical issues:

  • It can be difficult to ascertain who owns data; we have had to remove one dataset…because its owners changed their minds about publishing it.
  • A lot of software is not mature, so there is a challenge to keep up with updates, changing components as better software becomes available.
  • Size of the file makes it difficult to consume.
  • At the moment the end-user tools for data entry and annotation of knowledge graphs are sparse. Also, there are no tools in production for data modeling that do not require heavy IT support.
  • To allow our data to be used outside of the library industry, we need to provide our data in a widely understood descriptive vocabulary.  SKOS is too library specific for this.  We have chosen for most of the information, but its orientation is more commercial than descriptive, which makes the data less useful.
  • Integration with existing web app infrastructure.
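
One respondent noted that the size of the file makes it difficult to consume. A common way to cope with a large dump, assuming it is published as N-Triples (one triple per line), is to stream it rather than load it into memory. The sketch below is illustrative only: the sample data and function names are made up, and the parsing is deliberately simplistic (it splits on whitespace and does not validate the syntax).

```python
import io
from collections import Counter


def iter_triples(lines):
    """Yield (subject, predicate, object) one line at a time.

    Works on any iterable of lines, so a real dump could be passed as an
    open file handle (for example one returned by gzip.open in text mode).
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        subject, predicate, rest = line.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):                  # drop the terminating full stop
            obj = obj[:-1].rstrip()
        yield subject, predicate, obj


def count_predicates(lines):
    """Count how often each predicate occurs without holding the data in memory."""
    return Counter(predicate for _, predicate, _ in iter_triples(lines))


# Tiny stand-in for a much larger dump; the example.org URIs are placeholders.
sample = io.StringIO(
    "# sample data\n"
    '<http://example.org/a> <http://purl.org/dc/terms/title> "A title" .\n'
    "<http://example.org/a> <http://purl.org/dc/terms/creator> <http://example.org/p1> .\n"
    "\n"
    '<http://example.org/b> <http://purl.org/dc/terms/title> "Another title" .\n'
)

counts = count_predicates(sample)
print(counts)
```

Because the generator touches one line at a time, memory use stays roughly constant however large the file is; only the per-predicate counts are kept.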

[Originally posted 2014-09-03, updated 2014-09-04]

Coming next: Linked Data Survey results – Technical details


