Linked Data Survey results 6 – Advice from the implementers

 

 

OCLC Research conducted an international linked data survey for implementers between 7 July and 15 August 2014. This is the sixth (and last) post in the series reporting the results.

An objective in conducting this survey was to learn from the experiences of those who had implemented or were implementing linked data projects/services.  We appreciate that so many gave advice. About a third of those who have implemented or are implementing a linked data project are planning to implement another within the next two years; another third are not sure.

Asked what they would do differently if they were starting their project again, respondents answered with issues clustered around organizational support and staffing, vocabularies, and technology. One noted that legal issues seriously delayed the release of their linked data service and that legal aspects need to be addressed early.

Organizational support and staffing

  • Have a clear mandate for the project. Our issues have stemmed from our organization, not the technology or concept.
  • It would have been useful to have a greater in-house technical input.
  • With hindsight we have more realistic expectations. If funding allowed, I would hire a programmer for the project.
  • Attempt to garner wider organisational support and resources before embarking on what was in essence a very personal project.
  • We also would have preferred to have done this as an official project, with staff resources allocated, rather than as an ad-hoc project that we’ve crammed into our already full schedules.
  • Have a dedicated technical project manager, or at least a bigger chunk of time.
  • Have more time planned and allocated for both myself and team members.

Vocabularies

  • Build an ontology and formal data model from the ground up.
  • Align concepts we are publishing with other authorities, most of which didn’t exist at the time.
  • Vocabulary selection: avoid some of the churn related to that process.
  • Make more accurate and detailed records so that it is easier for people using the data to clear up ambiguity of similar names.
  • I might seek a larger number of partners to contribute their controlled vocabularies or thesauri in advance.

Technology

  • We would immediately use Open Refine to extract and clean up data after the first export from Access
  • We would provide a SPARQL endpoint for the data if we had the opportunity.
  • We would give more thought to service resilience from the perspective of potential denial of service attacks.
  • Define the schema well before generating the records, and use it to validate all of the records before storing them in the system’s database.
  • It is still a pity that the Linked Data Pilot is not more integrated with the production system. It would have been easier if the LOD principles had been included in this production system from the beginning.
  • We might have done more to help our vendor understand the complexity of the LCNAF data service as well as the complexity of the MARC authority format.
  • Better user experience; we chose to focus on data mining vs data use.
  • Transforming the source data into semantic form before attempting to process it (clustering, clean up, matching).
  • A stable infrastructure is vital for the scalability of the project.
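One respondent's advice above, to define the schema first and validate every record against it before storing anything, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `SCHEMA`, the `validate` and `store_valid` helpers, and the in-memory list standing in for the database are all hypothetical and do not come from any survey response.

```python
# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "id": (str, True),
    "label": (str, True),
    "same_as": (list, False),
}


def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, (expected_type, required) in schema.items():
        if field not in record:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Flag fields the schema does not know about.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def store_valid(records, database):
    """Append only valid records to `database`; return rejected ones with reasons."""
    rejected = []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            database.append(record)
    return rejected
```

Keeping the rejected records together with their reasons, rather than silently dropping them, gives the team a clean-up list to work from.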

General advice

Much of the advice for both those considering projects to consume linked data and those considering projects to publish linked data clusters around preparation and project management:

  • Ask what benefit doing linked data at all will really have.
  • There is more literature and online information relating to consuming linked data than there was when we started so our advice would be to read as widely as possible and consult with experts in the community.
  • Get a semantic web expert in the team
  • The same as any other project: have a detailed programme.
  • Have a focus. Do your research. Overestimate time spent.
  • Take a Linked Data class
  • Estimate the time required for the project and then double it.  The time to explain details of MARC, EAD, and local practices and standards to the vendor, to test functionality of the application, and to test navigational design elements of the application all require dedicated blocks of time.
  • Bone up on your tech skills.  It’s not magic; there is no wand you can wave.
  • Basic project management, basic data management, basic project planning are really important at the onset.
  • Having a detailed program before starting. Get institutional commitment.  Unless the plan is to do the smallest thing… the investment is great enough to warrant some kind of administrative blessing, at the minimum.
  • Take advantage of the many great (and free) resources for learning about RDF and linked data.
  • Start with a small project and then apply the knowledge gained and the tools built to larger scale projects.
  • Find people at other institutions who are doing linked data so you can bounce ideas off of each other.
  • Plan, plan, plan! Do research. Understand all that there is going on and what challenges you will have before you reach them.
  • Automate, automate, automate

Advice for those considering a project to consume linked data

  • Linking to external datasets is very important but also very difficult.
  • Find authorities for your specific domain from the outset, and if they don’t exist don’t be afraid to define and publish your own concepts.
  • Firm understanding of ontologies
  • Use CIDOC CRM / FRBRoo for cultural heritage sources. It will be far more cost-effective and provide the highest quality of data that can be integrated, preserving the variability and language of your data.
  • Pick a problem you can solve. Start with schema.org as core vocabulary. Lean toward JSON-LD instead of RDF/XML. Like agile, fail quickly and often. Store the index in a triplestore.
  • Decide what granularity of data you want to make available as linked data, setting semantics aside for now. Our data cannot be transformed into linked data one-to-one; some data will not be available as linked data. If you want to make your data discoverable, then schema.org semantics will work best.
  • Sometimes the data available just won’t work with your project. Keep in mind that something may look like a match at first but the devil is in the details. 
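For readers unfamiliar with the suggestion above to start with schema.org as a core vocabulary and lean toward JSON-LD, here is a minimal sketch using only Python's standard library. The input field names (`uri`, `title`, `author`, `same_as`), the `to_jsonld` helper, and the example.org identifiers are assumptions made for illustration; `Book`, `Person`, `name`, `author`, and `sameAs` are schema.org terms.

```python
import json


def to_jsonld(record):
    """Map a flat catalogue record (hypothetical field names) to schema.org JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Book",
        "@id": record["uri"],
        "name": record["title"],
    }
    # Optional fields are only emitted when the source record has them.
    if record.get("author"):
        doc["author"] = {"@type": "Person", "name": record["author"]}
    if record.get("same_as"):
        doc["sameAs"] = record["same_as"]
    return doc


# Placeholder record; the URI is not a real identifier.
record = {
    "uri": "http://example.org/book/1",
    "title": "An Example Title",
    "author": "Example, Author",
}
print(json.dumps(to_jsonld(record), indent=2))
```

Because the result is ordinary JSON, it can be embedded in a web page or handed to existing JSON tooling without an RDF library, which may be part of why respondents favoured it.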

Advice for those considering a project to publish linked data

General advice: “Try to consume it first!”

Project management

  • It’s possible to participate in linked data projects even by producing data and leaving the work of linking to others.
  • Managing expectations of content creators is tough – people often have expectations of linked data that aren’t possible. The promise of being able to share and link things up can efface the work required to prepare materials for publication.
  • Always look at what others have done before you. Build a good relationship with the researcher with whom you are working; leverage the knowledge and experience of that person or persons. Carefully plan your project ahead of time, in particular the metadata.
  • Look at the larger surrounding issues.  It is not enough to just dump your data out there.  Be prepared to perform some sort of analytics to capture information as to uses of the data.  Also include a mechanism for feedback about the data and requested improvements/enhancements.  The social contract of linked data is just as important as the technical aspects of transforming and publishing the data.
  • Just do it, but consider if you’re just adding more bad data to the web — dumping a set of library records to RDF is pointless. Consider the value of publishing data. Reusing data is probably more interesting.
  • The assumption that the data needs to be there in order to be used is, I think, wrong. The usefulness of data is in its use; create a service one uses oneself and it is valuable and useful. Whether others actually use it is irrelevant.
  • Pay attention to reusing existing ontologies in order to improve interoperability and user comprehension of your published data.

Technical advice

  • Publish the highest quality possible that will also achieve semantic and contextual harmonisation. You will end up doing it again otherwise and therefore it is far more cost effective and gets the best results.
  • Don’t use fixed field/value data models. For cultural heritage data use CIDOC CRM / FRBRoo.
  • Offer a SPARQL endpoint to your data.
  • Use JSON-LD.
  • Museums need to take a good look at their data and make sure that they create granular data, i.e. each concept (actors, keywords, terms, objects, events, …) needs to have unique ids, which in turn will be referenced in URIs. Also publishing linked data means embracing a graph data structure, which is a total departure from traditional relational data structure: linked data forces you to make explicit what is only implicit in the database.  Modeling data for events is challenging but rewarding. Define what data entities your museum is responsible for… Being able to define URIs for entities means being able to give them unique identifiers and there are many data issues that need to be taken care of within an institution.  Also, very important is that producing LOD requires the data manager to think differently about data, and not about information.  LOD requires that you make explicit knowledge that is only implicit in a traditional relational database.
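The last point, that unique ids become URIs and that a graph makes explicit what a relational row leaves implicit, may be easier to see in a small sketch. Everything here is hypothetical: the `http://example.org/` namespace, the `vocab/` property names, and the row layout are placeholders, and a real project would more likely reuse an existing ontology such as CIDOC CRM, as respondents recommend. The output is written in N-Triples syntax with only minimal escaping.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real service


def uri(kind, local_id):
    """Mint a URI from an entity type and its unique local identifier."""
    return f"<{BASE}{kind}/{local_id}>"


def literal(text):
    """Serialize a plain string as an N-Triples literal (backslash and quote only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def vocab(term):
    """Hypothetical property URI; stands in for terms from a real ontology."""
    return f"<{BASE}vocab/{term}>"


def row_to_triples(row):
    """Turn one flat object row into explicit statements about three entities."""
    obj = uri("object", row["object_id"])
    actor = uri("actor", row["maker_id"])
    # The production event is only implicit in the row; here it gets its own URI.
    event = uri("event", f"production-{row['object_id']}")
    return [
        (obj, vocab("title"), literal(row["title"])),
        (actor, vocab("name"), literal(row["maker_name"])),
        (event, vocab("produced"), obj),
        (event, vocab("carriedOutBy"), actor),
    ]


row = {"object_id": "42", "title": "Vase", "maker_id": "7", "maker_name": "Unknown potter"}
for s, p, o in row_to_triples(row):
    print(f"{s} {p} {o} .")
```

A single database row yields three identifiable entities (object, actor, event), each of which other datasets can now link to.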

Recommended Resources

This is a compilation of resources (conferences, linked data projects, listservs, websites) that respondents found particularly valuable in learning more about linked data.

Conferences valuable in learning more about linked data: American Medical Informatics Association meetings, Computer Applications in Archaeology, Code4Lib conferences, Digital Library Federation’s forums, Dublin Core Metadata Initiative, European Library Automation Group, European Semantic Web Conferences, International Digital Curation Conference, International Semantic Web Conference, Library and Information Technology Association’s national forums, Metadata and Digital Object Roundtable (in association with the Society of American Archivists), Scholarly Publishing and Academic Resources Coalition conferences, Semantic Web in Libraries, Theory and Practice of Digital Libraries

Linked data projects implementers track:

  • 270a Linked Dataspaces
  • AMSL, an electronic management system based on linked data technologies
  • Library of Congress’ BIBFRAME (included in the survey responses)
  • Bibliothèque Nationale de France’s Linked Open Data project
  • Bibliothèque Nationale de France’s OpenCat: Interesting data model – lightweight FRBR model together with reuse of commonly used web ontologies (DC, FOAF, etc.); scalable open source platform (cubicweb). OpenCat aims to demonstrate that data published on data.bnf.fr can be re-used by other libraries, in particular public libraries.
  • COMSODE (Components Supporting the Open Data Exploitation)
  • Deutsche Nationalbibliothek’s Linked Data Service
  • Yale Digital Collections Center’s Digitally Enabled Scholarship with Medieval Manuscripts, linked data-based.
  • ESTC (English Short-Title Catalogue): Moving to a linked data model; tracked because one of the aims is to build communities of interest among researchers.
  • Libhub: Of interest because it has the potential to assess the utility of BIBFRAME as a successor to MARC21.
  • LIBRIS, the Swedish National Bibliography
  • Linked Data 4 Libraries (LD4L): “The use cases they created are valuable for communicating the possible uses of linked data to those less familiar with linked data and it will be interesting to see the tools that are developed as a result of the projects.” (Included in the survey responses)
  • Linked Jazz: Reveals relationships of the jazz community, something similar to what a survey respondent wants to accomplish.
  • North Carolina State University’s Organization Name Linked Data: Of interest because it demonstrates concepts in practice (included in the survey responses).
  • Oslo Public Library’s Linked Data Cataloguing: “It is attempting to look at implementing linked data from the point of view of actual need… of a real library for implementation. Cataloguing and all aspects of the system will be designed around linked data.” (Included in the survey responses)
  • Pelagios: Uses linked data principles to increase the discoverability of ancient data through place associations; it was a major spur for a respondent’s project.
  • PeriodO: A gazetteer of scholarly assertions about the spatial and temporal extents of historical and archaeological periods; addresses spatial temporal definitions.
  • Spanish Subject Headings for Public Libraries Published as Linked Data (Lista de Encabezamientos de Materia para las Bibliotecas Públicas en SKOS)
  • OCLC’s WorldCat Works (included in the survey responses)

Listservs: bibframe@listserv.loc.gov (Bibliographic Framework Transition Initiative Forum), Code4lib@listserv.nd.edu, DCMI (Dublin Core Metadata Initiative) listservs, data-ac-uk@jiscmail.ac.uk, dlf-announce@lists.clir.org (Digital Library Federation), lod-lam@googlegroups.com, public-ldp@w3.org (linked data platform working group), semantic-web@w3.org

Websites:

Analyze the responses yourself!

If you’d like to apply your own filters to the responses, or look at them more closely, the spreadsheet compiling all survey responses (minus the contact information which we promised we’d keep confidential) is available at: http://www.oclc.org/content/dam/research/activities/linkeddata/oclc-research-linked-data-implementers-survey-2014.xlsx