Analysis Methodology for Museum Data

In a previous post, I shared some background about the data analysis phase of our Museum Data Exchange Mellon grant, and posted some of the questions our museum participants wanted to have answered. In the meantime, we have created a spreadsheet [pdf] which captures our ideas to date about which questions we may want to ask of the 850K CDWA Lite XML records from 9 museums. Note that the methodology captured by this spreadsheet lays out a landscape of possibilities; it is not a definitive checklist of all the questions we will answer as part of this project. Only as we get deeper into the analysis will we know which questions are actually tractable with the tools we have at hand. I’d appreciate any thoughts on additional lines of inquiry we could pursue with our analysis, or other observations!

Since the spreadsheet attempts to structure the questions, I thought I’d provide a little gloss on the organizing principles at work.

The columns of the spreadsheet should be largely self-explanatory. Test Method tries to give a sense of whether we expect a test to be machine-driven or to require human intervention, which may serve as a gauge for the kind of labor required to make the test work. Primary focus limits our investigation to the most meaningful subsets of CDWA Lite, i.e. required or highly recommended data elements. (The totality of CDWA Lite consists of 134 possible units of information.)

You’ll also see along the left side of the spreadsheet that we’ve organized the questions into several categories:

The Metrics section deals with questions which have objective, factual answers.

  • In this section, we’ll ask questions about Conformance: Does the data conform to what the applicable standards (CDWA Lite and CCO) stipulate? Since CCO conformance only lends itself to a limited amount of machine testing, and requires a considerable amount of expertise, we have hired CCO co-author Patricia Harpring and Antonio Beecroft (both from the Getty Research Institute) to spend some of their weekend and vacation time to evaluate the CCO-ness of the data.
  • We’ll also ask questions about Connections: what relationships between records does the data support? The questions in this subsection try to triangulate the issue of interoperability from various angles. A quite visceral way of visualizing interoperability could be to show each record with the search hits it produces in the aggregate collections if its data values are used as search terms.
The section on Evaluation deals with questions which are much more subjective. They ask the question “How well does all of this work, and can it be improved?”, and to answer these questions, we’ll first need to define what it means for things to work properly. This section in particular is much more labor intensive, and in this draft much more sketchy. We’ll see how far we can get into it in our actual analysis work.
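
To make the machine-testable end of the Metrics questions a little more concrete, here is a minimal sketch of what a presence check for individual data elements could look like. It is not the project's actual tooling. The list of element names is an illustrative assumption (the real list of required and highly recommended elements would have to come from the CDWA Lite specification), the sample records are made up, and the sketch matches on local element names so that it does not depend on any particular namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of elements to check. The actual required and highly
# recommended elements should be taken from the CDWA Lite specification.
ELEMENTS_TO_CHECK = ["objectWorkType", "title", "displayCreator", "recordID"]


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def populated_elements(record_xml):
    """Return the names of all elements that carry non-blank text."""
    root = ET.fromstring(record_xml)
    return {
        local_name(el.tag)
        for el in root.iter()
        if el.text and el.text.strip()
    }


def missing_element_counts(records):
    """Count, per element name, how many records lack a populated value."""
    counts = {name: 0 for name in ELEMENTS_TO_CHECK}
    for record_xml in records:
        present = populated_elements(record_xml)
        for name in ELEMENTS_TO_CHECK:
            if name not in present:
                counts[name] += 1
    return counts


# Made-up records, heavily simplified compared to real CDWA Lite XML.
SAMPLE_RECORDS = [
    "<record><objectWorkType>print</objectWorkType>"
    "<title>Untitled</title><recordID>1</recordID></record>",
    "<record><title> </title><displayCreator>Unknown</displayCreator>"
    "<recordID>2</recordID></record>",
]

if __name__ == "__main__":
    for name, count in missing_element_counts(SAMPLE_RECORDS).items():
        print(f"{name}: missing in {count} of {len(SAMPLE_RECORDS)} records")
```

A check like this only tells us whether a value is present. Whether the value follows CCO is exactly the kind of question that still needs human expertise.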

  • The subsection Suitability asks: How well do the records support search, retrieval, aggregation? Again, since you need to define a baseline for what a good search, retrieval and aggregation experience looks like, this is quite a difficult question to answer.
  • Under Enhancement, we ask questions about how the suitability for search, retrieval, aggregation can be improved. This is a potentially huge area of inquiry, and a lot of experience already exists with aggregators of data. Maybe our contribution will be to investigate how much enhancement of the data is gained by using thesauri to intermediate searches, and we will try to bring the Getty vocabularies to bear on this question.
Take a look at the spreadsheet [pdf], and let me know what you think!
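
P.S. For the Enhancement question about thesauri, one way to measure the gain could be to run the same search with and without vocabulary mediation and compare the hits. The following is only a toy sketch under stated assumptions: the thesaurus entries and records are invented for illustration and are not taken from the Getty vocabularies or from participant data, and the function names are hypothetical.

```python
# Toy thesaurus, invented for illustration: preferred term -> variant terms.
TOY_THESAURUS = {
    "photograph": ["photo", "photographic print"],
}

# Made-up records: record ID -> a single descriptive value.
TOY_RECORDS = {
    "A1": "Photograph",
    "B7": "photo",
    "C3": "painting",
}


def build_variant_index(thesaurus):
    """Map every preferred and variant term (lowercased) to its preferred term."""
    index = {}
    for preferred, variants in thesaurus.items():
        for term in [preferred, *variants]:
            index[term.lower()] = preferred
    return index


def search(records, query, variant_index=None):
    """Return IDs of records whose value matches the query.

    Without a variant index, this is a case-insensitive exact match.
    With one, query and record values are both mapped to their preferred
    term first, so variants of the same concept match each other.
    """
    def normalize(term):
        term = term.strip().lower()
        if variant_index is None:
            return term
        return variant_index.get(term, term)

    wanted = normalize(query)
    return [rid for rid, value in records.items() if normalize(value) == wanted]


if __name__ == "__main__":
    index = build_variant_index(TOY_THESAURUS)
    plain_hits = search(TOY_RECORDS, "photograph")
    mediated_hits = search(TOY_RECORDS, "photograph", index)
    print("plain:", plain_hits)
    print("mediated:", mediated_hits)
```

The difference in hit counts between the two runs is a crude stand-in for how much enhancement the vocabulary buys; a real test would also have to look at false matches.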