Europeana at the Halfway Mark

For the recent LIBER/EBLIDA workshop on digitization at the Koninklijke Bibliotheek in The Hague, I was asked to provide a view on Europeana from the US perspective. Of course, I neither speak for the US nor do I have inside information about Europeana, but I’d been following it from afar and had read just about everything I could get my hands on, so I gamely took the challenge. [Only someone as bloodied by digital paper cuts as I would dare to take on Europeana.] I wasn’t bombarded with rotten tomatoes, courgettes, and aubergines, so I guess it went OK. My remarks are now available in Volume 19 (2009), No. 2 of the LIBER QUARTERLY.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.

5 Comments

  1. In the LIBER QUARTERLY article, you say (nicely and rightly):

    “Mapping is a mythical grail.

    What follows is a gross generalization (to which I have found no exceptions): Librarians want metasearch or federated searching. They do not like their own implementation. They blame the deficiency on metadata mapping. If they just had a better crosswalk, it would be better. So they change their software, retool with better mapping, and they still do not like it.

    The reason is that a butterfly specimen has entirely different metadata than a painting of a butterfly. Who is the creator and what is the title or subject of a butterfly specimen? What is the Latin name or habitat of an impressionistic rendition of a butterfly? Just how many fields can be mapped between these two records?

    My recommendation is to require a very small set of common elements and allow the rest to aid free text searching. Europeana’s adoption of OAI-PMH and Dublin Core is a good thing. It precludes the development of yet another approach and adopts one that others may already be using. Requiring some very basic elements makes some advanced searches or filtering possible. If participants are allowed to leave required elements empty, it will render those documents not discoverable. Allowing data beyond what is required will allow for better retrieval, but just through free text searching. That’s pretty much what users do anyway, type words in a box. Google manages to make it work.”

    I’m a librarian, and I agree with these comments on mapping and the wisdom of a few common fields across all domains (and the corollary wisdom of keeping all the particular metadata from the various domains accessible, too, but as keywords).

    Is there, though, any agreement on what data elements are in that small number? Or must that agreement vary from project to project?

    I can think of 5 that seem universal to me. Each element answers a question one could ask about the resource or about using it.

    1. What is it?

    A butterfly, an image of a butterfly, a poem about a butterfly, a piece of jewelry shaped like a butterfly? This may turn out to be more than one question. For example: What general kind of thing is it? and What specific thing is it? An animal. A butterfly. A Monarch butterfly. A juvenile female Monarch butterfly. For some things, a sub-question may be What is it made of?

    Note: when needed, the metadata that answers this question needs to distinguish clearly between the object itself and any digital representations of it. This distinction must be carried through in the questions that follow.

    2. What is it called?

    This isn’t necessarily a title. It could be a specimen number. But if it is collected and described at all, it needs to be called something.

    3. How old is it?

    If made, when was it made? If natural, what is its estimated age? Note that these two variations likely won’t share a common date format: for example, a book published in 1924 versus a bone fragment from 40-50 million years ago.

    4. Where is it?

    This could be the physical location or a URI or both.

    5. Who may use it? Or, Can I use it?

    Maybe, this question shouldn’t be considered part of the metadata about the object since it is metadata about the use of the thing. Still, though, use is so tightly linked to one’s interest in the object that this question will be asked and needs to be answered.

    Those are my 5.
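
    To make the idea concrete, here is a minimal sketch of a record holding those five answers, with everything else funneled into free-text keywords. The field names and sample values are my own invention for illustration, not any standard:

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class CommonRecord:
        """Five common elements; all other native metadata feeds keyword search."""
        what: str      # 1. What is it?
        called: str    # 2. What is it called? (title, specimen number, ...)
        age: str       # 3. How old is it? (free-form; formats will vary)
        where: str     # 4. Where is it? (physical location and/or URI)
        rights: str    # 5. Who may use it?
        keywords: list = field(default_factory=list)  # everything else, as-is

    specimen = CommonRecord(
        what="Monarch butterfly specimen (Danaus plexippus)",
        called="Specimen no. 1872",
        age="collected 1924",
        where="Natural History Museum, drawer 14",
        rights="research use only",
        keywords=["Lepidoptera", "North America", "milkweed habitat"],
    )
    ```

    The point of the `keywords` catch-all is that nothing from the source record is thrown away; it is simply searchable only as free text rather than as a fielded element.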

  2. That sounds like a good user-driven basis for selecting the few required fields, and I would agree that we’d want that information in an item’s description.

    However, in an aggregation like Europeana — or in fact in most metasearching contexts where the data to be aggregated already exists — you can ask for, but not demand, certain metadata elements. For instance, early on when we aggregated archival finding aids, the only fields we could count on were title (even though the title might be something less than helpful, like “Selected Papers”) and the name of the contributing institution (which we often had to supply, since it usually doesn’t appear in what were intended to be local records).

    Trying to take advantage of the power of EAD-encoded finding aids resulted in misleading results. For example, offering a search index on tagged geographic place names resulted in hugely under-reported results, because that tag wasn’t widely used. Instead we offered an advanced search box for geographic place names, but it was actually doing a full-text search. It just reminded people that they could search on place names — and the results were pretty good.
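
    The trick described above — a labeled “place” box that quietly runs full text — can be sketched in a few lines. This is an illustrative toy, not the actual implementation; the sample finding-aid records are invented:

    ```python
    def place_search(query, records):
        """A 'place name' search box that actually runs full text over every
        field, because the tagged <geogname> element is too sparse to trust."""
        q = query.lower()
        return sorted(rid for rid, rec in records.items()
                      if any(q in str(v).lower() for v in rec.values()))

    finding_aids = {
        "aid-1": {"title": "Selected Papers",
                  "scope": "Correspondence from The Hague, 1900-1930"},
        "aid-2": {"title": "Smith Family Papers",
                  "scope": "Farm records, Iowa"},
    }
    print(place_search("hague", finding_aids))  # ['aid-1']
    ```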

    The other problem is that even if you could get people to agree to provide the same 5 elements, unless there are content standards for populating those elements, you haven’t achieved much. As you say, the dates will be stated in inconsistent ways, so we can’t use them for date-range searching or even for display on a timeline. If different people answer your first question, What is it?, by saying a butterfly is: an object, a life form, a butterfly, a monarch, a Lepidoptera, a Danaus plexippus… we haven’t made much progress on improving searching or browsing.
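
    The date problem is easy to demonstrate. A naive parser (my own sketch, not anyone’s production code) that expects a four-digit year handles the book but simply gives up on the fossil, so no single date index can serve both:

    ```python
    import re

    def parse_year(age: str):
        """Naive extraction of a comparable year from an 'age' element.
        Works for '1924'; has no answer for deep-time phrasings."""
        m = re.fullmatch(r"(\d{4})", age.strip())
        return int(m.group(1)) if m else None

    print(parse_year("1924"))                     # 1924
    print(parse_year("40-50 million years ago"))  # None -- no shared content standard
    ```

    Without an agreed content standard (or a much smarter normalizer), records like the second one silently drop out of any date-range search or timeline display.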

    While the Dublin Core elements were originally intended to identify the few necessary fields for discovery, they’ve been stretched and twisted in ways that defeat the original purpose. And there’s still debate on whether the record should describe the original thing or the digital version. So while the location of the Taj Mahal is Agra in India, the location of the photo might be the Hensley Photo Library at the University of Chicago, and the location of the digitized image might be (to be completely ridiculous) the user’s temporary browser cache.

    I think we’ve all so badly wanted what we know is possible, that we find it hard to make the needed compromises in order to offer good-enough access.

  3. Ricky,

    We’ve been drinking the same kool-aid. As part of one of the in-house Mellon-funded “collection collaborative” grants, we worked at Yale on a cross collection search tool to help researchers find materials in the library collections, the natural history museum, two art museums, and other collections. We learned pretty much the same lessons you stated above.

    The index we created in a proof-of-concept tool depended largely on keyword because the descriptive metadata was used “as is.” Each collection gave us what they had. (We mapped what we could to MODS for our own convenience and to produce a bit of fielded searching, but that was an add-on to the native metadata.) The result, though, was a fast, dirty, and effective search tool, or, at least, proof that we should actually build one and use it. That task is now underway.
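
    The core of such a tool — a keyword index built over native metadata “as is,” with no mapping required — is simple to sketch. This is a minimal illustration under invented sample data, not the Yale implementation:

    ```python
    from collections import defaultdict

    def build_index(records):
        """Inverted index over every metadata value, used as-is: pure keyword
        search, no crosswalk between the collections' differing fields."""
        index = defaultdict(set)
        for rec_id, rec in records.items():
            for value in rec.values():
                for word in str(value).lower().split():
                    index[word].add(rec_id)
        return index

    records = {
        "lib-001": {"title": "Butterflies of North America", "creator": "Smith"},
        "nhm-042": {"scientific_name": "Danaus plexippus", "habitat": "milkweed"},
    }
    index = build_index(records)
    print(sorted(index["danaus"]))  # ['nhm-042']
    ```

    A fielded overlay (like the MODS mapping mentioned above) can then be layered on top for whichever elements do map cleanly, without the keyword baseline depending on it.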

  4. Fortunately a lot of our users are becoming accustomed to — and are quite accepting of — fast, dirty, and efficient!

    So why beat ourselves up to provide something that’s just a little less dirty, but not as fast or efficient? Today recall is the expectation and precision comes in the form of relevance ranking.

    Paraphrasing the proverb “When all candles be out, all cats be gray”: “When databases are aggregated, all data be dirty.”

    I’ve been wanting to have this conversation — odd that my Europeana presentation prompted it, but they’ll be facing the same issues on a grand scale.

    Thanks for engaging!

  5. Ricky and Matthew–
    The article and your subsequent exchange have been very interesting. Matthew–I’m very interested in learning more about your cross-collection search tool. I meet regularly with the heads of the campus art museum, natural history museum, and botanic gardens. At lunch last week, we were talking about just such a tool. I agree with Ricky that our users have certainly become accepting of fast, dirty, and efficient. At home, I see it in Clare and Julia and, Matthew, I’m sure that you must see it in Kate, Hannah, and Will.

    I also have to say that this conversation reminds me of all the work Annie did on an IMLS grant to come up with a multicultural, multi-language (?) metadata schema for the performing arts (www.glopac.org).

    Thanks to you both for a stimulating end to a long week!
    –Paul

Comments are closed.