Archive for July, 2008

It’s still: Think Global, Act Local

Thursday, July 31st, 2008 by Günter

For the last 5 years, my friend Mary Elings from the Bancroft Library and I have made a trip to upstate New York in the summer to teach a one-week intensive 3-credit graduate class for the iSchool at Syracuse University (IST677 – Digitization in libraries, archives and museums). Every year, preparing for and teaching the class invites us to pause and reflect on how much has changed in just 12 months in the field of digitizing collections. While the main pillars in the outline of our from-the-cradle-to-the-grave syllabus remain, what we say about each topic changes considerably from year to year. Not surprisingly, we increasingly feel that we need both to impart current practice and to emphasize new thinking in the field that challenges business as usual.

Luckily enough, Mary and I as a tag-team are well poised to take on that challenge. Mary predominantly reflects the local point of view of the professional who has to get things done in the here-and-now, while I predominantly reflect a global point of view of somebody who can afford to think about how things should adapt to the realities of our networked information economy. When we quibble, it isn’t just a disagreement, but an educational moment illustrating the times we live in; when it all comes together, it should read like the old bumper sticker as an exhortation to “Think global, act local,” as Mary pointed out to the students.

What follows is an impressionistic glance at the areas where I see a shift in what we talk about in class – in most instances, these are trends which I think will become more pronounced as we continue to teach the class.

  • Creating digital collections – we used to emphasize high-end digitization using digital camera backs; now we’re starting to talk about the potential advent of mass digitization for rare and unique materials.
  • Describing the collection – five years ago, the biggest wow-factor in this area was showing students that the use of XML allows institutions to repurpose collections in a variety of interfaces and makes content portable; while this still impresses them, we’re now also incorporating ideas about how to better leverage authorities (for example through Terminology Services), and we discuss the role of user contributed content. As in all other areas, we use Shifting Gears [pdf] to point to the balance between what’s commensurate, and what’s overkill.
  • Disclosing the collection – we used to emphasize contributions to LAM community-built aggregations such as statewide digitization programs, licensed resources, subject based aggregations or OAI portals; now we’re starting to spend much more time thinking through the implications of the Flickr Commons or pointers to special collections on Wikipedia – in short, leveraging the infrastructure around us which we (as a community) haven’t built and don’t control. (In revisiting my slides about the community-built aggregations, I realized that a good number of them have had to weather major transitions lately, and some didn’t survive, which indicates that we still don’t know how to sustain those resources.)
  • Using digital collections – we used to emphasize the studies (such as this Wesleyan/NITLE report) which show that faculty by and large don’t teach from licensed or locally built collections; now we are starting to think more about what role the library could play, for example in helping faculty manage their own personal collections.
  • Preserving digital collections – while we still talk quite a bit about the usual suspect acronym soup (OAIS, METS, PREMIS, NISO Z39.87), we’re now also considering issues of the economics of digital preservation (as exemplified by the work of the NSF Blue Ribbon taskforce). We also acknowledge the impact of mass digitization on digital preservation, which leads to a re-thinking along the lines of Oya Rieger’s recommendations in a recent CLIR report, such as Seek Compromise to Balance Preservation and Access Requirements and Reassess Digitization Requirements for Archival Images.

In this way, the class acknowledges that established work processes are likely to continue while we explore new ways of serving our audiences. It is a balancing act that professionals must manage, whether they are new to the library profession or veterans of it. By bringing these issues to our class, we hope to encourage our students to think of libraries, archives and museums as a field rife with possibilities for those with creative minds!

    Benchmarking Network Performance: Measures and Behaviors

    Thursday, July 24th, 2008 by Constance

    Over the last year, Günter and Ricky have been examining models of collaboration across the cultural heritage community, working with RLG partners to identify the most significant obstacles and incentives to effective library-archive-museum partnerships. This is part of a program of work exploring cross-domain convergence in organizational structures and service requirements. Günter has reported on some of this work here (May ’08), here (November ’07) and here (July ’07). A final report from this project is expected soon. It should help us to understand how cooperation contributes to improved institutional performance, including greater discoverability of collections, better integration of functions, and increased operational efficiencies. All of these objectives are important to research institutions that want to participate fully in the network information economy.

    In a related vein, Dennis recently queried members of the long-running SHARES inter-lending partnership about the hallmarks of ‘high-performance’ sharing partners. A new Working Group on High Performing Lenders has drafted a survey to

    learn what qualities [SHARES participants] value in a supplier and which … partners consistently display those characteristics

    I was interested to see that the criteria under consideration include some social behaviors and delivery options that respond to expectations that have been shaped (or at least sharpened) by the larger network environment. These include [with my annotations]:

    • Quality of scanning/copying [surrogates should meet or exceed quality of original] 
    • Willingness to supply rare or hard-to-find materials [if content is discoverable, it is assumed to be available]
    • Quality of holdings data in WorldCat [supply chains rely on accurate and reliable disclosure]

    The library’s ability to meet end-user expectations is dependent upon the performance of its service providers — including its inter-lending partners. This is increasingly true in an environment where ‘local’ holdings are assumed to be continuous with the collective collection of library content. The larger information network imposes a set of collaborative imperatives that cross organizational, institutional and geographic boundaries.

    All this has led me to wonder, as the ARL Library Assessment Conference (Seattle, 4-7 August) approaches, about the kinds of metrics we use to assess the value and performance of cultural heritage institutions, especially those that serve the research community. My colleague John MacColl attended a recent meeting of representatives from some of the organizations involved with establishing and monitoring research library metrics, including ARL (123 institutions in North America), SCONUL (172 institutions in the UK) and CAUL (41 institutions in Australia). There is evidently some interest in harmonizing these measures so that research libraries can begin to benchmark their collections and services against global indices. This suggests that libraries are seeking to situate themselves in an expanding network of information service providers in which performance standards reflect common operational requirements and social norms.

    There are a couple of measures in the current array of library performance indicators that can be used to assess inter-institutional cooperation on a system-wide scale. Resource sharing statistics provide a gauge of network participation and institutional co-dependence. Annual expenditures on shared infrastructure are another useful index of collaboration. Since the mid-1990s the National Center for Education Statistics has tracked US library investments in ‘bibliographic utilities, networks and consortia’ – an interesting combination of social and technological support systems – as part of overall library operational expenditures. ARL added this category to its statistical measures in 2004. An acknowledgement, perhaps, that institutional performance and achievement in the research library sector is increasingly dependent upon collaboration.

    In other sectors — notably supply chain management — the tangible benefits of cooperation have been studied more carefully. [References] Some fascinating work has been done to model instruments for measuring collaboration amongst “chain members” to enable reliable benchmarking and performance assessment. [Simatupang, T. M., & Sridharan, R. (2005).] The LAM community might benefit from a similar assessment framework, which acknowledges the multi-dimensional character of cooperation (information sharing, decision synchronization, alignment of incentives) and highlights its real operational value.

    Drycleaning your data

    Tuesday, July 22nd, 2008 by Merrilee

    This blog posting on BoingBoing caught my eye because it invoked both my beloved iTunes and my dreaded iTunes metadata. I am not a cataloger (not by any stretch!) but I am so completely disturbed by the embarrassing disarray of metadata that my iTunes library represents. And it’s not just that I am (in my Virgo way) bothered by the lack of consistency and order. The data does not support some basic functions. Tracks that are labeled “track 1” prevent me from finding the song I want. Tags for compilation albums or classical music often lack data necessary for searching or sorting (is the artist the name of the album or the actual artist? composer or the orchestra? this is treated differently by different people). If I wanted to pull together a funk compilation, I couldn’t do it based on the metadata I have because the genres have been supplied by the wisdom of the crowds and have not been normalized. And the wisdom of Apple dictates that something can only have one genre. This is disappointing to me, because based on my collection, I could put together a funk mix that would knock your socks off.

    Enter, TuneUp.

    …a plug-in for iTunes that cleans up your library’s metadata and grabs the missing album cover art. It takes an “audio fingerprint” of each track and then gets the appropriate data from Gracenote’s Global Media Database. It’ll also let you know if you’re missing any tracks from a particular album…. The company claims they’re averaging a correct rate of 85 to 90 percent. A quick flip through my library makes me think it worked even better than that for the metadata and about that well for album art.

    TuneUp Companion has several other features that I haven’t personally seen in action. It grabs contextual content from various places online. For example, if you’re listening to “Creep” by Radiohead, the “Now Playing” feature will check YouTube for live videos of the song and search for bio info and music news. The Concert feature looks for tour information and can be set to alert you if a band is coming to your town. Gabe told me they’re planning to open up the “Now Playing” API so anyone can create their own contextual content features.

    Much of my iTunes metadata comes from FreeDB (an alternative to Gracenote). Although some of the other features are interesting, I’m most interested in cleaning up my existing data.

    For me though, the Clean feature is the big selling point. TuneUp costs $12 a year or $20 for a lifetime of use.

    Note that the service costs. A small amount, and relative to the amount of time I could spend on this task, a bargain.

    What would be interesting is a cleanup of the data sources themselves. If the data cleanup that I did was pushed back into FreeDB, then there would not be so much data in need of cleanup in the first place. It’s the synchronization of the cleanup that is the interesting and challenging part, but where most of the rewards could be reaped.

    I think about this in terms of other piles of individually created and pooled data, like the bibliographic records in WorldCat. We see a lot of inconsistencies (reflecting changes in practice, changes to the MARC standard, limitations of local systems where data was created, etc.). Could the data in WorldCat be strategically “dry cleaned” for maximum benefit and then returned to local systems?

    Discovery AND Selection = Elsewhere

    Monday, July 21st, 2008 by Jim

    This slide caused the most discussion and comment during my presentation at the AALL workshop about which I posted previously. I return to it here for a few reasons.

    Some of these assertions have attained meme status. In particular I’ve noticed that Roy’s characterization of searching and finding (which he’s been saying since at least 2005 – I’m sure he can tell us the exact date of the coinage) and Lorcan’s dictum about discovery were listened to with some skepticism and resistance only 12 months ago. They are now treated as common knowledge and an accepted starting point for discussions of our issues. This is good for us. It focuses us on change.

    The next two about getting our services and assets into the work flow of the user on the network and about needing to present users with all of our system-wide assets aren’t yet memes but they have entered the vocabulary. Lorcan’s ‘networkflow’ coinage I find helpful and apt in getting at the essence of the way we work and the collective collection phrase (about which Constance has blogged and spoken continuously) neatly and alliteratively captures what people really want to access. I hear other people use these phrases without expecting that they need to be explained. This is progress. These two observations are really about how we should change our services and invest our energy.

    The last assertion about selection is far from a meme. In fact, it may not be true. But it could tell us more than any of the others about where we can choose to disinvest and redirect resources and effort.

    The formulation arose in a group discussion led by my colleague, Arnold Arcolio, while reporting on user testing and interviewing that he was leading in connection with WorldCat Local at the University of California. While he has much analysis to do and considerable discussion yet to come with UC colleagues, one of the preliminary observations emerging is that the test participants overwhelmingly approach the local (or group catalog) with an item already chosen. Using the catalog as a research tool – a place to refine a general interest into a small number of selected ‘best’ items that answer an immediate need – seems to happen very infrequently. In these early interviews the idea seemed quite unusual to the faculty and graduate student users of the catalog.

    During our group discussion this user behavior was capsuled as “Selection takes place without us.” We were intrigued with the potential import for our processes and practices should evidence emerge showing this to be generally true. Our investments in description and classification, in the functionality of the local/group catalog and many other areas could be re-examined and recast. If selection takes place without us then our efforts could be redirected to activities valued by our users and our institutions. I’m interested in spinning out the range of impact but, of course, we need evidence to take this beyond a thought-experiment.

    Biodiversity Collections Index

    Friday, July 18th, 2008 by Günter

    Roger Hyam at the Royal Botanic Garden in Edinburgh just flipped the switch on a beta version of the Biodiversity Collections Index, a website which aims to connect various silos of collections information in the field of natural history.

    The approach chosen is quite interesting: for starters, BCI has harvested a number of large lists of collections from different disciplines, such as Index Herbariorum (IH), Insect and Spider Collections of the World (ISCW) and the BioCASE database. It then applies a Globally Unique Identifier (GUID) to all of the collections so they can be unambiguously cited by researchers, and resolved in a networked environment (for example utilizing a variety of web services offered). All of the data is available through a Creative Commons Attribution 3.0 Unported license. At the backend, to the best of my knowledge the data (currently) resides in a fairly basic version of the Natural Collections Description (NCD) standard, which an RLG Working Group first envisioned. So far, so good. What I find really refreshing about this project:

    Any member of the biodiversity research community can register and contribute to the data held in the index. In addition to this, authoritative data that has been curated by established sources is displayed in a non-editable form alongside the community data. [BCI homepage]

    It’ll be interesting to see whether the biodiversity community embraces this resource, and contributes to the records and the success of the BCI. Another avenue for institutions to contribute large batches of data will be through the NCD toolkit created by ETI BioInformatics in Amsterdam – the BCI website states that a synchronization mechanism between this data capture tool and the aggregation is in the works.

    Congratulations to Roger and all the institutions who’ve signed on to support this effort!

    A Map to Destinations Uncrawled

    Friday, July 18th, 2008 by Roy

    As part of what we’re trying to accomplish in the “Modeling New Service Infrastructures” part of the RLG Work Agenda, I’ve been working on a white paper on best practices for enhancing disclosure of library, museum and archive content at the network level. Using a sitemap to better expose content to web search engines such as Google is one technique that I’m investigating, but it is clear that this technique is by no means a silver bullet.

    For those of you who aren’t familiar with sitemaps, they are a way to expose information for a web crawler that is normally hidden behind a database wall. Since web crawlers cannot intuit all the necessary queries to extract all the data from a database, other techniques must be used to provide the crawler with crawlable URLs. One such technique is to create a sitemap, which is an XML file with all the URLs you wish to have crawled on your site. Google, Yahoo!, and Microsoft have collaborated through the Sitemaps.org site to establish a common protocol for sitemaps.
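For readers who have not seen one, here is a minimal sketch of how such a file might be generated. It assumes only the basic elements of the Sitemaps.org protocol (`urlset`, `url`, `loc` and the optional `lastmod`); the function name `build_sitemap` and the example.org URLs are made up for illustration and are not taken from any of the sites discussed here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (url, lastmod) pairs.

    lastmod is an optional date string such as "2008-07-18";
    pass None to leave it out. Returns the XML as a string.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical database-backed pages a crawler could not guess on its own.
sitemap_xml = build_sitemap([
    ("http://example.org/photos/view?id=1", "2008-07-18"),
    ("http://example.org/photos/view?id=2", None),
])
```

The resulting string would be saved as a file on the web server and registered with the search engines; that registration step is not shown here.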

    I’ve used this for some time now to make sure all the images on my photos site were exposed to crawling. But recently I checked on it in Google’s Webmaster Tools site (where you can register your sitemap to make sure Google finds it), and I discovered to my dismay that only a small percentage (38%) of the URLs had been indexed in Google. On that site they state that “Most sites will not have all of their pages indexed,” but the only clues why are perhaps contained in the “Webmaster Guidelines” that describe what to do or not do to increase your chances of being indexed.

    I asked the subscribers to the Web4Lib discussion what their experience has been. Debbie Campbell of the National Library of Australia (an RLG partner institution) reported that of the over a million items in Picture Australia, Google had indexed only about 49% of the URLs, although that was up dramatically from a couple of weeks earlier, when the number was a paltry 32,000.

    Marshall Breeding of Vanderbilt University and the owner of Library Technology Guides reported a higher percentage of 62% of his URLs indexed. He also reported that he’s doing some testing of the priority setting of potentially uncrawled URLs (based on low page views) in his sitemap to see if that helps.

    Danielle Plumer of the Texas State Library and Archives Commission pointed out an article that discusses related issues, since of course this is a topic of interest to those trying to achieve “search engine optimization”. She also sent along part of a Q&A with Matt Cutts of Google that bears on this issue:

    Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
    A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
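One of the checks suggested in that answer, looking for unindexed URLs that carry "a ton of parameters", is easy to automate. The sketch below is only an illustration: the threshold of two parameters comes from the "1-2 parameters" remark in the answer, while the function names and sample URLs are hypothetical. It says nothing about how Google actually decides what to crawl.

```python
from urllib.parse import parse_qsl, urlsplit


def count_query_params(url):
    """Count the query-string parameters in a URL, including empty ones."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))


def flag_parameter_heavy(urls, max_params=2):
    """Return the URLs that carry more than max_params query parameters."""
    return [u for u in urls if count_query_params(u) > max_params]


# Hypothetical URLs of the kind that might appear in a sitemap.
candidate_urls = [
    "http://example.org/photo/42",
    "http://example.org/view?id=42&size=l",
    "http://example.org/view?id=42&size=l&session=abc&sort=",
]
flagged = flag_parameter_heavy(candidate_urls)
```

URLs that end up in `flagged` would be the first candidates for rewriting into shorter, more static-looking forms.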

    So what can I conclude from this? Not a whole lot yet, except that creating a sitemap and registering it with Google is not by any means a silver bullet. It’s still a good idea, but it should only be one strategy of many we can take to get the unique content that libraries, museums and archives hold in front of the roving eyes of web users. It’s exactly that suite of strategies I’m trying to identify and describe, so please let me know of any methods you’re using that you think may be effective, either as a comment to this post or direct email.

    Have it your way!

    Thursday, July 17th, 2008 by Merrilee

    In RLG Programs we talk a lot about how important it is for libraries, archives, and museums to find ways to get their materials (or descriptions of materials) beyond their catalogs, so the materials/descriptions appear where researchers and other users are. It only seems fair that we try to practice what we preach.

    Why should you have to rummage around on our website, or wait for an email to tell you that papers, podcasts, and webinars are available? In an effort to make our content available as many ways as possible, we’ve made a number of new RSS feeds available. You can choose those that best fit your needs.

    Introducing,

    1. A new RLG Programs feed for updates from RLG Programs.

    2. Our new PARcast feed which includes all podcasts and webinars released in our PARcast series. PARcasts are also submitted to the iTunes store, so you can get them via iTunes as well (search on “OCLC PARcast” or even just “PARcast”).

    3. A new combined OCLC Programs and Research feed includes all news and updates from both RLG Programs and OCLC Research, including PARcast releases.

    These feeds are in addition to an existing OCLC Research feed, which includes updates from OCLC Research.

    The new feeds will also allow HangingTogether to focus more on discussion-like postings, and less on announcement-type postings. So sign up, already!

    Shared Print Collections: the UK (re)invests

    Thursday, July 17th, 2008 by Constance

    Nicola Wright, Information Services Manager at the London School of Economics and a fellow of the RLG Shared Print Working Group, shared some good news this morning: the UK Research Reserve (UKRR) has won a bid for continuing funding from the Higher Education Funding Council for England (HEFCE). The UKRR recently wrapped up a successful 18-month pilot aimed at demonstrating the value and feasibility of a collaborative approach to managing legacy print journal collections. Six higher education institutions in the pilot selected low-use journal titles to be deaccessioned from local collections; a target threshold of distributed duplication was established and, with a durable commitment of access provided by the British Library’s Document Supply Center, participating libraries were able to clear shelves of volumes of material that no longer served immediate local needs.

    Owen Stephens (Imperial College, London) provided an excellent ‘mid-term’ report on the progress of the UKRR back in February, which includes some useful details on the direct costs associated with de-duplication activities – estimated at £26.16 per meter (or about $16 per foot) of open shelf space – and other implementation details of interest. Imperial College was the lead institution amongst the Phase I pilot participants.
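For anyone who wants to check the per-foot figure, the conversion can be sketched as follows. The only inputs are the two numbers quoted above and the definition of a foot as 0.3048 meters; the exchange rate is not stated in the report, so the sketch derives the rate that the dollar figure implies rather than assuming one.

```python
METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot


def cost_per_foot(cost_per_meter):
    """Convert a cost per meter of shelving into a cost per foot."""
    return cost_per_meter * METERS_PER_FOOT


gbp_per_meter = 26.16       # figure quoted from the mid-term report
usd_per_foot_quoted = 16.0  # the "about $16 per foot" figure quoted above

gbp_per_foot = cost_per_foot(gbp_per_meter)
# Exchange rate (dollars per pound) implied by the two quoted figures.
implied_usd_per_gbp = usd_per_foot_quoted / gbp_per_foot

print(f"{gbp_per_foot:.2f} GBP per foot")
print(f"implied rate: {implied_usd_per_gbp:.2f} USD per GBP")
```

In other words, the dollar figure works out if a pound is taken to be worth roughly two dollars.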

    Two key success factors of the UKRR project stand out: the high degree of confidence that faculty and researchers have in the British Library, and the recognition that the costs of local de-duplication are a real obstacle to realizing economies of scale in collaborative print management.

    According to Nicola, the new £10 million award

    will enable the Higher Education libraries to release 100km of shelf space by sharing storage of a reduced number of print copies of journals across the HE network, with a copy of each title at the British Library – with access provided by the BL using desktop delivery of articles. We estimate that this will give a £29 million capital saving to the sector.

    Lorcan wrote about the early days of the UKRR here, noting that “effective management of space is a driver” for new models of library cooperation. Lizanne Payne’s 2007 report on library storage trends in North America makes the case for more collaborative approaches to distributed print preservation. In the absence of a centralized delivery hub like the BL, the UKRR model is unlikely to be replicated in the US — but it still stands as an instructive case study in managing the collective collection.

    Further information about the HEFCE investment in the UKRR is here.

    Networking Personal Collections: Portable Bibliographies

    Wednesday, July 16th, 2008 by Constance

    Like many of my multi-tasking Programs colleagues, I use a variety of online notebooks, bookmarking and citation management tools to keep track of documents and references related to ongoing projects. It’s a source of continual frustration to me that the reference lists and bibliographies that I develop in one space can’t be easily moved to another for re-use in a new context. There are plenty of tools to help me capture data as I move around the Web, but few that enable me to pool or integrate the sources that I’ve squirreled away in different folders and lists. My Google Notebook of shared print policies can’t be swapped into my Zotero library on collaborative collection management; a collection of tagged resources in NINES can’t be pushed into a personal bibliography in WorldCat; an iTunes library of RLG podcasts can’t be merged with a list of publications. These collections live quite near the surface of the Web, but they don’t exhibit the kinds of elective affinities that one might expect in the “social” network environment.

    Of course, lists built with most citation management services can be exported and moved around in a number of standard formats (RIS, BibTeX) — but in a world where research data, services and social practices are moving onto the network, one might wish for a better solution. A solution that would enable students and researchers (and people like me) to create and exchange references and citations for the full range of information objects that we use, support context-aware resolution and delivery services, and allow references to be decoupled from the environment in which they were created and reintegrated into other work products. I want my references to free-associate with “others like this” in ways that will enhance and enrich my discovery experience. I want my references and citations (and yours) to rise to the surface and make themselves known.

    Some standard tools and protocols exist to make this kind of network-aware personal collection management possible. The COinS specification for embedding OpenURL linking (citation) data in Web pages has been around for several years and has already been implemented in a variety of library service environments, from local OPACs to WorldCat record displays. (Lorcan has noted various advances in this area, here, here and here.) In combination with an OpenURL Resolver, it enables Web users to link to library-owned or licensed content. For example, COinS enables Wikipedia users to locate library-owned titles from citations in articles, like this bibliography of W.H. Auden:

    and it also powers Zotero’s seemingly magical ability to lift citation metadata directly out of certain kinds of superficially unstructured Web content, including blog postings:

    As for my dreams of a day when references and bibliographies are freed from the various boxes, drawers, folders and files in which they languish…well, we’re a step closer now that WorldCat lists have implemented a COinS-compliant view of bibliographies built within WorldCat.org. Thanks to some energetic OCLC colleagues, all of the citations in the tens of thousands of public WorldCat lists can now be easily mixed and commingled, set free to move about the Web as resolvable, actionable references. If you’re a Zotero user, you’ll see the change immediately: WorldCat lists now self-identify as content that’s ready for exchange and circulation

    – shiny new COinS of the network realm.
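For the curious, a COinS is simply an empty HTML span with the class `Z3988` whose `title` attribute carries an OpenURL ContextObject as URL-encoded key/value pairs. The sketch below shows roughly how one could be assembled for a book citation. The helper name `coins_span` and the sample values are purely illustrative, and this is not a description of how WorldCat or Zotero build or read these spans.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, author, year, isbn=None):
    """Return a COinS <span> describing a book citation."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),                    # OpenURL version
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),  # book metadata format
        ("rft.btitle", title),
        ("rft.au", author),
        ("rft.date", year),
    ]
    if isbn:
        fields.append(("rft.isbn", isbn))
    # URL-encode the key/value pairs, then escape the "&" separators
    # so the string is safe inside an HTML attribute.
    context_object = urlencode(fields)
    return f'<span class="Z3988" title="{escape(context_object)}"></span>'


# Sample values for illustration only.
span = coins_span("Collected Poems", "Auden, W. H.", "1976")
```

Because the span is empty, it is invisible to a human reader of the page; only tools that look for the `Z3988` class notice it.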

    A wait that’s (almost) over

    Tuesday, July 15th, 2008 by Karen

    As I blogged before, Some things are worth waiting for. And yes, 2008 is indeed the Year of Non-Latin References in LC/NACO Authority Records. The first of the pre-populated name authority records with non-Latin script forms datamined from WorldCat that had previously been visible as “alternate names” in WorldCat Identities have made their debut!

    So the name authority record for Pak Sang-nim now shows both the hanja form of the name, 朴 相林 as well as the hangul form, 박 상림.

    --------------------------------------------------------------------------------------------------------------------------
    LC Control Number: n 2008046884
    HEADING:           Pak, Sang-nim, 1927-
    Used For/See From: Park, Sangrim, 1927-
                       박 상림, 1927-
                       朴  相林, 1927-
    Special Note:      Non-Latin script reference not evaluated
    Found In:          Hongik hwabaek esŏ ch’ajŭn ch’amdoen t’ongil ŭi kil, 2008: cover (朴  相林 = 박 상림 = Pak
                          Sang-nim) added t.p., etc. (b. 1927 in Hamnam Yŏnghŭng; chŏngch’ihak paksa, Kŏn’guktae
                          Taehagwŏn; w., Kungmin Ŭnhaeng; hoejang, Minjok Hwahae Yŏnʾguhoe; Park Sangrim [in rom.])

    --------------------------------------------------------------------------------------------------------------------------

    The prepopulation process has just started and will take some months. Nevertheless, for those of us who have been waiting to see non-Latin script references in authority files for a quarter-century (or more!), the wait is (almost) over! And we can mark the OCLC Programs and Research project to “lead an effort to upgrade the LC/NACO authority file with non-Latin alternate names” completed.