ISBNs in WorldCat

May 23rd, 2013 by Roy

Recently a question came up on the BIBFRAME list about ISBNs, and how many of them were in MARC records. This is just the kind of question that OCLC Research is uniquely placed to answer, so I quickly wrote some simple Perl code to run as a Hadoop streaming job to find out.
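The counting step can be sketched as follows. This is a minimal in-memory illustration in Python, not the Perl code that was actually run; the record layout (a list of `(tag, subfields)` pairs) and the function names are assumptions made for the example. In a Hadoop streaming job, the mapper would emit the per-record count and the reducer would sum the records for each count.

```python
from collections import Counter


def count_020a(record_fields):
    """Count the 020 $a subfields in one record.

    `record_fields` is a list of (tag, subfields) pairs, where `subfields`
    is a list of (code, value) pairs. This layout is assumed for the sketch.
    """
    return sum(
        1
        for tag, subfields in record_fields
        if tag == "020"
        for code, _value in subfields
        if code == "a"
    )


def histogram(records):
    """Map 'ISBNs per record' to (number of records, percent of all records)."""
    counts = Counter(count_020a(fields) for fields in records)
    total = sum(counts.values())
    return {
        per_record: (n, round(100.0 * n / total, 2))
        for per_record, n in counts.items()
    }


# Four made-up records: no 020, two 020 $a, one 020 $a, and an 020 with only $z.
sample_records = [
    [("245", [("a", "A title")])],
    [("020", [("a", "0123456789")]), ("020", [("a", "9780123456786")])],
    [("020", [("a", "0123456789"), ("c", "pbk.")])],
    [("020", [("z", "0000000000")])],
]
result = histogram(sample_records)
```

Only `$a` subfields are counted, so a record whose 020 field carries only a `$z` lands in the zero bucket.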

It was remarkably quick and easy to find out, although I had to edit and re-run the code when I discovered a flaw in my logic. This is, sadly, all too frequently the case. But not too much later I had my result:

  Records     020 $a per Record   Percent of WorldCat
230444194             0                 77.71%
 55668178             2                 18.77%
  4766652             1                  1.61%
  3708352             4                  1.25%
   616623             3                  0.21%
   411230             6                  0.14%
   125715             8                  0.04%
    65796             5                  0.02%
    45304            10                  0.02%
    30155            12                  0.01%

These are all of the occurrences of a 020 $a in WorldCat as of 1 May 2013 [Added for clarification: the prior sentence describes exactly what is being counted. That is, I am not (yet) examining ISBNs for 10-digit vs. 13-digit; therefore, many of the records with 2 ISBNs may in fact simply have both versions].  A few observations:

  • Many items in WorldCat were published before the invention of the ISBN.
  • Many items in WorldCat are not ISBN-appropriate (e.g., unpublished materials).
  • ISBNs are therefore problematic as identifiers except for a narrow slice of materials (mainly printed books since the mid-60s).

A much better identifier for many purposes is, I assert, the OCLC number.

Sex Ratios in Wikidata, Wikipedias, and VIAF

May 13th, 2013 by Max

Last week I wrote about the ‘rope bridge’ between Wikidata and VIAF, and the new research it would afford. Today I bring you a sample of that research. I am investigating the sex associated with different Wikipedia Biography Articles for two reasons. Firstly, the Properties “Sex” and “VIAF” are two of the top 10 most used Wikidata Properties, with Sex tagged on 587,312 items and VIAF on 301,763 (and rising; VIAFbot hasn’t finished scraping all languages yet). VIAF independently records sex per VIAF item, which gives us two comparable datasets. Secondly, after the so-called “Categorygate” piece in the New York Times, I dug into Wikidata’s Sex Property and wanted to shed some light on the model currently in use.

Currently the Wikidata Property for Sex states:

Sex for humans, should be one of male, female, intersex, or the special “unknown” value

Finding this to be a rather rigid view of the world, I started discussing it on the Discussion Page as per protocol. Of note, on the other hand, is that VIAF records “gender,” not “sex.” The current VIAF data model similarly limits values to male, female, or unknown, but a change to a more nuanced model is planned for June. It’s worth remembering that VIAF is populated with data from the many authority files it aggregates. One underlying authority file, which has a more nuanced view on this recording, is the Library of Congress Control Number (LCCN). The LCCN will record many “sexes” for a specific person with accompanying dates of validity. This at least shows that there are better ways of recording sex – if it’s necessary to record it at all – which prompts me to invite your input on the Wikidata Discussion Page about better ways to record sex. With that said, let’s dig into some graphs. (Click to see larger versions.)

Sex Ratios by Language

The method used to produce this visualization is to take all the Wikidata items with Property:Sex and then look at the inter-language link section of each item to see which languages have articles relating to it. Dividing along the lines of language, we can find sex ratios per language. The chart below shows each language with more than 1,000 articles tagged with sex data, sorted by the percentage of Female values.

Wikidata Sex Ratios By Language, Minimum 1000 Items
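The per-language calculation described above can be sketched roughly like this. It is an illustrative Python version, not the script used for the charts; the input shape (one `(sex, languages)` pair per Wikidata item) and the function name are assumptions.

```python
from collections import defaultdict


def female_share_by_language(items, min_items=1000):
    """Return (language, percent female) pairs, sorted by percent female.

    `items` is an iterable of (sex, languages) pairs: the sex value tagged on
    one Wikidata item and the language codes of the articles linked to it.
    Languages with fewer than `min_items` tagged items are left out.
    """
    totals = defaultdict(int)
    females = defaultdict(int)
    for sex, languages in items:
        for lang in languages:
            totals[lang] += 1
            if sex == "female":
                females[lang] += 1
    shares = {
        lang: 100.0 * females[lang] / n
        for lang, n in totals.items()
        if n >= min_items
    }
    return sorted(shares.items(), key=lambda pair: pair[1])


# Tiny made-up sample, with the threshold lowered so something is returned.
sample_items = [
    ("female", ["en", "de"]),
    ("male", ["en"]),
    ("male", ["en", "fr"]),
    ("female", ["de"]),
]
ranked = female_share_by_language(sample_items, min_items=2)
```

An item counts once for every language it is linked to, so a widely translated biography weighs on many languages at once.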

If you’re not well versed in Wikidata’s use of language codes, you can look them up. And if you’ve never browsed htwiki and tlwiki, the Haitian and Tagalog Wikipedias that come first and last here, then you can peruse the chart limited to languages with a minimum of 10,000 Items with Sex Data.

Wikidata Sex Ratios By Language, Minimum 10000 Items

Two notable things arise here. Firstly, Chinese Wikipedia is seemingly the most progressive. Secondly, the Intersex category fails to score a single pixel of recognition. In fact, the Wikipedia with the highest ratio of Intersex values – as determined by Wikidata – is Korean Wikipedia, but at just 0.0078%.

Data Caveats

Is this data reliable? A lot of it was imported from the German and other major Wikipedias. That can be a problem, because for any given Wikipedia language there exist articles that have no linked equivalents in other languages. There may very well be Wikipedias with more or less skewed sex ratios, but they haven’t migrated their sex data to Wikidata, or they have no equivalent article in a language which has migrated its sex data. Let’s see which languages have the most articles associated with sex data, of those above 1,000.

Total Number of Wikidata Items Tagged with Sex

Unsurprisingly, we get a very Western view of the world. But wait, there are other data sources to corroborate against; that was one of the points of VIAFbot importing VIAF IDs into Wikidata. Let’s imagine an enhanced version of Wikidata that uses VIAF sex data in addition to what’s currently tagged, using that VIAF ID bridge. I ran a simulation of such an enhanced version of Wikidata, but before we look at it, let’s understand VIAF’s own biases.

Introducing VIAF

VIAF IDs have gender info derived from National Library files. There’s hope this may give us a different picture because VIAF may be ever so slightly less severe in its skew, although looking at its list of contributors also reveals a Western bias. Of ~24 million VIAF records (not all about people), 1,299,396 have gender “male,” and 418,394 have gender “female.” This comes out to 24.36% female among the records with a recorded gender. (Unfortunately VIAF doesn’t note directly where LCCN has a more nuanced view, but it can be determined by crawling the RDF link to LCCN’s MARC XML, which I explain later.) Now, to compare the Wikidata and VIAF-enhanced-Wikidata sex ratios, we overlay the two graphs. Wherever you see light green, Wikidata’s data alone gave a higher female ratio, and where you see red, VIAF-enhanced-Wikidata data gives a higher female ratio.

Comparison of Wikidata Sex Ratios with and Without VIAF by Language, Minimum 1000 Items with sex

Reassuringly, VIAF and Wikidata only disagreed on 0.0024% of 91,406 matches. There were seven cases where LCCN did record multiple sexes with qualifying dates. Furthermore, there are 52,407 cases where VIAF has Sex data but Wikidata does not. This might be a good juncture to import that data, if the Wikidata community wants.
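A comparison of this kind can be sketched as below. This is a hypothetical Python illustration, not the script used for the figures above; the two dictionaries keyed by item ID and the result keys are assumptions.

```python
def compare_sex_data(wikidata_sex, viaf_gender):
    """Compare sex values for items that are linked to a VIAF record.

    Both arguments map an item ID to a value such as "male" or "female",
    or to None when nothing is recorded.
    """
    compared = disagreements = fillable = 0
    for item_id, viaf_value in viaf_gender.items():
        if viaf_value is None:
            continue  # VIAF has nothing to offer for this item
        wikidata_value = wikidata_sex.get(item_id)
        if wikidata_value is None:
            fillable += 1  # VIAF could supply a value Wikidata lacks
        else:
            compared += 1
            if wikidata_value != viaf_value:
                disagreements += 1
    return {
        "compared": compared,
        "disagreements": disagreements,
        "fillable": fillable,
    }


# Made-up item IDs and values.
wikidata_sample = {"item1": "female", "item2": "male", "item3": None}
viaf_sample = {"item1": "female", "item2": "female", "item3": "male", "item4": None}
summary = compare_sex_data(wikidata_sample, viaf_sample)
```

The `fillable` count corresponds to the cases where VIAF has a value and Wikidata does not, which are the candidates for an import.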

Conclusions

There are articles in Wikidata which are not currently tagged with sex information, but whose sex information can be programmatically determined. There is some indication that tagging more articles would tend to produce more even sex ratios in Wikidata. If that were true, it would mean that “male” articles are more likely to be associated with sex data, though we cannot be positive about that claim. Finally, recall that Wikidata’s data model for sex could also use some attention, and you, the community, are the instruments for that.

Software used

I wrote some simple scripts to crawl Wikidata and compare it with VIAF and LCCN; they are on GitHub. I also modified code from the Wikidata community for parsing dumps, which I plan to contribute back.

Did you find anything confusing? Leave your comments below or find me online. On Twitter I’m @notconfusing.

The Ropebridges: Authority Control in Wikidata

May 9th, 2013 by Max

You may recall that our Wikipedia reciprocal linking robot “VIAFbot” finished adding Authority Control to more than a quarter of a million (English language) Wikipedia articles, but what was the utility? Five months on, that question has been answered. Luckily, and unsurprisingly, other netizens have found additional uses for the Wikipedia -> VIAF links. Unanticipated reuse is the magic of collaborative and open datasets, and four such examples highlight the benefits of Library data in Wikipedia.

First was John Mark Ockerbloom’s Forward To Libraries which proposes “find in a Library” boxes in Wikipedia pages. The idea is compelling: facilitate automatic searches in your preferred library site on the topics of Wikipedia articles — one option utilizes VIAF IDs.

Similar look-up facilities were created by Owen Stephens and Thomas Meehan conducting pointed inquiry at the British Library site and other UK Academic resources. Stephens’ contemporaneous finds authors sharing their birth year with the Wikipedia page in question. Meanwhile Meehan’s bookmarklet will funnel you into relevant pages linked by VIAF at UCL’s Explore, and COPAC.

VIAF connections can also pave the way for new scholarly research. A team from Vienna University of Technology released a paper that visualized Art History networks of Wikipedia, through VIAF IDs, and then ULAN. Here you can see the proportion of Art History subjects in Wikipedia, displayed on two dimensions derived from the ULAN connection: time and nationality.

All of this is to say that VIAF data in English Wikipedia can act as a very good ropebridge that allows for reuse, or recombination. The idea of a ropebridge is apt because the connection is somewhat shaky: at the moment it’s free-text, semi-structured data that can be changed by anybody. But that doesn’t mean that the chasm isn’t being crossed.

Can you spot the weakness in all this collaboration though? We focused our first effort on English Language Wikipedia. The Germans, to their credit, have just as many VIAF IDs in their Wikipedia. The Italians copied the English Language data. However, these separate efforts are not scalable to all 285 Wikipedias, nor do they allow all 285 Wikipedias to collaborate on the language-neutral VIAF Unique Identifiers.

Fortunately there is a solution, and that solution is Wikidata. Wikidata is the first new Wikimedia Project since 2006, and will do three things. It will organize inter-language links into a central database (inter-language linking before was arduous and asymmetric). It will provide a central store of Semantic Data from the Wikipedia articles. And in the future it will be possible to query that semantic data. Want to know more about Wikidata? Then look up Wikidata on Wikidata (obviously?!).

    Now for a surprise – I’ve just finished migrating English Wikipedia’s VIAF data to Wikidata, and German, French, Italian, and Japanese datasets are in progress. (Code on GitHub.) It takes about two weeks to inspect, clean, and copy the data over from each Wikipedia. I’ll post a full statistical breakdown once all the languages have finished. For now I’ll just say that the Wikidata VIAFbot is also migrating LCCN, GND, BNF, and SUDOC Identifiers, as well as integrating ISNI IDs for the first time. At the time of this writing it records 750,000 edits and counting.

    What does VIAF in Wikidata look like, you ask? All pages about encyclopedic concepts are known as “Items” in Wikidata parlance, so let’s inspect the item for Germaine Greer.

    [Screenshot: claims on the Wikidata item for Germaine Greer]

    We first see all the Semantic Data Wikidata has about this topic. Each modicum of data is known as a “Claim” in Wikidata. A Claim is a triple, structured as [this page] [property] [value]. You can see that [Germaine Greer] [GND (read: “is a” according to the German National Library)] [Person], and that [Germaine Greer] [is of sex] [female]. You can also see here that she’s got a lot of identifiers associated with her thanks to VIAFbot, which has sourced where it found the original VIAF ID. Now let’s turn our attention to the bottom of the page to understand the impact.

    [Screenshot: inter-language links on the same Wikidata item]

    This Wikidata page is associated with articles in 48 other languages. Each of those articles can capitalize on the semantic data stored above. That’s the beauty of Wikidata. It means that all of the data reuse cases that previously only worked for the English language Wikipedia will now work for all of them. Austrian researchers can inspect Art History biases of not just English Wikipedia, but of dansk, Ελληνικά, हिन्दी, interlingua, Runa Simi, 中文, etc.  That’s one of the starting reasons why it’s important to have Authority Control in Wikidata. There are of course more directions than one to travel across a ropebridge. Leading data-mules of bibliographic information across from VIAF into Wikidata is next.
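The item structure described above (claims as triples, plus inter-language links) can be modelled very loosely as follows. This is a toy Python sketch and not Wikidata’s actual data model or API; the class names, property labels, and the placeholder identifier are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """One triple: [this page] [property] [value]."""
    item: str
    prop: str
    value: str


@dataclass
class Item:
    label: str
    claims: list = field(default_factory=list)
    # language code -> article title in that language's Wikipedia
    sitelinks: dict = field(default_factory=dict)

    def add_claim(self, prop, value):
        self.claims.append(Claim(self.label, prop, value))

    def values(self, prop):
        return [claim.value for claim in self.claims if claim.prop == prop]


greer = Item("Germaine Greer")
greer.add_claim("sex", "female")
greer.add_claim("VIAF", "VIAF-ID-PLACEHOLDER")  # placeholder, not a real ID
greer.sitelinks.update({"en": "Germaine Greer", "de": "Germaine Greer"})

# Every linked language edition reads the same central claims.
identifiers_seen_by = {lang: greer.values("VIAF") for lang in greer.sitelinks}
```

Because the claims live on the item rather than in any one article, adding an identifier once makes it visible to every linked language.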

    Concentration, Diffusion, Centers & Flows

    April 30th, 2013 by Constance

    Our 2012 mega-regions analysis revealed a notable feature of the library landscape in North America:  the apparent  scarcity of print inventory varies significantly depending upon the scale at which it is assessed.  It stands to reason, of course, that a title held by a single library in one locale may be held by many libraries in other places.  What is more surprising is that scarcity – or what is more appropriately termed diffusion – is a characteristic that persists even at the scale of the mega-region.  In every one of the 12 regions we examined, more than 75% of the print book titles are held by five or fewer libraries.  Yet, comparing one mega-region to another, we found a high level of bilateral duplication:  for example, more than 70% of the print book publications held in Cascadia – a mega-region encompassing urban centers in Oregon, Washington, and British Columbia – are duplicated by library holdings in the NorCal region.

    Bi-lateral duplication of print books in Cascadia and NorCal
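A duplication figure like the one quoted for Cascadia and NorCal can be thought of as a set overlap. The snippet below is a minimal Python sketch with made-up title identifiers; it is not the method used in the mega-regions analysis, and the function name is an assumption.

```python
def duplication_share(region_a_titles, region_b_titles):
    """Percent of region A's distinct titles that are also held in region B."""
    a_titles = set(region_a_titles)
    if not a_titles:
        return 0.0
    return 100.0 * len(a_titles & set(region_b_titles)) / len(a_titles)


# Made-up title identifiers for two regions.
cascadia = {"t1", "t2", "t3", "t4"}
norcal = {"t2", "t3", "t4", "t5", "t6", "t7"}

cascadia_in_norcal = duplication_share(cascadia, norcal)
norcal_in_cascadia = duplication_share(norcal, cascadia)
```

The measure is not symmetric, which is why a bilateral comparison reports one figure for each direction.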

    Now it may be that the high degree of integration that is a primary characteristic of mega-regions is also a factor in the diffusion of library resources within those same regions.  Arguably, the strong networks of exchange and robust logistics infrastructure of mega-regions help to explain why we find such a low level of redundancy in library collections within those regions.  Even without coordinated collection development plans, it may be that the ease with which resources (including library books) flow within mega-regions exercises some influence on library acquisitions.  The incentive to acquire ‘just in case’ inventory will be less in a region where inter-lending networks are strong and it is relatively easy to obtain copies from neighboring institutions, and this confidence in regional supply options may have a sort of invisible-hand effect that constrains redundant acquisitions.

    This would suggest that the high degree of diffusion we see in regional collections – inventory distributed across a number of geographically distant institutions – is a characteristic of a ‘well-organized’ (though not deliberately engineered) library system.  By contrast, library collections in institutions located outside of mega-regions tend to exhibit a higher degree of redundancy.  This is partly a reflection of the large number of libraries that fall outside of the defined mega-regions – more than nine thousand individual OCLC institution symbols – but it seems likely that greater redundancy is needed to support demand in regions where ‘flows’ may be less efficient.   Compare the average library holdings per title for collections held outside of US mega-regions (about 14) to the ratio of holdings per title within mega-regions, which ranges from 2 to 9, for the Phoenix metro area and Chi-Pitts respectively.
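The two measures used in this discussion, average holdings per title and the share of titles held by only a few libraries, can be sketched as follows. This is an illustrative Python example with invented data; the input shape (title identifier mapped to the set of holding libraries) and the function names are assumptions.

```python
def holdings_per_title(holdings):
    """Average number of holding libraries per distinct title."""
    if not holdings:
        return 0.0
    return sum(len(libraries) for libraries in holdings.values()) / len(holdings)


def scarce_share(holdings, max_holders=5):
    """Percent of titles held by `max_holders` or fewer libraries."""
    if not holdings:
        return 0.0
    scarce = sum(1 for libraries in holdings.values() if len(libraries) <= max_holders)
    return 100.0 * scarce / len(holdings)


# Invented region: four titles and the library symbols that hold each one.
region_holdings = {
    "title-1": {"AAA"},
    "title-2": {"AAA", "BBB"},
    "title-3": {"AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"},
    "title-4": {"CCC", "DDD"},
}
average = holdings_per_title(region_holdings)
share = scarce_share(region_holdings)
```

A lower average means less redundancy: more of the region’s inventory exists in only one or two places.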

    Of course, there are other factors at play:  as a recent New York Times article showed, the geographic distribution of major research universities (and the libraries that serve them) is uneven – and regions with fewer research libraries will have a smaller concentration of rare or unique materials, compared to regions with many research intensive universities.  Not surprisingly, the BosWash region, which encompasses a substantial part of the ARL membership, has a relatively low level of redundancy in print book holdings (about 7 holdings per title) simply because the area is home to many institutions with large collections of rare or distinctive materials.

    ARL Membership Map (2013)

    Conversely, in areas with a high density of public libraries — which generally hold large collections of popular classics and best-selling titles — we see higher levels of overall redundancy in collections.  So one cannot infer that low levels of redundancy in library holdings across any region, whether organized against the mega-regions framework or anything else, are a reliable indicator of strong or weak flows.  Other regional factors, like the distribution of research universities and public libraries are clearly important.  It is interesting then to consider how the flow of library resources across regions may contribute to the organization of the library system as a whole.   Regional infrastructure will affect flows — but flows will also shape infrastructure.  Think, for example, of how the growth of rapid transit networks has transformed urban landscapes, encouraging the emergence of the sprawling metro areas that anchor mega-regions.

    Lorcan Dempsey sometimes speaks of ‘library logistics’ and it is in this context that I have been thinking about the flow of library resources and more especially about the emergence of new hubs around which the library system is now being reconfigured.  Thom Hickey’s recent experiments in programmatically identifying concentrations of material related to a particular topic or identity — what we’ve referred to as ‘centers data’ – provide a new way to think about how the library system is organized and how it is changing.  It’s not clear if the existing centers reflect intentionally cultivated strengths in institutional holdings, or if they are merely accidents of history – an unsolicited donation of materials about a particular person, for instance.

    In some cases, the association with known institutional centers of excellence seems evident:  it is not surprising, for example, to find that the University of Pittsburgh has the largest collection of material by or about Gonzalo Rojas, a celebrated Chilean poet.  Pitt has been a National Resource Center on Latin America for decades and it stands to reason they hold significant collections of Latin American literature.  Would an expert in the field have predicted that Pitt, rather than the University of Texas (a distinguished center of Latin American studies), has the most comprehensive collection related to Rojas?  Perhaps – I don’t have the domain knowledge to have an informed opinion about the likely location of the most comprehensive Latin American poetry collections in North America. Significantly, though, Pitt ranks within the top collections (by size) of works by related poets:

    Gonzalo Rojas and related identities – University of Pittsburgh holdings ranked against other WorldCat libraries

    As the figures here suggest, it is possible for a library to be a leading (or even the top-ranked) ‘center’ of resources related to a particular identity or topic without holding a vast number of titles.  This will obviously be true when the relevant oeuvre is limited:  to hold 100% of a small published record (a handful of titles, let’s say) is still significant.  What is more interesting is that the diffusion of library resources, the ‘scarcity’ that we find in institutional and regional collections, effectively lowers the threshold for what constitutes excellence in institutional holdings.  In the case of Gonzalo Rojas, for example, Pitt’s 48 titles amount to less than 40% of the published works associated with the related VIAF heading.  Even so, this small collection is 75% larger (and 16% more comprehensive) than the related holdings at the Biblioteca Nacional de Chile, at least as they are reflected in WorldCat.  This is, at least to me, somewhat surprising.

    This raises the question of how centers or hubs are revealed in the information environment.  Effective disclosure of collections that are distinctive not because of their rarity but because of their ‘excellence’ or completeness will be important if libraries are to be recognized as preferred hubs in the larger supply chain, where commercial providers are still dominant.  Ideally, one would like to have relevant library suppliers revealed in the network at the point of need, in the flow of the researcher’s work – which is increasingly likely to be outside the library discovery environment.  How could this be done?  Where do library fulfillment options fit in the knowledge graph? Somewhere in the Wikipedia infobox?   In Google’s info cards?  Happily, greater minds than mine are working on this problem.  Of one thing, at least, I’m certain:  understanding and representing the relationships between identities and topics in institutional and in regional collections – understanding how different ‘centers’ are related – will lead to new insights about how the library system is, and will be, organized.

    We Want to Send You to SemTechBiz

    April 29th, 2013 by Roy

    SemTechBiz is a major conference for those who are using semantic web technologies like linked data, RDF, Schema.org, etc. It is being held June 2-5 in San Francisco and OCLC and LITA have teamed up to send a librarian there to share the good work that libraries are doing to produce and consume linked data.

    We will pay the expenses of the selected individual to attend the conference where they will also be afforded a lightning talk slot to highlight their work for conference attendees. This is the first “Library Spotlight on Innovation” that we jointly developed with SemanticWeb.com, the producers of the conference. Richard Wallis, our Linked Data Evangelist, was instrumental in putting this together.

    So are you doing something interesting with linked data? Or do you know of someone who is? If so, you can nominate yourself or someone else for this great opportunity. We want the broader world to know about how libraries are innovating with linked data.

    MOOCs and Libraries: Next Steps?

    April 19th, 2013 by Merrilee

    [On March 18th and 19th, 2013 OCLC and the University of Pennsylvania Libraries held a forum on MOOCs and Libraries. This is the sixth in a short series of postings on that event. You can read other postings on this topic in the archives, and check out all of the deliverables on the event page.]

    For our MOOCs and Libraries event, it was important to come away with concrete opportunities for librarians. Hopefully, now that we have a cohort of attendees (in person attendees, remote attendees, and those of you who have watched the videos, reviewed the Twitter stream, and read these summaries!), there are some positive and meaningful ways that librarians can engage with MOOCs. To help with this, at the end of the day on both Monday and Tuesday, my colleague Chrystie Hill led us in small group discussions. (We also tried to include the remote audience in the discussions, with mixed results.)

    The questions for discussion were:

  • What have you learned here today?
  • What are the implications for your library?
  • What should you or your organization do next?
  • What are the key strategic moves that libraries should make in regards to MOOCs?

    On the last point, the small groups were asked to come up with their top three recommendations. Then as a whole, we heard all the “top three” from each table. Not surprisingly, there was quite a bit of overlap, and my colleague Dale Musselman nicely transcribed and organized the outcomes into 9 rough categories.

  • Get the library involved
  • Start talking/collaborating/sharing between libraries
  • Take MOOCs
  • Get in front of licensing and access
  • Create MOOCs
  • Support MOOC faculty
  • Support MOOC students
  • Create in-person support opportunities
  • Re-assess library assumptions and practices

    Of these, from my perspective, the things that every librarian can do are to take a MOOC, and to contribute to the conversation by listening to others who have been involved in MOOCs and sharing information and experiences.

    My thanks to Chrystie for structuring and facilitating these sessions, and to Dale for helping to organize the outcomes document.

    You can take a look at the summary document and also the individual recommendations as contributed.

    You can also watch the video for even more detail.

    Subsidence and uplift – the library landscape

    April 18th, 2013 by Constance

    Approximate location of maximum subsidence in the United States.
    Source: http://en.wikipedia.org/wiki/File:Gwsanjoaquin.jpg

    There’s been a lot of attention to geologic subsidence of late, what with all the sinkholes opening up in Florida, Louisiana and other places. Here in California, we are more often concerned with the gradual change in ground level due to the draining of aquifers that support large-scale farming.  From year to year, the difference in ground level may be nearly imperceptible but over the space of a few decades the landscape has been radically transformed.

    The subsidence metaphor was on my mind recently, as I was looking over some data compiled by my colleague Thom Hickey, documenting the usage of headings (subjects and names) in WorldCat. OCLC Research has done quite a lot of work exploring new approaches to managing subject and name authorities, notably in VIAF and FAST. I was interested to see how Thom’s data might be used to measure change — uplift and subsidence — in the library landscape. By computing the frequency with which FAST and VIAF headings occur in institutional collections cataloged in WorldCat, one can identify which libraries hold the most materials related to particular topics, places and people.  And this in turn provides a measure of the relative distinctiveness of library collections, judged not in terms of the ‘rarity’ of holdings but rather by the concentration of related content.
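The idea of ranking libraries by how often a heading occurs in their holdings can be sketched as below. This is a hypothetical Python illustration and not Thom’s actual ‘centers’ code; the input (one `(library, heading)` pair per held title carrying that heading), the library symbols, and the function name are assumptions.

```python
from collections import Counter, defaultdict


def rank_centers(holdings):
    """For each heading, list (library, title count) pairs, largest first.

    `holdings` is an iterable of (library, heading) pairs, one pair for each
    title a library holds that carries the given FAST or VIAF heading.
    """
    per_heading = defaultdict(Counter)
    for library, heading in holdings:
        per_heading[heading][library] += 1
    return {
        heading: counts.most_common()
        for heading, counts in per_heading.items()
    }


# Invented holdings using placeholder library symbols.
sample_holdings = [
    ("LIB-A", "Marine biology"),
    ("LIB-A", "Marine biology"),
    ("LIB-A", "Marine biology"),
    ("LIB-B", "Marine biology"),
    ("LIB-B", "Russian periodicals"),
    ("LIB-B", "Russian periodicals"),
    ("LIB-C", "Russian periodicals"),
]
centers = rank_centers(sample_holdings)
```

The top entry for each heading is the candidate ‘center’ for that topic or identity, judged by concentration rather than rarity.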

    It seemed to me that Thom’s data might have something interesting to say about how the emergence of large-scale digitized book aggregations (HathiTrust, Google Books, etc.) is altering the library environment.  It stands to reason that as these large hubs begin to consolidate content sourced from libraries (and, in Google’s case, publishers), they will displace traditional library ‘centers of excellence’ in some subject areas.  Those who remember the DLF Aquifer project will recall that the initial prototype was designed to pool digitized resources in a given subject area (initially American History, later narrowed to Abraham Lincoln and the US Civil War).  In the very large aggregations of HathiTrust and Google Books, subject specialization has emerged more gradually.  There has not been much public attention to measuring the scope of subject-based collections within those aggregations, nor to benchmarking them against existing institutional holdings.*

    The FAST and VIAF centers data provide evidence of both subsidence and uplift in the current collections environment — that is, shifts in centers of excellence as measured by scope of subject based holdings.  The ‘re-leveling’ that has been wrought in just a few years of large-scale digitization is already significant.  Digital aggregations have, by design or accident, emerged as important subject repositories that rival and even outrank some of the largest institutional libraries in WorldCat.

    For instance,  HathiTrust, an organization not yet five years old, already holds the greatest concentration of titles on the topic of marine biology, surpassing the Library of Congress as well as two major research universities with world-class oceanography programs.

    Marine biology

    In the case of Marine biology, the difference between the number of titles held by HathiTrust and the Library of Congress is not very large — fewer than 200 titles.  But in other instances, the relative subsidence of traditional centers of excellence is more dramatic.  For instance, Google Books substantially outranks several major research libraries in holdings related to Russian periodicals (journals, newspapers and the like).

    Russian periodicals

    This represents an important change in the library system, with monumental old hubs being progressively overshadowed by new collections that are produced not by the slow accretion of library acquisitions but by large-scale digitization and (re)aggregation.  It provides a compelling illustration of how Web-scale content aggregations are altering the library operating environment.  In the case of HathiTrust especially, this disruption can (and I think should) be seen as a positive change:  it enables libraries to rethink traditional, institution-scale collection management and stewardship — a topic we examined in our Cloud-sourcing Research Collections report some years ago.

    Using Thom’s ‘centers’ data, we can identify hundreds of topics and identities for which HathiTrust offers better coverage than any other library in WorldCat.  Here are a few topics in which the Digital Library distinguishes itself:

    Hathi top topics

    And a few of the personal names for which its coverage is unrivaled:

    Hathi top names

    Interestingly, the other top-ranked collections (by size) for these same subjects and identities are not always the source of HathiTrust’s richness.  One might have anticipated that Hathi’s leadership was simply a by-product of aggregating content from existing centers of excellence, but in fact Hathi has developed unexpected strengths by aggregating at a very large scale from a diverse pool of contributors.  For example, Harvard University and the University of Michigan each hold sizable collections of works by the poet Jean Ingelow; yet, the richness of Hathi’s Ingelow collection is mostly due to contributions from campus libraries in the University of California system.

    The FAST and VIAF ‘centers’ data provide a fascinating new vantage point on the changing collections landscape.  We’ll be looking at ways to integrate it into ongoing research projects, including the mega-regions work, where we hope it can help us detect regional collecting trends that might inform shared stewardship priorities.

    *Note:  HathiTrust provides nice visualizations and a list of subject areas in the Digital Library, based on Library of Congress classification numbers.  These provide a good overview of subject-based coverage but without reference to comparable coverage in other libraries. It is generally known that Google is selective with respect to identifying library partners, but I’m not aware of any public documentation related to a specific collection development strategy. Their aim, famously, is to provide comprehensive coverage of the world’s books, not to develop excellence in any given subject area.

    MOOCs and Libraries: Who Are the Masses? A View of the Audience

    April 17th, 2013 by Merrilee

    [This is the fifth in a short series on the forum on MOOCs and Libraries held by OCLC and the University of Pennsylvania Libraries, March 18th and 19th, 2013. Look back to the archives for earlier postings on this topic.]

    MOOC Audiences by *s@lly*, on Flickr, cc-by-nc

    We wrapped up the content portion of our meeting by reflecting on the audience for MOOCs (or what we know about participants), and also considering the audience through the lens of public libraries (which I admit, I don’t think about a lot of the time, except when I’m acting as a patron). I find the role (or potential role) for public libraries in MOOCs to be very exciting, and I think you’ll see why if you read the summary or watch the video of Margaret Todd’s talk.


    We heard first from Howard Lurie (Vice President, Content Development, edX), who said he was from the “other platform.” (Although we tried to balance the program, many of the presenters were from “Coursera” institutions. I don’t think the platform matters all that much when talking about the library’s role in MOOCs, but there you have it.) According to Lurie, MOOCs provide an opportunity to look at learning and pedagogy through the lens of “big data” gathered during course implementation. Even with “low” completion rates the numbers are still quite high. Taking one edX class as an example (6.002x, Circuits and Electronics): 154,763 registered for the class; 26,349 tried the first problem set; 10,547 took the midterm; 9,318 passed the midterm; 8,240 took the final exam; 7,157 received certification. This is a lot of data to analyze that could help improve teaching; it would take many years of iterating a class in a traditional setting to get to those numbers. (And of course, Audrey Watters would ask, as she did in an excellent talk at WebWise, “Whose educational data is it?”) As with many presentations on MOOCs I’ve heard recently, Lurie highlighted “global stories” reflecting exposure to a topic for those who might not have had the opportunity otherwise (such as the Pakistani participant who said “this course was the most important experience in my life”). One role for MOOCs might be to gather stories of learners around the world, and help universities identify talent.
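
    To put the 6.002x figures in proportion, here is a minimal sketch that turns the counts quoted above into rates. The stage labels and the function name funnel_rates are my own, chosen for illustration; only the six counts come from Lurie's talk.

```python
# Enrollment funnel for edX 6.002x, using the counts quoted in the post.
# Stage labels are illustrative; only the numbers come from the talk.
funnel = [
    ("registered", 154763),
    ("tried first problem set", 26349),
    ("took midterm", 10547),
    ("passed midterm", 9318),
    ("took final exam", 8240),
    ("received certification", 7157),
]


def funnel_rates(stages):
    """Return (label, count, share of first stage, share of previous stage)."""
    total = stages[0][1]
    rows = []
    prev = total
    for label, count in stages:
        rows.append((label, count, count / total, count / prev))
        prev = count
    return rows


# Print one line per stage: count, share of registrants, share of prior stage.
for label, count, of_total, of_prev in funnel_rates(funnel):
    print(f"{label:<26}{count:>8,}{of_total:>9.1%}{of_prev:>9.1%}")
```

    By this arithmetic, roughly 4.6% of registrants earned certification, but about 27% of those who attempted the first problem set did. Which denominator is the right one is exactly the kind of question the "low completion rate" discussion tends to skip.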

    Next we heard from Deirdre Woods (Interim Executive Director, Open Learning Initiative, University of Pennsylvania) who joked that in this environment, being around for a number of months makes you an old timer. Penn is in the open learning business because it’s a public good, and also because it’s good for Penn: online courses give prospective students a taste of what they might expect from the college before they make a commitment. It’s also a good way to stay in touch with alumni. Woods shared that the faculty who have taught MOOCs acknowledge that it’s a huge undertaking but all say they would do it again. Part of the satisfaction? Faculty members reach more people in one course than in an entire career. A little more about participant demographics: the majority of participants in Penn MOOCs were working in full-time positions. 65% are male (they aren’t sure why). 30% of participants hang in through the duration of the course, but don’t do assignments.

    Finally, we heard from Margaret Donnellan Todd (County Librarian, County of Los Angeles Public Library). Right now, LACoPL is loved and trusted by the community; as evidence of this, county residents recently voted to increase library funding. Not content to rest on its laurels, LACoPL has identified online relevance as an important component of its strategic plan. The library also faces the challenge of addressing an educational shortfall in its community. Right now high school dropout rates are at about 40%, and community college graduation rates are low. Higher education in California (as elsewhere) is increasingly squeezed, and going to community college in order to catch up is no longer an option for all. All of these factors will lead to a decline in the local qualified workforce. With fewer and fewer options, LACoPL has begun to see itself as a center of learning, and is positioning itself to support very practical and real educational needs. Public libraries excel at connecting people to services, partners, and peers. At an academic institution, MOOCs are an extension of existing online presence; in public libraries, MOOCs (and support for those taking MOOCs) may be an extension of their broad public education mission. Todd described how LACoPL has experimented with offering courses through Ed2Go; even with little promotion, these courses have been very popular. What might be possible if public libraries extended their online courses, or worked with material being produced in MOOCs? One desire expressed by participants in MOOCs has been a need for a common space to come together with others taking the same class. Why not the public library as that space?

    N.B. You may have noticed that in these postings, I purposefully am referring to those who take MOOCs as “participants” and not “students.” That’s a purposeful choice on my part. I don’t think we know enough about who is taking MOOCs and why to label them as students yet.


    MOOCs and Libraries: New Opportunities for Librarians

    April 16th, 2013 by Merrilee

    [This is the fourth posting in a short series on the forum on MOOCs and Libraries held by OCLC and the University of Pennsylvania Libraries, March 18th and 19th, 2013.]

    This, alongside the copyright session, was the most meaty in terms of seeing where libraries are currently connecting with MOOCs. As I learned during my investigations, there are a lot of people with opinions about MOOCs and libraries, but not many folks with hands-on experience. This session focused on where library research skills fit into MOOCs, and where that might take us.

    The panel was moderated by Marjorie Hassen (Director of Teaching, Research, and Learning Services, University of Pennsylvania Libraries) with participation by Sarah Bordac (Head, Instructional Design, Brown University), Jennifer Dorner (Head, Instruction and User Services, University of California Berkeley), and Lynne O’Brien (Director of Academic Technology and Instructional Services, Duke University). You can watch the video and / or read my summary of the event below.

    This panel featured perspectives from both Coursera (Duke and Brown) and edX (UC Berkeley) institutions, and from librarians who have been involved with a number of courses (Duke) as well as those who are still preparing for launch day (Brown). For those who have been in the game or on the sidelines, “MOOCs create the perfect storm for new ways of thinking about things,” quipped O’Brien. And if people go to MOOCs to learn, it’s critical for libraries to be involved. The question is, what is the right level of support, and where to invest? As a first step, the pedagogical needs for a course need to be outlined before you can judge what the role of the library is, and where library support makes sense. For example, at Berkeley, courses on math and computer science don’t have library-related learning objectives. A good exercise for those at academic institutions might be to scan the course catalog and ask what library support is currently offered for each course; in an online environment, expect support to be similar. Is the main focus of support given to faculty who are planning the course, or to participants who are taking the course? For those who are taking courses, librarians may serve a role that’s more like an information guide than an information provider.

    Certainly looking anew at teaching creates opportunities for cross-campus teams. At Brown University (and elsewhere), the library is involved in a number of these teams, which positions the library strategically and helps the library act in a “connector” role. At some institutions, such as UC Berkeley, online learning has not been centrally coordinated, which allows for creativity in course development but makes it difficult for the library to get involved.

    Dorner shared information about two library-based edX groups, one studying “content accessibility” (copyright) and another looking at “research skills.” Both groups will issue reports and recommendations, and those reports will be shared.

    Other observations:

  • You can’t fully understand and appreciate any technology unless you use it. In MOOCs, there are two layers of experience: that of a course participant, and that of an administrator on the platform.
  • Another reason to take a MOOC — you can see the degree to which students share information resources among themselves.
  • Additional resources:
    Study of how MOOC participants (in one course) went about finding relevant information resources (via Eleni Zazani). A small sample size, but I think this gives some indications of where students are headed.
    A thorough analysis of a MOOC, the report on Duke’s Bioelectricity course; this is the most thorough reporting out from a course I’ve seen to date.


    MOOCs and Libraries: Production and Pedagogy

    April 15th, 2013 by Merrilee

    [OCLC and the University of Pennsylvania Libraries held a forum on MOOCs and Libraries on March 18th and 19th, 2013. This is the third in a short series on that event.]

    One of the great advantages to partnering with the University of Pennsylvania on this event is that they have been through several rounds of course production, so they know the ropes. And even though this event focused on MOOCs and libraries, we did think it would be good for the audience to learn a little bit about course production. Like most of our attendees, I have no experience with MOOC production (although I have taken three MOOCs — I stop at nothing in my quest to bring you information!). Having had some experience on the student or participant side, it was great to glimpse behind the curtain.

    I’ll summarize this session below, but here’s my advice for learning more if most of what you’ve done is read about MOOCs in the press. Take a class or two (while you do it, try to think about the role of the library in relationship to the learning objectives for the class). And watch this session to learn a little more about the variety of production styles, and what goes into making a MOOC.

    The panel was expertly moderated by Bruce Lenthall (Director of Center for Teaching and Learning) and included participation by Christian Terwiesch (Wharton School Faculty), Jackie Candido (Online Learning & Digital Engagement, School of Arts and Sciences), Amy Bennett (Penn Open Learning), and Anna Delaney (Perelman School of Medicine).

    Before the panel discussion Terwiesch spoke briefly about his experience teaching a Coursera class called “Introduction to Operations Management” that is an adaptation of a course he has been teaching for some time at the Wharton School. From his perspective, the economics of MOOCs are simple: more learning with the same resources. He wants everyone to think about process management principles, in order to make life better, and MOOCs are a great way to do that. (I have to admit having heard him speak passionately about his class, I’ve rashly signed up to take it — maybe some of you will, too?)

    The panel offered advice and perspectives on production, covering some basics. The ideal timeline for production is about six months: build, promote, enroll (although it can be done in less time). MOOCs are more than just a professor in a video frame; they need instructional design. Streamlining course content is critical with MOOCs; it doesn’t work to take an existing class and plug it into a MOOC. Faculty content is the most obvious component but it’s not all. In an online environment, clear written communication is key. Having a good microphone and a way to engage with the students (forum, blog, wiki, but something that will work at scale!) are two very critical components. Be on guard against technical gotchas. Pay attention to small details; remember that once the material is out, it’s out!

    There was an interesting thread around “success”: what are the measurements that tell you when you are there? Returning to a theme from the copyright session, panelists suggested that “it depends!” Part of this is related to goals set by faculty for students, and therefore is a matter of personal style and preferences. Terwiesch suggested that success is changing what you do for the better. The right team (which right now is people doing work on top of their regular job, with no additional funding) is critical for success. Most good team members are described as doing the work because they are dedicated and passionate (and I would add, they probably are not intimidated by experimentation). The question of completion rates as a measure of success came up, and panelists (and others) pushed back on this: it’s not appropriate to assess completion of a MOOC with the same metrics used with traditional classrooms or even with a “traditional online course” (I love this phrase!).

    The panelists also shared what they thought might be key roles for libraries. One area highlighted was organizing and making sense of information contributed by participants; in the Terwiesch course, there was a whole range of user-generated content on process management. The suggestion of this type of curation on a massive scale got pushback from the audience, as did the idea of having embedded librarians (“with thousands of students, would we have enough staff?”). Other ideas seemed more attainable: providing pointers to open resources for faculty, and pointers to online communities and other resources for students (perhaps in a dedicated discussion thread); helping to educate course TAs about resources for students; and helping to structure discussion forums ahead of time (speaking from personal experience, these can be very wild and woolly).

    In summary, all of the panelists conveyed their enthusiasm about MOOCs. Despite relatively low levels of completion, they were energized by the large numbers of highly engaged participants. In the end, it’s not so much about massiveness, but about the human connection and excitement that can be generated, and the community that can be formed.
