The OCLC Research Library Partnership Web Archiving Metadata Working Group (WAM, of course) was launched last January and has been working hard–really hard–ever since. Twenty-five members from Partner libraries and archives have dug in to address the challenge of devising best practices for describing websites–which are, it turns out, very odd critters compared to other types of material for which descriptive standards and guidelines already exist. In addition, user needs and behaviors are quite different from those we’re familiar with.
Our plan at the outset: do an extensive literature review on both user needs and existing metadata practices in the web context, study relevant descriptive standards and institution-specific web archiving metadata guidelines, engage the community along the way to confirm the need for this work and obtain feedback, and, ultimately, issue two reports: the first on user needs and behaviors specific to archived web content, the second outlining best practices for metadata. The heart of the latter will be a set of recommended data elements accompanied by definitions and the types of content that each should contain.
At this juncture we’ve drawn several general conclusions:
- Descriptive standards don’t address the unique characteristics of websites.
- Local metadata guidelines have little in common with each other.
- It’ll therefore be challenging to sort it all out and arrive at recommended best practices that will serve the needs of users of archived websites.
We’ve reviewed nine sets of institution-specific guidelines. The table below shows the most common data elements, some of which are defined very differently from one institution to another. Only three appear in all nine guidelines: creator/contributor, title, and description.
| Collection name/title | Language |
| Creator/contributor | Publisher |
| Date of capture | Rights/access conditions |
| Date of content | Subject |
| Description | Title |
| Genre | URL |
Our basic questions: Which types of content are the most important to include in metadata records describing websites? And which generic data elements should be designated for each of these concepts?
Here are some of the specific issues we’ve come across:
- Website creator/owner: Is this the publisher? Creator? Subject? All three?
- Publisher: Does a website have a publisher? If so, is it the harvesting institution or the creator/owner of the live site?
- Title: Should it be transcribed verbatim from the head of the home page? Or edited to clarify the nature/scope of the site? Should acronyms be spelled out? Should the title begin with, e.g., “Website of the …”?
- Dates: Beginning/end of the site’s existence? Date of capture by a repository? Content? Copyright?
- Extent: How should this be expressed? “1 online resource”? “6.25 Gb”? “approximately 300 websites”?
- Host institution: Is the institution that harvests and hosts the site the repository? Creator? Publisher? Selector?
- Provenance: In the web context, does provenance refer to the site owner? The repository that harvests and hosts the site? Ways in which the site has evolved?
- Appraisal: Does this mean the reason why the site warrants being archived? The collection of a set of sites as named by the harvesting institution? The scope of the parts of the site that were harvested?
- Format: Is it important to be clear that the resource is a website? If so, how best to do this?
- URL: Which URLs should be linked to? Seed? Access? Landing page?
- MARC 21 record type: When coded in the MARC 21 format, should a website be considered a continuing resource? Integrating resource? Electronic resource? Textual publication? Mixed material? Manuscript?
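To make a few of these questions concrete, here is a minimal Python sketch of what a record for an archived website might look like. It is purely illustrative: the class name, the field names, and the sample values are our own assumptions, not recommendations from the working group. The one thing it takes from the analysis above is the check for the three elements that appear in all nine guidelines (creator/contributor, title, and description). Note how the open questions surface as design choices: dates and URLs are split into separate fields, and the publisher is left optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ArchivedWebsiteRecord:
    """Hypothetical record shape; field names are illustrative assumptions."""

    title: str = ""
    description: str = ""
    creators: List[str] = field(default_factory=list)
    # Open question: harvesting institution or owner of the live site?
    publisher: Optional[str] = None
    # Dates are kept apart because capture and content dates differ.
    capture_dates: List[str] = field(default_factory=list)
    content_dates: Optional[Tuple[str, str]] = None
    # Several URLs may apply to one site (seed, access, landing page).
    seed_url: Optional[str] = None
    access_url: Optional[str] = None
    # Free text, since no single way of expressing extent is agreed.
    extent: Optional[str] = None

    def missing_core_elements(self) -> List[str]:
        """Return whichever of the three universally used elements are empty."""
        core = {
            "creator/contributor": bool(self.creators),
            "title": bool(self.title.strip()),
            "description": bool(self.description.strip()),
        }
        return [name for name, present in core.items() if not present]


# Sample values are invented for illustration only.
record = ArchivedWebsiteRecord(
    title="Example Society website",
    description="Archived captures of a hypothetical society's public website.",
    capture_dates=["2016-03-01", "2016-09-01"],
    seed_url="https://example.org/",
    extent="1 online resource",
)
print(record.missing_core_elements())  # ['creator/contributor']
```

Even this small sketch forces decisions that the guidelines we reviewed make differently, which is exactly why a shared set of element definitions is needed.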
We’re getting fairly close to completing our literature review and guidelines analysis, at which point we’ll turn to determining the scope and substance of the best practices report. In addition to defining a set of data elements, it’ll be important to set the problem in context and explain how our analysis has led to the conclusions we draw.
So stay tuned! We’ll be sending out a draft for community review and are hoping to publish both reports within the next six months. In the meantime, please send your own local guidelines, as well as pointers to a few sample records, to me at dooleyj@oclc.org. Help us make sure we get it right!
Jackie Dooley retired from OCLC in 2018. She led OCLC Research projects to inform and improve archives and special collections practice.