Time Estimation for Processing Born-Digital Collections

Making informed and accurate projections of the skills, time, and materials needed for projects is key to responsibly managing the resources and collections entrusted to us. In the realm of archives and special collections, there are good models for estimating time and planning the processing of paper-based archival collections and the cataloging of books, but there aren’t yet clear models for born-digital collections. To support the larger efforts of our Collection Building and Operational Impacts Working Group, we were interested in how people are thinking about time estimation and planning for born-digital work. To learn more, we hosted a series of calls on the topic with colleagues who work with electronic archival records in the RLP community. While our conversations made clear that this is still an evolving area of work, there were consistent experiences across programs that are worth highlighting and identifying for further exploration.

Our conversations included people in a variety of roles. Many hold Digital Archivist positions, with others in a range of roles devoted to the arrangement, description, preservation, and administration of born-digital archival collections. The maturity of institutional programs addressing born-digital collections varied as well. Some participants were filling a newly created Digital Archivist role at their institution and were building the foundations of their program. Others were involved with more robust programs with multiple people and five or more years of effort in this area.

Most programs represented were not doing detailed, ongoing time tracking for born-digital collections work. Several had done some ad hoc or project-specific tracking to help inform tool selection or workflows. Regardless of formal tracking projects, participants had plenty of relevant experience to inform our conversation. We asked three main questions to focus our discussion. A summary of responses to each of these questions, along with some high-level takeaways, is laid out below.

What are the functions that you are tracking? Are there identifiable segments? Who is responsible for these functions?

Accessioning versus processing

The group drew a clear distinction between accessioning and later processing and descriptive work. Accessioning includes actions to transfer files from source media, stabilize them, create baseline metadata and understanding of the content of the collection, do appraisal, and transfer to secure storage. Processing involves doing more thorough descriptive and sometimes arrangement work, reconciling and sensemaking between paper and born-digital in hybrid collections, understanding and enacting access restrictions, and otherwise readying collections for use. Within each are smaller, discrete workflows and tasks.

Active versus passive time

Another important distinction emerged between passive time and active time working on born-digital collections, or machine time versus human time. Work with born-digital materials requires running multiple automated processes and programs. Once started, these require only infrequent human intervention or monitoring. They are distinct from actions that require the sustained attention and professional judgement of an archivist, like problem solving and troubleshooting, making appraisal decisions, or analyzing and synthesizing content and writing descriptive text. Both human and machine capacities will affect the time required to address the needs of a collection, and the two may differ significantly from one another.

Responsibility

Responsibility for born-digital work varied across institutions, but most had a dedicated position that dealt exclusively with born-digital collections. Many of the academic institutions rely heavily on student labor to do much of the detailed work of transferring files from legacy media or imaging disks.

Is there a meaningful unit of measurement that can be used for time estimation purposes, like linear feet can be used for paper collections? Does the idea of “levels of processing” make sense for born-digital?

In the domain of paper-based collections, there has been quite a bit of work in the last decade to think through and specify levels of processing, to aid in assessing the level of effort a specific collection might warrant. These models include an estimate of hours per linear foot required to process at each level. And thus (collection extent) × (hours per linear foot for processing level) has become a simple shorthand for general estimation of processing times. For born-digital collections, the math is not that simple.
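
As a rough illustration of the paper-based shorthand, the calculation can be sketched in a few lines of Python. The level names and hours-per-foot figures here are placeholder assumptions made up for the example, not values from any published model.

```python
# Hypothetical hours-per-linear-foot rates; placeholders for illustration only.
HOURS_PER_LINEAR_FOOT = {
    "minimal": 2.0,
    "moderate": 8.0,
    "intensive": 20.0,
}


def estimate_paper_processing_hours(extent_linear_feet: float, level: str) -> float:
    """Return (collection extent) x (hours per linear foot for the chosen level)."""
    if extent_linear_feet < 0:
        raise ValueError("extent must be non-negative")
    return extent_linear_feet * HOURS_PER_LINEAR_FOOT[level]


# A 25-linear-foot collection at the hypothetical "moderate" rate.
print(estimate_paper_processing_hours(25, "moderate"))  # 200.0
```

The appeal of this model is that it needs only two inputs. The sections that follow describe why born-digital work resists being reduced to a single rate.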

Size doesn’t matter (at least not much)

Extent is a factor, but only one of multiple variables that can impact the time it will take to accession and process a collection. Participants identified other important factors that impact time and effort:

Carrier Types – Transfers from legacy removable media like floppy disks, Zip disks, or optical media take significantly more time than transfers from more contemporary, readily accessible media. The smaller storage capacity of legacy media means many more transfers are needed than when getting the same amount of data off a hard drive or network. And legacy media often require more troubleshooting time and more complicated workflows or specialized equipment.
File Formats – The types of file formats a collection comprises will have considerable impact on the necessary work, for a variety of reasons. Different file types require different processes. For example, PDF documents may require OCR, while Word or other text files won’t require OCR but may require scanning for personally identifiable information (PII). And processes that are required for all file formats, like virus scans, calculating checksums, and transfer to secure storage, will take much longer for some kinds of files than others. Participants specifically called out AV file formats as requiring considerably more time than others.
Homogeneity of File Formats – A body of the same file types means you can automate more work and spend less time decision-making. A great variance in file types across a collection may mean more time spent initiating and executing different processes for different file types and more need for interaction with individual or small numbers of files.
Uncommon File Formats and Carriers – Most programs have capacity to deal with a body of common file formats and carriers that exist in the majority of their collections. When the need to deal with new, uncommon, or especially complex file formats or obscure original media arises, it may require experimenting with existing tools or finding and testing new ones. For example, proprietary file formats that have been created for use in a specific program (proprietary CAD files, Final Cut Pro files, etc.) must be migrated to a more common and open file format so that they can be accessed and preserved. Deciding on an appropriate strategy can require substantial research and often there is no easy way to automate the migration process. This kind of troubleshooting and problem solving can translate to days or weeks of additional effort.
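
One way to picture why extent alone falls short is to estimate human time and machine time separately for each kind of carrier. The sketch below is only an illustration: the carrier names, the per-item minutes, and the flat research allowance for uncommon formats are all invented assumptions, and a real program would substitute figures from its own tracking.

```python
from dataclasses import dataclass


@dataclass
class CarrierRate:
    """Hypothetical per-item minutes; not measured values."""
    active_minutes: float   # human attention: setup, troubleshooting
    passive_minutes: float  # machine time: transfer, virus scan, checksums


# Placeholder rates chosen only to show the shape of the calculation.
RATES = {
    "floppy_disk": CarrierRate(active_minutes=10, passive_minutes=5),
    "optical_disc": CarrierRate(active_minutes=5, passive_minutes=15),
    "hard_drive": CarrierRate(active_minutes=30, passive_minutes=240),
}


def estimate_accessioning_hours(carrier_counts, uncommon_formats=0,
                                research_hours_per_format=8.0):
    """Return (active_hours, passive_hours) for a collection.

    carrier_counts maps a carrier name to the number of items.
    Each uncommon format adds a flat, hypothetical research allowance
    to active (human) time.
    """
    active = passive = 0.0
    for carrier, count in carrier_counts.items():
        rate = RATES[carrier]
        active += count * rate.active_minutes
        passive += count * rate.passive_minutes
    active_hours = active / 60 + uncommon_formats * research_hours_per_format
    return active_hours, passive / 60


# 120 floppies and one hard drive, with one uncommon file format.
active, passive = estimate_accessioning_hours(
    {"floppy_disk": 120, "hard_drive": 1}, uncommon_formats=1
)
print(f"active: {active:.1f} h, passive: {passive:.1f} h")
```

With these made-up rates, the 120 floppies account for most of the human time even though the single hard drive could hold far more data, which mirrors the point that carrier type and format can outweigh size.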

Levels of processing → Levels of effort

Levels of processing are a useful framework for developing a processing approach for born-digital collections. As with analog collections, the amount of work it will take to achieve a particular level of processing depends on the extant level of organization and understanding of a collection when it arrives at the repository. However, levels of processing don’t translate as cleanly to standard time estimates for born-digital collections.

An idea that resonated in our conversations was one of levels of effort, to be considered alongside levels of processing. Levels of processing generally articulate choices about the degree of granularity with which an archivist will perform arrangement and descriptive work. Levels of effort would address the broader range of choices required for born-digital about what actions to take to address preservation, risk mitigation, and access concerns, and consider whether those actions can be automated or require direct human interaction. Colleagues from University of Minnesota shared a report from their Libraries Electronic Records Task Force, which reports on a year’s worth of work on born-digital collections across their repositories, and includes an example of a level of effort framework.

Is there anything about processing born-digital that varies significantly from paper, that impacts the way that you’re thinking about time and effort for processing born-digital?

Accessioning and appraisal

Those who tracked time found, consistently, that they were spending significantly more time on accessioning than on processing activities, and that born-digital requires an expansive approach to accessioning. Because electronic records are relatively fragile, more work to transfer and stabilize these collections must be done at the outset to ensure their existence into the future than we are accustomed to with most paper collections.

A major portion of time spent accessioning is devoted to appraisal. Because born-digital records cannot be as easily examined in situ, less appraisal work often happens before a collection is brought in than is usual in a paper context. This creates a heavier appraisal burden at the point of accessioning, and it can shift the responsibility for appraisal from the curator to the digital archivist. Additionally, appraisal for born-digital requires more mediation, by tools and people. Tools are required to look at collections, both to render a file so it is readable by a human, and to look computationally across a body of files. And humans with the skills to run those tools and understand their outputs must collaborate with the people who are tasked with making curatorial decisions, to make informed assessments about what warrants keeping.

Workflows for appraisal, and especially the collaborative points in the process, are a challenge and still evolving. Many participants on the call generally do not work on a collection until after it has been formally brought in, and expressed a desire to be involved earlier in the process of surveying and assessing potential collections.

The right tool for the job

Work with born-digital collections is tightly tied to the tools a digital archivist has at their disposal, and how well those tools do or do not address the needs of a specific collection. Many of the tools used by digital archivists were made for another purpose or community, like digital forensics for law enforcement, and must be adapted to archival needs. Figuring out the right suite of tools for your collections, staffing, budget, and technical infrastructure takes time and experimentation, as does troubleshooting when something doesn’t go as planned.

Access and risk assessment

Archivists are thinking differently about privacy considerations for born-digital collections. The potential to offer access to these materials online means that researchers could more easily find them, run full-text searches and computational analysis, and generally have quite detailed means of access. When compared to the reading-room-only access of most analog collections (what one participant called “privacy by obscurity”), this opens up greater risk of exposing private, sensitive, or legally protected information. This translates to doing more work up front to identify PII, information protected under FERPA and HIPAA, and other sensitive information.

Born-Digital FOMO

Experienced archivists used to working with paper collections have developed reliable methods to analyze and understand collections in the aggregate. And when in doubt, they are able to open a box, pull a few folders to spot-check the contents, and proceed with their work without having to look at the majority of the paper in a collection. With born-digital, this kind of quick visual check isn’t as easy, and participants spoke of not trusting their instincts in the same way. One participant called it “a kind of born-digital fear of missing out” that can lead to spending more time than needed or warranted opening and looking at individual files. Participants also called out the challenge of the cognitive shift required to move back and forth between the granular, file-level analysis of digital forensics tools and the aggregate-level thinking required for the archival sensemaking work that is core to arrangement and description.

Evolving Programs

Though the original goal of these conversations was a relatively narrow exploration of time estimation for work on born-digital collections, they surfaced considerable insight into the state of programs to address born-digital, where we are having successes and challenges, and where there is more work to be done.

Still experimenting

Even the most mature programs represented in our calls are still figuring many things out. Participants pointed to the value of hands-on experimentation with collections to help them do this. Those who started by developing model workflows found that they required much revision when put into practice. So much is new and uncertain that most participants said they had to work through multiple collections to start to understand what they might need on a programmatic level. Once they had worked through a number of collections, they were able to more realistically categorize work, define workflows, and assign responsibilities.

Figuring out what work is warranted

In addition to understanding what workflows and tools are needed, we are also still developing a sense of the judgement calls involved in this work. With digital forensics tools, we have the capacity to work at an incredibly detailed level. Participants spoke about the need to resist the urge to get into the weeds, and to assess what level of work is warranted, rather than what level of work is possible. A key example of this is the current conversation around the necessity of disk imaging. One participant offered that they have shifted their approach to imaging: “We only disk image when there is a compelling case that this is necessary, otherwise we just logically transfer the files for reasons of efficiency. There isn’t a great enough justification to invest the levels of time, effort, and storage that a disk image requires in all but the most high profile collections where there is actually an evidentiary value inherent to the media we’ve received.” As programs continue to mature, increasingly nuanced professional judgement calls like this are sure to continue to evolve.

Relatedly, participants expressed a desire to better understand the professional judgement required of different workflows so that they could better advocate for appropriate staffing resources and responsibility models to best steward their collections.

Isolation → Integration

A number of institutions remarked that their processes for born-digital are currently quite siloed from work with other formats, even with hybrid collections, and expressed a desire to develop more integrated approaches. Some of the work of born-digital accessioning and processing is hard to figure out because technical services work doesn’t happen in isolation; it depends on curatorial decisions and an understanding of access and use needs. As responsibility for born-digital becomes more distributed and more connected to the rest of our programs, it will become easier to see the big picture and make more systematic decisions.

These conversations were quite valuable, serving our original goal of better understanding time and effort estimation for born-digital processing, and unexpectedly enlightening about the larger picture of the evolution of work in research libraries to address born-digital archival collections. I’m grateful to the participants of our calls for sharing their experience and insight. Do these observations reflect what you are experiencing in your institution? Let us know in the comments.

Chela Scott Weber

Chela Scott Weber is a Senior Program Officer for the OCLC Research Library Partnership, where she focuses on issues related to archives, special, and distinctive collections.