I’m a great fan of the National Library of Australia’s Trove, a single search interface to 122 million resources—books, journals, photos, digitized newspapers, archives, maps, music, videos, Web sites—focused on Australia and Australians. You can search the OCR’d text of over 45 million newspaper articles that have been digitized.
OCR is not perfect, so the original document is juxtaposed with the OCR transcription, making errors immediately apparent. Since the public launch of Australian Historic Newspapers in July 2008*, people have been correcting errors in the OCR’d text. Both the corrected text and the original text are indexed and searchable.
The enthusiasm of these public text correctors is amazing! The 15 March 2011 Trove newsletter notes:
Text correctors are still doing an outstanding job of improving the electronically translated text, and the number of corrections each month continues to increase. In January we had over 2 million lines of text corrected in a month for the first time, which continued through February. The running total of corrected lines has now reached 31 million!
One of the issues the RLG Partners Social Metadata Working Group addressed was how much moderation is needed when opening up the descriptions of cultural heritage resources to user contributions. The responses to the working group’s survey of site managers indicated that spam or “inappropriate behavior” was not a problem. Rose Holley (a member of the working group) offers further corroboration: after a careful review of the comments added to Trove, spam and derogatory comments turned out not to be a big problem.
…recently we made a decision to manually review the 18,000 comments that have been added to newspaper articles and other items in Trove. We found only 114 spam comments with URL that were removed and 71 comments placed by the same user in the same week that breached our terms and conditions (derogatory). These were also removed.
We thought that was very good news and supported our theory that moderation is still not required. We have however added a feature that enables a user to easily report spam via the trove forum.
With only 185 of the 18,000 reviewed comments (about 1 percent) removed, this supports one of the working group’s recommendations: Go ahead! Invite user contributions without worrying about spam or abuse.
* For more details, see Holley, R. (2009) Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers, National Library of Australia, ISBN 9780642276940 http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf.
Karen Smith-Yoshimura, senior program officer, works on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements.