I'm a great fan of the National Library of Australia's Trove, a single search interface to 122 million resources (books, journals, photos, digitized newspapers, archives, maps, music, videos, Web sites) focused on Australia and Australians. You can search the OCR'd text of over 45 million newspaper articles that have been digitized.
OCR is not perfect. The original document is juxtaposed with the OCR transcription so errors are immediately apparent. Since the public launch of Australian Historic Newspapers in July 2008*, people have been correcting errors in the OCR'd text. Both the corrected text and the original text are indexed and searchable.
The enthusiasm of these public text correctors is amazing! The 15 March 2011 Trove newsletter notes:
Text correctors are still doing an outstanding job of improving the electronically translated text, and the number of corrections each month continues to increase. In January we had over 2 million lines of text corrected in a month for the first time, which continued through February. The running total of corrected lines has now reached 31 million!
One of the issues the RLG Partners Social Metadata Working Group addressed was how much moderation is needed when opening up the descriptions of cultural heritage resources to user contributions. The responses to the working group's survey of site managers indicated that spam or "inappropriate behavior" was not a problem. Rose Holley (a member of the working group) corroborates this: after a careful review of comments in Trove, spam and derogatory comments turned out not to be a big problem.
…recently we made a decision to manually review the 18,000 comments that have been added to newspaper articles and other items in Trove. We found only 114 spam comments with URL that were removed and 71 comments placed by the same user in the same week that breached our terms and conditions (derogatory). These were also removed.
We thought that was very good news and supported our theory that moderation is still not required. We have however added a feature that enables a user to easily report spam via the trove forum.
This supports one of the working group's recommendations: Go ahead! Invite user contributions without worrying about spam or abuse.
* For more details, see Holley, R. (2009) Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers, National Library of Australia, ISBN 9780642276940 http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf