Digital Scholarship@Leiden

After the Collaborative Transcription: addressing the challenges of crowdsourced transcription

Leiden University Libraries support transcription of digitised materials in various ways, including methods that enable crowdsourcing. A recent international symposium, and the exchange of experiences it prompted, produced insights that will help Leiden respond to the challenges involved.

Researchers can collect their data in collaboration with other researchers or, more broadly, with "the crowd". At the Centre for Digital Scholarship (CDS) we are happy to help with transcription, and crowdsourced transcription, in research projects whenever we can.

Leiden University Libraries have supported crowdsourcing and transcription in various ways, as have other institutions around the world. In April 2022, Ben Companjen, a Digital Scholarship Librarian in the Use of Digital Data team, participated in an online symposium focusing on various aspects of using crowdsourcing for transcription of digital collections. Aptly called After the Collaborative Transcription, the symposium focused on the question of how such contributions can eventually be handled.

In this post, Ben discusses how Leiden University Libraries have been involved in crowdsourcing and transcription, and shares some of the insights emerging from the symposium.

Transcription support at the Centre for Digital Scholarship

The Centre for Digital Scholarship (CDS) has provided support for transcribing digitised materials for years, together with our colleagues in other parts of the Library (UBL). For example, one group of researchers has used a specially developed SharePoint-based virtual research environment to transcribe North Korean propaganda posters. Others have used the Mirador IIIF viewer to transcribe Abnormal Hieratic writing at the level of words.

A few years ago we installed FromThePage, an open-source platform for transcribing documents. Whilst officially we are still testing it, the platform has been very stable.

FromThePage allows transcription (or OCR correction) projects to be open to all users, or only to selected users, which has meant that lecturers have been able to introduce students to the task of making transcriptions by including practical activities in their coursework. When students are given feedback on their transcriptions, they learn how to improve their reading of manuscripts.

Because images can be imported and exported through the IIIF APIs, we have been able to integrate our Digital Collections platform with other applications, such as FromThePage.
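To give an idea of what such an integration works with, here is a minimal sketch in Python of reading page images from a IIIF manifest. It assumes a manifest in the IIIF Presentation API 2.x layout and image services that follow the IIIF Image API 2.x URL pattern. The function name and the example.org identifiers are made up for illustration, and this is not the code that FromThePage or our repository actually runs.

```python
def list_canvas_images(manifest):
    """Return (canvas_id, label, image_url) for each canvas in a
    IIIF Presentation 2.x manifest that has at least one image."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            images = canvas.get("images", [])
            if not images:
                continue  # skip canvases without a painted image
            resource = images[0]["resource"]
            service = resource.get("service", {})
            if "@id" in service:
                # Image API 2.x pattern: region/size/rotation/quality.format
                image_url = service["@id"].rstrip("/") + "/full/full/0/default.jpg"
            else:
                # No image service: fall back to the static image URL
                image_url = resource["@id"]
            pages.append((canvas["@id"], canvas.get("label", ""), image_url))
    return pages


# Hypothetical, heavily trimmed manifest (all identifiers are placeholders).
example_manifest = {
    "@type": "sc:Manifest",
    "sequences": [{
        "canvases": [
            {
                "@id": "https://example.org/iiif/item1/canvas/1",
                "label": "f. 1r",
                "images": [{
                    "resource": {
                        "@id": "https://example.org/images/item1-1.jpg",
                        "service": {"@id": "https://example.org/iiif/image/item1-1"},
                    }
                }],
            },
            {
                "@id": "https://example.org/iiif/item1/canvas/2",
                "label": "f. 1v",
                "images": [{
                    "resource": {"@id": "https://example.org/images/item1-2.jpg"}
                }],
            },
        ]
    }],
}

pages = list_canvas_images(example_manifest)
for canvas_id, label, image_url in pages:
    print(label, canvas_id, image_url)
```

A transcription tool needs little more than this list: the canvas identifiers say which page a transcription belongs to, and the image URLs say what to show the transcriber.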

One of the great advantages of FromThePage is that we can also invite ‘the crowd’ to help us with the transcription of our collections, especially the manuscripts and the printed documents with low-quality OCR. Starting and running a crowdsourcing project requires good preparation and, among other things, it is important to decide in advance what needs to happen with the results.

What do you do After the Collaborative Transcription?

After presenting the technical integration of FromThePage with Leiden's repository in a webinar organised by FromThePage in January 2022, I was invited to the symposium "After the Collaborative Transcription", which focused on this very question: What can we do with the results of crowdsourcing projects?

The symposium was held online on 19, 20, and 21 April 2022. It was organised by Allyssa Guzman and Albert A. Palacios, who lead the project Enabling and Reusing Multilingual Citizen Contributions in the Archival Record at the University of Texas Libraries. The symposium was one of the final activities in this project.

The opening keynote presented the state of the crowdsourcing field. It discussed the latest technological developments in platforms, and it also highlighted trends in contributions by the crowds. In general, using crowdsourcing has become more accepted and attainable for cultural heritage institutions and many lessons have already been learned and shared, for example in the Collective Wisdom Handbook.

However, there continue to be many unanswered questions, and some of the older insights have become unsatisfactory following recent developments surrounding crowdsourcing platforms. The discussion sessions in the symposium started with the following questions:

  • How do you integrate transcriptions into the collections management system?
  • When asking students to contribute, what do you need to consider and what can you offer?
  • How should you credit contributors? (Are they “contributors”, “volunteers”, or maybe “volunpeers”?)
  • What is a meaningful contribution in a crowdsourcing (transcription) project?
  • What (copy)rights apply to transcriptions and translations and how do you manage these?

The basic answer to most of these questions is "it depends". The discussions did provide some longer answers, from various perspectives, but I will not try to summarise the full discussion, as the main points will also be included in a symposium report, which will appear in due course.

Connecting user-contributed information to digital cultural heritage

Even though I felt that all the questions were important and that the discussions were very insightful, my most immediate interest was in learning about others' experiences with integrating the results of collaborative projects into collection management and presentation systems. I have been involved a couple of times in integrating researchers’ descriptions of specific collections into the Digital Collections repository. The systems and processes for managing such metadata were not set up to accept the results of these bespoke projects and tools, so it was often difficult to integrate the data.

Several institutions participating in the symposium had different ideas on how to integrate results from collaborative projects, and I felt relieved to hear that my experience was shared by others.

Shared experiences and good practices are also being exchanged in the Digital Heritage Network (NDE) in The Netherlands, a collaboration of heritage institutions that aims to improve the visibility, usability, and preservation of (information about) digital heritage.

In the session on integrating project results into repository systems I suggested that transcriptions should not be viewed as metadata or as fully separate works, but as annotations of the originals. This is not only recommended by the NDE standards, but also core to IIIF, which is widely used in repositories for presenting digitised collections.
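To illustrate what "a transcription as an annotation" could look like in practice, here is a minimal sketch in Python that builds one annotation for one transcribed line. It assumes the W3C Web Annotation data model as used in the IIIF Presentation API 3.0, where text that transcribes a canvas uses the motivation "supplementing". The function name, the identifiers, and the sample text are all made up for illustration. It does not describe how our repository or FromThePage stores transcriptions.

```python
import json


def transcription_annotation(annotation_id, canvas_id, text, language, region=None):
    """Build a Web Annotation (as a dict) that attaches a transcription
    to a IIIF canvas, optionally to a rectangular region of it."""
    target = canvas_id
    if region is not None:
        x, y, w, h = region
        # Media fragment selecting a rectangle on the canvas
        target = f"{canvas_id}#xywh={x},{y},{w},{h}"
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "supplementing",
        "body": {
            "type": "TextualBody",
            "value": text,
            "language": language,
            "format": "text/plain",
        },
        "target": target,
    }


# Hypothetical identifiers and text, for illustration only.
annotation = transcription_annotation(
    annotation_id="https://example.org/annotations/item1/line-1",
    canvas_id="https://example.org/iiif/item1/canvas/1",
    text="Placeholder text of the first transcribed line",
    language="en",
    region=(100, 200, 800, 60),
)
print(json.dumps(annotation, indent=2, ensure_ascii=False))
```

The point of this design is that the original stays untouched: the transcription lives in its own record, points at the page (or part of the page) it describes, and can carry its own information about contributors and rights.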

Although the UBL have been active in the NDE for years, we officially joined when the UBL signed the Digital Heritage Network Manifesto in November 2021. This partnership implies an indirect promise to facilitate the process of adding and linking information about the collections, for all users of our digital collections. This does not mean that we will have great support for large crowdsourcing projects of our collections soon, but it does mean, in my view, that we will continue to find better ways of connecting the results of research with our collections.

After "After the Collaborative Transcription"

It was an honour to be invited to the symposium and to discuss the central challenges with leaders in the field of crowdsourced transcriptions. There was a lot of collective wisdom in evidence, both from those present in the video conference and in the many sources being exchanged.

A couple of months after the symposium I still feel inspired to apply all of this wisdom to future collaborative data projects, so please get in touch with me about your project ideas!