RESILIENCE Tool: eScriptorium

29 September 2020

Supplying tools to scholars from all the scientific disciplines studying religions is part of our core business. One of the tools we offer is the eScriptorium software, an important building block in the RESILIENCE Research Infrastructure. Read Peter Stokes’s blog post below if you want to know how you can use this tool for your research.

EPHE

The Digital Humanities team at the École Pratique des Hautes Études (EPHE) – Université PSL has been developing this cutting-edge deep learning software for automatically reading and transcribing documents in many different scripts and languages.

eScriptorium

The purpose of eScriptorium is to provide a workflow that is as complete as possible for the production of digital editions. The first step in this is the transcription of primary sources, and this is the part that the team has been focussing on to date. This part of the workflow is now functioning and being tested on a wide range of scripts. Soon we will also have the annotation of images, following much the same principles as the Archetype project, and the annotation of texts according to the TEI standard for adding philological, historical, linguistic, palaeographical and other information.

Deep Learning and Artificial Intelligence

The transcription part of eScriptorium draws on principles and techniques of deep learning and artificial intelligence. As a user, you train the machine to transcribe your texts automatically according to the principles that you want. It is designed to work with books, documents, inscriptions and more, in almost any writing-system in the world, operating directly from the digital images. This is particularly interesting and useful now that libraries, archives and other institutions have published many thousands – even millions – of images of documents that are freely available online. Furthermore, the International Image Interoperability Framework (IIIF) means that we can now access these images directly and automatically, even when they are hosted by different libraries scattered across different countries or even continents. So if you have a long book or document that you wish to transcribe, such as a book of hundreds of pages, or many different books or documents written in the same script, then eScriptorium will probably be very useful for you. Even if you have a corpus of many thousands of documents written in different scripts, eScriptorium can still give you a rough transcription with errors, and this can still be useful: for instance, it may well be good enough to identify the text, so you could then automatically search many different archives to find copies of a text that you want.
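To give a concrete sense of what this direct access looks like, here is a minimal Python sketch that fetches a full-size image using the IIIF Image API URL pattern. The server address and image identifier below are placeholders rather than real endpoints, so treat this only as an illustration of the principle:

    import requests

    # Placeholder IIIF Image API endpoint: substitute the base URL of an image
    # actually served by a library's IIIF server.
    IIIF_BASE = "https://iiif.example-library.org/iiif/ms-1234-f001r"

    # The IIIF Image API composes URLs as
    # {base}/{region}/{size}/{rotation}/{quality}.{format},
    # so "full/full/0/default.jpg" requests the whole image at full size.
    url = f"{IIIF_BASE}/full/full/0/default.jpg"

    response = requests.get(url, timeout=30)
    response.raise_for_status()

    with open("ms-1234-f001r.jpg", "wb") as fh:
        fh.write(response.content)

The point is that once an institution publishes its images via IIIF, any tool – including eScriptorium – can retrieve them automatically in this way, without manual downloading.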

How to Use it in Practice

So, how does one use eScriptorium in practice? The first step is to import the images of your document(s). One way to do this is simply to upload images from your computer, if you have them. Alternatively, you can use the IIIF standard: if your documents are available via IIIF, then you can simply find the IIIF URL of your document (the manifest) and drop it into eScriptorium, and the software will then import the entire document automatically. For instance, the Biblissima project has a portal with the manifests of tens of thousands of manuscripts from different libraries and repositories, and here is a video demonstrating the automatic import of images found in the Biblissima portal:
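Behind the scenes, a IIIF manifest is simply a JSON document listing the images (canvases) of a book or manuscript. As a rough illustration of what happens when you drop a manifest URL into eScriptorium, here is a minimal Python sketch that lists the image URLs in a manifest. The URL is a placeholder, and the structure assumed is that of the IIIF Presentation API version 2:

    import requests

    # Placeholder manifest URL: in practice you would copy the manifest link of a
    # manuscript from a portal such as Biblissima.
    MANIFEST_URL = "https://example.org/iiif/ms-1234/manifest.json"

    manifest = requests.get(MANIFEST_URL, timeout=30).json()

    # In the IIIF Presentation API 2.x, a manifest contains sequences of canvases,
    # and each canvas carries one or more image annotations.
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            label = canvas.get("label", "")
            for image in canvas.get("images", []):
                print(label, image["resource"]["@id"])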

Machine Learning

Clearly eScriptorium does not work by magic, and so we cannot expect it magically to know exactly how we want to read all our documents. Indeed, different documents from different cultures and periods of history have very different layouts, and it is very difficult or impossible to program a computer to understand all these different formats in advance. Furthermore, we may not want to transcribe every letter of every page: we may want to leave out the page numbers, for instance, or running headers, or marginal notes, or perhaps we want to include all of these. For this reason, eScriptorium is based on machine learning. The idea here is that as a user of eScriptorium, you create examples of what you want the computer to do, and the machine looks at these examples and uses them to learn what you want. So you can use eScriptorium to annotate images of your documents yourself, to show where the lines of text are, which regions you do and do not want to transcribe, and so on, according to your own needs and practices. The machine will then learn from your examples, and once it has learned, it can apply the same principles automatically to hundreds or thousands of other images. To get you started, eScriptorium has a ‘default’ system which can detect the layout of basic pages according to the most common needs. It is usually not perfect, but you can correct and modify the results according to your particular needs. Here is a video which shows the process of ‘default’ line detection:

And this video shows the process of correcting these basic results according to the needs of one particular researcher:
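For readers who like to look under the bonnet, this ‘default’ layout analysis is performed by the kraken engine on which eScriptorium is built (see the Open Source section below), and kraken can also be called directly from Python. The following is only a minimal sketch: the image path is a placeholder, and the exact function names and return types vary between kraken versions:

    from PIL import Image
    from kraken import blla

    # Load a page image (the path is a placeholder).
    im = Image.open("page_001.jpg")

    # Run kraken's default baseline segmentation, which detects text lines and
    # regions; this is roughly what eScriptorium's 'default' layout analysis does.
    segmentation = blla.segment(im)

    # In recent kraken releases each detected line is described by its baseline
    # (a list of points) and its bounding polygon.
    for line in segmentation["lines"]:
        print(line["baseline"], line["boundary"])

In eScriptorium itself, of course, all of this happens through the interface, so no programming is required.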

Transcription

Once the lines have been identified, we can then start the transcription. Once again, we know very well that different researchers have different principles of transcription, particularly when different writing-systems are involved. Do you want to normalise spellings, or transcribe into a different writing-system (such as romanisation)? How do you want to treat punctuation and abbreviations? Again, with eScriptorium it is up to you to decide which standards and principles to use. As before, you must then create example transcriptions of what you want, so that the machine can learn from your examples and then continue the work by itself. With eScriptorium you can type the text by hand, you can import an existing text using a standard format such as PAGE or ALTO XML, or you can copy and paste a text from elsewhere, as is demonstrated in this video:

After creating enough examples and showing them to the machine, the computer learns from them and can then transcribe the rest of your documents automatically. You may need to correct the results at first, but you can then use these corrected results to retrain the machine. In general, the more effort you put in at the start, the better the results will be and the less correction will be necessary, so the process will normally get better and faster as you go. The result is a transcription which you can download as text or XML and use however you wish. This video shows the process of automatic transcription:
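Again for those who want to see the machinery, the recognition step can also be run directly with the kraken engine from Python. The image and model filenames below are placeholders, and the function names reflect the kraken API at the time of writing, so they may differ in other versions; this is a sketch, not a recipe:

    from PIL import Image
    from kraken import blla, rpred
    from kraken.lib import models

    im = Image.open("page_002.jpg")  # placeholder image path

    # Detect the lines first (see the segmentation sketch above).
    segmentation = blla.segment(im)

    # Load a recognition model trained in eScriptorium/kraken (placeholder filename).
    model = models.load_any("my_trained_model.mlmodel")

    # Run the recogniser over the detected lines; each record holds the
    # predicted text for one line.
    for record in rpred.rpred(model, im, segmentation):
        print(record.prediction)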

The following video shows how to use eScriptorium for exporting texts and for training the machine:
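The exported files can then be processed with standard tools. For example, if you export your transcription as ALTO XML, a few lines of Python are enough to pull out the plain text of each line. The filename below is a placeholder, and the namespace shown is that of ALTO version 4 (older exports may use an earlier namespace):

    import xml.etree.ElementTree as ET

    # ALTO v4 namespace; adjust if your export uses an earlier ALTO version.
    ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

    tree = ET.parse("export_page_001.xml")  # placeholder filename

    # In ALTO, each TextLine contains String elements whose CONTENT attribute
    # holds the transcribed words.
    for line in tree.findall(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.findall("alto:String", ALTO_NS)]
        print(" ".join(words))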

It should be clear that if you want to use eScriptorium then you must put in a certain amount of work to get started. This is normal: again, there is no magic, and documents are complicated, which makes them interesting but also difficult to treat automatically. If you have only one short document, or a number of documents written in very different scripts or with very different layouts, then eScriptorium may not be worth the effort for you. On the other hand, if you have a long document or a large and relatively coherent corpus, then this software could be very useful indeed.

Possibility to Download and Share Trained Models

Related to this, however, is another important feature of eScriptorium: unlike most other comparable systems, eScriptorium makes it possible (and indeed encourages you) to download, publish and otherwise share your trained models, and to upload existing trained models from other projects. To give an example, we are in the process of training different models to read manuscripts written in Arabic, Hebrew, Syriac and Latin. If your documents are written in scripts similar to ours, then you can take our trained models, upload them to eScriptorium (or any other compatible software in the future) and use them for your own work. Even if your documents are different, or if you use different principles of transcription, you can still start with someone else’s models and retrain them on your own material. The benefit of this is that you will need far fewer examples to get started, because the computer has already learned about similar cases and simply needs to adjust to yours. After that, you are strongly encouraged (but not required) to share your own models so that we can all benefit from each other’s work. This saves time and effort, and even reduces the environmental impact by avoiding retraining models on the same information over and over again.

Open Source

The eScriptorium software is entirely free and open source (see below for links). It is based on the Kraken engine, which is also free and open source and is developed by Ben Kiessling, a research engineer and PhD student in the Digital Humanities team at the École Pratique des Hautes Études – Université PSL. He is an expert in the automatic transcription of manuscripts (better known as HTR, or Handwritten Text Recognition), particularly as applied to writing in scripts other than the Latin alphabet.

Experiences

eScriptorium is already being used by several projects, such as Vietnamica, which is an ERC Advanced Grant at the EPHE led by Philippe Papin with Marc Bui; Sofer Mahir, which is a project by the EPHE and the University of Haifa on the automatic transcription of medieval Hebrew manuscripts; the Lectaurep project in collaboration with INRIA and the French National Archives; and the Open Islamicate Texts Initiative Arabic-Script Optical Character Recognition Project (OpenITI AOCP), which is funded by the Mellon Foundation and is a collaboration of the EPHE, the University of Maryland, College Park, Aga Khan University, Vienna University, and Northeastern University.

Funding

eScriptorium is currently funded by RESILIENCE and will be incorporated into the RESILIENCE Research Infrastructure. Before this, it was part of a PSL research project called Scripta-PSL, and it benefits from the support of the Île-de-France ‘Domaine d’Intérêt Majeur Sciences Texte et Connaissances Nouvelles’ (DIM STCN), which is helping to buy high-performance computers to run the software, as are the EPHE, the Institut de Recherche et d’Histoire des Textes (IRHT), Scripta-PSL and the Sofer Mahir project.

More

For more information, see the following links:

 

– Peter Stokes, Directeur d’études EPHE