025 - A Perfect Job is the New Very Good Job

A little disclaimer for once, because when I name people, I usually prefer it to be in praise: I do not know Dan Cohen or his work, and my criticism of his article is not directed at him personally; rather, it takes his text as one example among many of a kind that develops the same type of discourse and contains the same type of flaws.

A second disclaimer: I moved the original French version of this post here: posts/025-fr.md.

Earlier this week, my colleague Louis-Olivier Brassard asked me for my opinion on the latest post by Dan Cohen, which he titled "The Writing Is on the Wall for Handwriting Recognition", adding a subtitle that sets the tone: "One of the hardest problems in digital humanities has finally been solved". I wanted to make my critical reading a bit more public, so I'm turning it into a blog post.

I read this article carefully because the subject obviously interests me, but I must admit that I usually approach this kind of reading with a negative preconception. That is the treatment I reserve for all those posts, on blogs or social media, announcing left and right that generative AI has revolutionized this or that -- this and that generally being problems that have occupied researchers and engineers for years and given rise to debates that are sometimes heated, even intractable. All these posts help fuel the hype around generative AI and undermine our already well-worn collective ability to think critically about it.

Dan Cohen's post follows the release of version 3 of Gemini, Google's generative AI model, publicized as Google's "most intelligent model yet". As every time a new model of this type is released, several users shared the results of their "experiments" with it. Dan Cohen is not the only one; Mark Humphries, for example, also published a post on the subject the same day, soberly titled "Gemini 3 Solves Handwriting Recognition and it’s a Bitter Lesson". I saw these two posts widely shared on BlueSky, praised by researchers whom I consider to hold positions of authority in the field of automatic transcription. After reading Dan Cohen's post, I found myself quite annoyed by these shares: I am not convinced that the text was read carefully by those who shared it on BlueSky.

In my opinion, the problem with Dan Cohen's post is twofold: 1) he develops a universal discourse about a tool that he has only tested on a minimal selection of examples, which say almost nothing about the problems users of automatic transcription face with old documents; 2) his demonstration relies on fallacious arguments.

A matter of scientific rigor

About the first point: Dan Cohen uses three examples that are not at all representative of the challenges of automatic transcription. Right from the start, this would justify a footnote to his subtitle: he says "one of the hardest problems in digital humanities has finally been solved"; I would add "as far as it concerns epistolary documents written in English during the first half of the 19th century by personalities whose biographies have been written, or whose correspondence has already been edited"1 because that is what he tested. That already reduces the scope of his results quite a bit, doesn't it? Moreover, given that the model fails to transcribe the third example, we could even add that this only concerns documents with a simple layout.2

This first point is really problematic because this post is a text published by a person who has scientific authority and should therefore demonstrate scientific rigor, even if we are only talking about a newsletter and not an edited article or book. That rigor would demand that we limit ourselves to drawing conclusions about what has actually been demonstrated, instead of prophesying doom with flashy (sub)titles. One can be convinced that Gemini is capable of successfully handling many other cases than those presented by Dan Cohen, but that is a matter of belief, not of scientific demonstration. I think this is a topic that needs to be discussed more broadly, in a context where AI is messianically served up to us in every possible form; Marcello Vitali-Rosati talks about it well in his latest post, and, from another angle, outside academic uses, there is the recent work of Hamilton Mann.

It happens that the day Louis-Olivier asked me to read Dan Cohen's text, I had also read the post by Sunayani Bhattacharya, who trained her students at Saint Mary's College of California in text analysis with Voyant Tools and who also touches on automatic transcription in passing. She explains that, with the aim of opening her students to the Global South, she had them work on texts in Bengali (even though none of them can speak or read Bengali). I find the exercise interesting and promising as she presents it. After developing in her students a familiarity with what properly edited Bengali press texts look like in Voyant Tools, she showed them what you get when you run Voyant Tools on texts taken directly from OCR software. These texts contain a lot of noise and sometimes do not even use the correct character sets. This gives her students a very concrete example of the limitations of software infrastructures when it comes to processing texts in Indic languages. She concludes by reiterating how useful it is to give students a better idea of what anglophone biases in technology look like on the ground. In a text like the one I discuss in this post, this anglophone (and I would even add modernist) bias is blatant.

A shaky demonstration

Now, regarding the second point, it requires taking a closer look at what Dan Cohen tells us and the examples he gives. There are inaccuracies that need to be pointed out, but also excerpts that do not correspond to the statements made in the post.

Let's start with an inaccuracy that concerns, precisely, the question of model accuracy. I have already discussed this in a previous post because it seems to me to be one of the topics where researchers are laziest: what accuracy are we talking about, and what are the limits of these accuracy measures? Dan Cohen states that "the best HTR software struggles to reach 80% accuracy". Since he clarifies that this means 2 wrong words out of 10, we can already see that he is talking about word error rate, not character error rate. Such an error rate says nothing about the readability of the text, since a single mistake is enough for a word to be counted as wrong. In a sentence like "the hardest problem in digtial humaities has finolly beeen sol ved", one word out of two contains a mistake, yet the sentence seems to me perfectly readable.3 To put things into perspective, the character accuracy of this sentence is 90.77% (according to software like KaMI). Beyond this initial imprecision, Dan Cohen's claim about the difficulties of traditional software seems false to me; I do not see what source he bases it on. For documents like those he tests, we are well above 80% accuracy, even at the word level, and with several models and several tools using RCNNs or Transformers.
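
To make the comparison concrete, here is a minimal Python sketch of a character-accuracy computation based on a plain Levenshtein distance over the reference length. It is only an illustration: tools like KaMI may normalize or weight edits differently, so the exact figure can vary slightly.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

reference = "the hardest problem in digital humanities has finally been solved"
hypothesis = "the hardest problem in digtial humaities has finolly beeen sol ved"

cer = levenshtein(reference, hypothesis) / len(reference)
print(f"character accuracy: {1 - cer:.2%}")  # prints 90.77% for this example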

Since this initial statement surprised me, I wanted to take a closer look at Transkribus' output to see whether it really made that many errors. Of course, there are errors in Transkribus' transcriptions. Yet when we look at the source document, we see that some of them are understandable in a zero-shot context. When Boole writes two "l"s in a row, his second "l" looks like an "e" with a very, very small loop. This explains why Transkribus' prediction contains errors on "tell" (read as "tele") on the left page and "All" (read as "Ale") on the right page. To find out the real extent of Transkribus' errors, I made my own transcription of the double page tested by Dan Cohen, line by line (following the line order taken from the segmentation in Transkribus, and drawing a little on the reading proposed by Gemini4). When I calculate the accuracy on this excerpt, I get a character accuracy of about 95% and a word accuracy of 88%.5 There is plenty of room for improvement, then, but we are not in the catastrophic situation the preamble suggests.

If we now turn to the transcription generated by Gemini, we can see that it actually contains some errors as well, whereas Dan Cohen tells us that "Gemini transcribed the letter perfectly". For example, on the right page, Gemini transcribes "occasionally by",6 adding as extra detail in a notes section that "On the right page (line 8), the handwriting becomes very scribbled. It appears to say 'take a long walk occasionally try & once or twice...' or possibly 'occasionally by & once or twice...'." Gemini fails here to propose a reading with a hyphenation that would make sense, and prefers instead to add a word to its transcription. The problem, of course, is not that Gemini did not produce a perfect transcription, but that Dan Cohen says it did without noting this error.

We have the same issue in the second example, where Gemini formats the word "transmitted" to indicate that it is crossed out in the source, when it is not. The text generated by Gemini leaves no doubt about the appearance of the text in the source, and invents an intention on the part of the author: "In the second line of the body, the word 'transmitted' is crossed out in the original text, but the sentence is grammatically incomplete without it (or a similar verb). It is likely the author meant to replace it to avoid repetition with the word 'transmitting' appearing a few lines later but forgot to insert the new word." Even though this error was easier to spot, Dan Cohen once again tells us: "Another perfect job."

Then comes the third example. Gemini does not offer a complete transcription of it; after a few lines, it generates a message indicating that the text is illegible beyond a certain point. This allows Dan Cohen to conclude: "Gemini does the right thing here: rather than venture a guess like a sycophantic chatbot, it is candid when it can’t interpret a section of the letter." I personally choke reading that, given the errors already noted in the two previous examples. Contrary to what Dan Cohen claims, there is no candor here, but rather a perverse effect of what I imagine is a calibration of the model based on its perplexity. In the first two examples, we can imagine that the model's perplexity about certain difficult passages leads to the generation of a note and/or an insertion in brackets, but does not prevent the generation of a false transcription. The error goes all the more unnoticed because the explanations generated in the notes sound good, even when they are false. We are not dealing with a candid robot, but with a scammer chatbot, a presti-generator, which finds an escape route when the situation is too big for a subtle feint. In my opinion, it is really time for users of this software to take this reality on board, and to look all the more closely when they check what these tools generate.

I haven't yet read Mark Humphries' post mentioned at the beginning, but I may come back to the subject in the future. To be honest, what I find really, really unfortunate about these publications coming from the academic world, which help fuel the hysteria around generative AI, is that they give me the impression that salvation will decidedly not come from the scientific community. As a citizen and a young researcher, this worries me a lot.

EDIT: 2025-12-01: Minor corrections and addition of another footnote.

EDIT: 2025-12-04: Translated the post to English (with the help of Copilot) and moved the French version to another path: posts/025-fr.md.


  1. I give this precision about the edition of biographies and correspondences because it is important: Dan Cohen did not take documents that we are sure are unpublished. Given that generative AI models are trained from everything that can be found on the Web, this means that these letters may have, in one way or another, been part of the batches used for training. For example, on the website of the Archives of University College Cork, from which the digitization of Boole's letter is taken, we find the following text in the description field: "Boole in Cork to Maryann. He is in a very depressed mood, life has become monotonous with only his work adding interest to the day. He enjoys playing the piano but 'it would be better with someone else to listen and to be listened to'. He is also very annoyed by [Cropers] dedicating his book to him without first asking for permission - 'I cannot help feeling that he has taken a great liberty' - and speaks in strong terms of [Cropers] 'pretensions to high morality'. He invites and urges Maryann to visit him as soon as their mother's health would allow. He feels the climate would do her good." These are contextual elements that can help a model when transcribing. 

  2. I purposefully use the term "simple layout" rather than "standard layout" because the phenomenon illustrated by the third example, the rewriting on the same sheet after having turned it 90°, corresponds to a practice that can be found at least until the mid-20th century. 

  3. By readable, I mean that one does not need to know what the original sentence was in order to understand what should have been read in place of the errors. I admit, however, that depending on familiarity with the text or the language, or on the nature of the errors, this readability may vary. If you still find this sentence unreadable, it should be read as follows: "the hardest problem in digital humanities has finally been solved". There was one letter inversion in "digital", one missing letter in "humanities", one letter substituted for another in "finally", one extra letter in "been", and an inappropriate separation in "solved".

  4. Let me quickly expand on the question of the layout. In Gemini's transcription, there are additional pieces of information suggesting that the model correctly identified which part of the text corresponds to which page. In Transkribus' transcription, this is not the case, but I think that is because Dan Cohen only used Transkribus' basic web page for testing models. If he had used the full version of Transkribus, I am sure the software would also have perfectly identified the double-page layout. As for the line-by-line transcription, we no longer have this information in Gemini's transcription, which generates the text continuously.

  5. Among the errors made by Transkribus, we can also note the use of a "в" (the Cyrillic v) to transcribe the "B" in the margin of the document, and a "р" (the Cyrillic r) to transcribe the "P" that follows. These are errors that escape us when we do a quick visual check, which do not hinder reading by humans, but which lower the accuracy calculated automatically since a в is not a B and a р is not a P, nor indeed a p (see what I did here?). 

  6. Transkribus transcribed it as "occasion by". 


024 - The messy backstage of a literature review

A few weeks ago, I began a thorough review of articles published in four digital humanities venues to track mentions of automatic text recognition and to understand how, where, and why scholars use it. Although I wish I had started sooner in my doctoral journey, I stay positive, holding on to the idea that "it's never too late." I am learning a lot about Digital Humanities as a field of research and gaining a better understanding of ATR's presence in it.

While catching up on our dissertation progress, I was telling Roch Delanney about the survey I am conducting, my goals for it, and how I selected and sorted the articles. Roch suggested that I share my method more widely. It may seem a little clumsy at times, but it also lets me use many different skills I have learned and sharpened over the years, so I think it is indeed worth sharing a bit of my cuisine.

Perimeter of the literature review

My literature review focuses on four publication venues. I think they are, collectively, representative of research in the Digital Humanities:

  • Digital Scholarship in the Humanities (DSH), which is presented by the Alliance of Digital Humanities Organizations (ADHO) as an international, peer-reviewed journal published by Oxford University Press on behalf of ADHO and the European Association for Digital Humanities (EADH). It was published under the title Literary and Linguistic Computing: The Journal of Digital Scholarship in the Humanities until 2014. I counted 174 issues, for a total of 1741 articles (excluding retracted articles, book reviews, editorials, and committee reports) published from 1985 through the first half of 2025.

  • Digital Humanities Quarterly (DHQ) is an open-access peer-reviewed journal, probably more representative of research in North America. It is published by the Association for Computers and the Humanities (ACH). I counted a total of 790 articles published since its first issue in 2007. Most articles are in English.

  • The Journal of Data Mining and Digital Humanities (JDMDH) has been published by Episciences since 2017. Contrary to DHQ, its focus is more European, and it has a special volume dedicated specifically to automatic text recognition (edited by Ariane Pinche and Peter Stokes). I found a total of 162 articles published in JDMDH, including the special volume on ATR.

  • Lastly, the proceedings of the more recent Computational Humanities Research (CHR) conference (see the 2024 conference proceedings for example) offer a perspective on research focused on more intensively computational methods in the Humanities. The conference has been held annually since 2021. I found a total of 214 articles in the proceedings.

Aside from DSH, which I can access thanks to the library of the University of Montréal, all the other venues are open access.

Collecting the articles and their metadata

For JDMDH, articles are not centralized on the journal's website but published on platforms like HAL or arXiv, and sometimes Zenodo. Getting an overview of the articles published in JDMDH is not straightforward, but it is possible to browse them volume by volume. I opened and downloaded each article in each volume, and collected the article entries in Zotero using the Zotero connector. The process was cumbersome and required many clicks, but the variety of publishing platforms deterred me from writing a script to automate the downloads.

CHR, on the other hand, was very easy to scrape, partly because there are only four volumes of proceedings so far. For each volume, the index of all articles is compatible with the Zotero connector's batch-import scenario. To collect the PDFs, I used a section of the HTML page and regular expressions to identify the links to the PDF files, creating a list of URLs. Finally, I used a Python script to download the PDFs to my computer.

For example, in https://ceur-ws.org/Vol-2989/, the ul contains simple HTML elements pointing to the PDF files, such as:

<h3><span class="CEURSESSION">Presented papers</span></h3>

<ul>
  <li id="long_paper5"><a href="long_paper5.pdf">
      <span class="CEURTITLE">Entity Matching in Digital Humanities Knowledge
      Graphs</span></a>
    <span class="CEURPAGES">1-15</span> <br>
    <span class="CEURAUTHOR">Juriaan Baas</span>,
    <span class="CEURAUTHOR">Mehdi M. Dastani</span>,
    <span class="CEURAUTHOR">Ad J. Feelders</span>
  </li>
...

All I had to do was copy and paste this entire list into a text editor (I like to use Sublime Text in such situations). Then, I used a simple regular expression like href=".+?" to select the values of the a elements, which contain the links to the PDF files. I kept only the selected text and rebuilt the complete URLs with a couple of replacements, such as href=" -> "https://ceur-ws.org/Vol-2989/ and "\n -> ",\n. At that point I just added square brackets around the selection, et voilà! I had a Python list ready to be passed to a script like the one below to download the files:

list_of_urls = ["https://ceur-ws.org/Vol-2723/short8.pdf",
                "https://ceur-ws.org/Vol-2723/long35.pdf",
                "https://ceur-ws.org/Vol-2723/long44.pdf",
                #...
                ]

import requests
import time
from tqdm import tqdm  # it makes a progress bar so I know how long I can take to make a tea while the script runs

for url in tqdm(list_of_urls):
    r = requests.get(url)
    if r.status_code == 200:
        # build a name like "Vol-2723-short8.pdf" from the last two URL segments
        filename = f"{url.split('/')[-2]}-{url.split('/')[-1]}"
        with open(filename, "wb") as f:
            f.write(r.content)
    else:
        print(f"Failed to download: {url}")
    time.sleep(1)  # this cool-down is to be polite to the server

I used a similar approach to download the articles from DHQ, because its Index of Titles lists all published articles on a single page. I first downloaded the HTML pages of the articles (DHQ publishes articles in HTML as well as PDF), again using regular expressions to extract the list of links and a Python script to download the files.
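
For the record, the link extraction can also be done directly in Python rather than in a text editor; the sketch below is purely illustrative (the index URL and the href pattern are placeholders, not the exact ones used for DHQ).

import re
import time
from urllib.parse import urljoin

import requests

INDEX_URL = "https://example.org/journal/index-of-titles.html"  # placeholder URL

index_html = requests.get(INDEX_URL).text
# capture the value of every href attribute pointing to an article page
links = re.findall(r'href="([^"]+\.html)"', index_html)

for link in links:
    url = urljoin(INDEX_URL, link)  # rebuild an absolute URL from a relative link
    r = requests.get(url)
    if r.status_code == 200:
        filename = url.rstrip("/").split("/")[-1]
        with open(filename, "wb") as f:
            f.write(r.content)
    time.sleep(1)  # stay polite with the server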

Unfortunately, the Zotero connector only works on each article page individually, not for batch import from the index page. I investigated a bit to understand why, and found that the source code of each article page contains a span element with the class Z3988, which the Zotero connector uses to extract the metadata and create an entry in Zotero. In DHQ, these spans look like this:

<span class="Z3988" title="url_ver=Z39.88-2004&amp;ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;rft.genre=article&amp;rft.atitle=Academics%20Retire%20and%20Servers%20Die%3A%20Adventures%20in%20the%20Hosting%20and%20Storage%20of%20Digital%20Humanities%20Projects&amp;rft.jtitle=Digital%20Humanities%20Quarterly&amp;rft.stitle=DHQ&amp;rft.issn=1938-4122&amp;rft.date=2023-05-26&amp;rft.volume=017&amp;rft.issue=1&amp;rft.aulast=Cummings&amp;rft.aufirst=James&amp;rft.au=James%20Cummings"> </span>

I understood recently, while discussing with Margot Mellet, that Z3988 refers to the OpenURL Framework Standard (ANSI/NISO Z39.88-2004), which the Zotero connector relies on. I should also note that such spans are not systematically used in online journals: JDMDH, for example, doesn't use them and serves its metadata differently.

Since I had already downloaded all the DHQ articles as HTML files, I wrote a simple Python script that found all such spans in each downloaded article and aggregated them into a single, very simple HTML file. Then I simply opened this page in my browser after starting a local server1 (with a command like python -m http.server), and I was able to use the Zotero connector to import all the articles in a single click. It was very satisfying! The only downside is that I couldn't collect the articles' abstracts, because they weren't included in the spans.
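
A minimal sketch of what that aggregation step can look like, assuming the article pages are saved as .html files in the working directory and using BeautifulSoup (the output file name is illustrative):

from pathlib import Path

from bs4 import BeautifulSoup

spans = []
for html_file in Path(".").glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    # collect every COinS span (class "Z3988") found in the article page
    spans.extend(str(span) for span in soup.find_all("span", class_="Z3988"))

# wrap all the spans in a very simple HTML page for the Zotero connector
page = "<html><body>\n" + "\n".join(spans) + "\n</body></html>"
Path("z3988_index.html").write_text(page, encoding="utf-8")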

DSH was different from the rest. Because of the journal's longevity and the number of articles it has published, it was quite overwhelming. Unfortunately, it is a paywalled journal, and I couldn't figure out how to make the University of Montreal library's proxy work with my Python scripts and the command line. As a result, I had to download the articles manually,2 but only when they were relevant! Since DSH has a fairly good search engine that allows multi-keyword searches, I only downloaded the articles matching my search criteria (143 in total).

Additionally, I went through each of the 174 issues of DSH to batch-import the article references in Zotero. It was tedious but I figured I might be able to use these metadata for other projects in the future.

Filtering the articles

For DHQ, JDMDH, and CHR, I ran a keyword search using the grep command on the content of the articles. I didn't want to limit my search to titles, abstracts, or keywords, because I really wanted to include anecdotal mentions of automatic text recognition in my results.

To use grep, I created a file (pattern.txt) with the keywords I was looking for:

HTR
OCR
text recognition
ATR
Transkribus
eScriptorium
automatic transcription

Then I converted the PDFs into text files using the pdftotext command. This was necessary because grep cannot search inside a PDF directly. I didn't need this conversion for DHQ, since I had downloaded HTML files from that journal.

The commands to search inside the PDFs of one of the journals would look like this:

ls *.pdf | xargs -n1 pdftotext # to convert PDFs to text files
grep -i -w -m5 -H -f ../pattern.txt *.txt # to search for the keywords in the text files and display the first 5 matches

After checking how grep matched the keywords, I used grep -l -f ../pattern.txt *.txt to list the files that matched. This list let me sort the documents into two folders, according to whether or not they were relevant to my research.
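
As a sketch, the sorting step can be as simple as the following, assuming grep -l's output was redirected to a file called matched_files.txt (the file and folder names are illustrative):

import shutil
from pathlib import Path

matched = set(Path("matched_files.txt").read_text().split())

Path("match").mkdir(exist_ok=True)
Path("no_match").mkdir(exist_ok=True)

for txt_file in Path(".").glob("*.txt"):
    if txt_file.name == "matched_files.txt":
        continue  # don't move the list itself
    target = "match" if txt_file.name in matched else "no_match"
    shutil.move(str(txt_file), str(Path(target) / txt_file.name))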

In the case of DSH, I directly used the search engine to combine the keywords, using the "OR" operator. I set the full text of the articles as the scope of my research: https://academic.oup.com/dsh/search-results?allJournals=1&f_ContentType=Journal+Article&fl_SiteID=5447&cqb=[{%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22automatic%20transcription%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22transkribus%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22text%20recognition%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22escriptorium%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22OCR%22}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22HTR%22}]}]&qb={%22_text_1-exact%22:%22automatic%20transcription%22,%22qOp2%22:%22OR%22,%22_text_2-exact%22:%22transkribus%22,%22qOp3%22:%22OR%22,%22_text_3-exact%22:%22text%20recognition%22,%22qOp4%22:%22OR%22,%22_text_4-exact%22:%22escriptorium%22,%22qOp5%22:%22OR%22,%22_text_5%22:%22OCR%22,%22qOp6%22:%22OR%22,%22_text_6%22:%22HTR%22}&page=1

In both cases, the search was case-insensitive, in order to catch as many occurrences as possible of keywords like "automatic text recognition", "Text Recognition", "text recognition", etc. However, this meant that I sometimes got false positives: "democracy" often matches "ocr", as does "theatre" with "atr". Since DSH's search engine returns the match in context, I was able to ignore these false positives there. For the other journals, I had to check manually where the matches were; usually, I combined this check with the next step of my investigation.

Hits per journal

  • JDMDH: 47 hits (out of 162 articles)
  • DHQ: 93 hits (out of 790 articles)
  • DSH: 143 relevant hits (out of 1741 articles)
  • CHR: 65 hits (out of 214 articles)

Dépouillement and analysis

To date, I am still in the process of reading the articles and taking notes on the occurrences of my keywords.

I use Zotero to keep track of the articles I read and to confirm whether they are false positives. Sometimes I leave out articles that are irrelevant even though they mention one of the keywords I was looking for. For example, Liu & Zhu (2023)3 contains the string "OCR", but it only appears in the title of a work cited in their bibliography, in a context where OCR is not relevant to their argument. With tags in Zotero, I clearly identify such articles as "to be left out" of my analysis, but I don't remove them from the collection.

I use different tags to identify the various ways the technology occurs in the articles. For example, I distinguish between firsthand applications of ATR and the reuse of data produced with ATR before the experiment presented by the authors. Typically, there are many mentions of documents that were OCRed by libraries and then used by scholars to conduct their research. Overall, with this analysis, I am trying to add depth to the observations made by Tarride et al. (2023),4 who pragmatically considered three situations leading to the use of ATR: 1) the production of digital editions; 2) the production of large searchable text corpora; and 3) the production of non-comprehensive transcriptions to feed knowledge bases. However, it is difficult to settle on definitive categories before I am done processing all the collected articles.

Due to the large number of articles to analyze, I have continued to use grep to quickly review the content of articles and speed up my sorting. For example, since I am more interested in firsthand uses of ATR, I want to be able to quickly identify irrelevant mentions of my keywords, as was the case in Liu & Zhu (2023). The command grep -i -w -C 5 -H -f ../pattern.txt *.txt > grep_out generates a file, grep_out, in which, every time a keyword is matched in a document, five lines of context are displayed before and after the match, along with the name of the file. I still have to read the abstracts and parts of the articles to understand clearly in which contexts automatic text recognition technologies are used, but this is an effective method for quickly sorting through the articles.

I'm looking forward to sharing the results of this analysis in my dissertation!


  1. This emulation is necessary to allow the Zotero connector to work properly. 

  2. I want to specify here that it was not for lack of reading documentation on proxies and requests. Unable to find a straightforward solution, unsure whether it was even something the UdeM proxy allowed, and because I would still have needed to write additional scripts afterwards, I decided it would take just as long to do it manually (about 2-3 hours).

  3. Liu, Lei, and Min Zhu. "Bertalign: Improved Word Embedding-Based Sentence Alignment for Chinese–English Parallel Corpora of Literary Texts." Digital Scholarship in the Humanities 38, no. 2 (June 1, 2023): 621–34. https://doi.org/10.1093/llc/fqac089

  4. Tarride, Solène, Mélodie Boillet, and Christopher Kermorvant. "Key-Value Information Extraction from Full Handwritten Pages." arXiv, April 26, 2023. https://doi.org/10.48550/arXiv.2304.13530

023 - Writing a PhD manuscript with Markdown and Quarto

The deadline for finishing the dissertation is approaching. And there is still so much to do! This is one of the main reasons why this research blog has been quiet for the last few months, even though there are many topics I would like to write about.

But I guess I can take a short break from time to time and go with the flow of writing a blog post in one sitting. Who knows, maybe I'll do a few more before it's time to turn in my dissertation. I want to talk about my writing setup because it is something I have thought about a lot, trying to find the best compromise.

Writing my dissertation in Microsoft Word has never been an option, although I do use Google Docs from time to time to get quick feedback from my supervisors.

LaTeX may seem like an obvious choice to some of my fellow PhD writers, but I usually limit my use of LaTeX to Overleaf, an online LaTeX editor. On the one hand, I didn't necessarily want to install LaTeX locally for the time being; on the other hand, I couldn't imagine writing a whole dissertation in Overleaf, because working in my browser can be distracting and because it would require me to always have Internet access. To be honest, I mostly didn't want to use LaTeX in the first place because I find the syntax too distracting when I'm writing. It's super useful for getting good control over the layout of the document for the final version of the manuscript, but it's not convenient to work with while I'm formulating ideas and arguments.

I will probably use LaTeX to prepare the final version of the manuscript, but in the meantime I wanted something lighter to structure my document, yet easily convertible to LaTeX down the road.

And I am a big fan of Markdown.

Markdown has a syntax that is light enough not to be too distracting - I use it all the time when taking notes anyway, so it is fully part of my writing reflexes. Also, in the context of writing my dissertation, I think of Markdown as text that I can easily copy and paste into a Google document when I need feedback, without losing formatting and without compromising readability in Google Docs. I've seen some LaTeX copy-pasted into Google Docs for supervisor feedback, and I don't think it would work for me.

In addition to Markdown, I wanted to be able to use a modular approach to building my manuscript. A modular approach means having several smaller text files that are eventually merged into a single master document. LaTeX also relies on modularity with commands like \include{}. Modularity is important because in a very long text document it is easy to get lost between inline comments, draft passages, and finished paragraphs. There's also the risk of accidentally deleting passages. With a modular structure, it will also be easier to move paragraphs around as I progress. Also, my manuscript is versioned with Git and synchronized with a private GitHub repository, and modularity makes versioning much easier.

Instead of programming my own manuscript builder -- yes, that was my first impulse -- I took a closer look at the documentation for Quarto, which I have been using for a little over a year to create slides and websites for the courses I teach. Quarto offered me a solution on a silver platter, because it supports building books from Markdown, which is close enough to a PhD thesis.

Quarto implements a single-source publishing paradigm and acts as a shell around pandoc, which allows swift conversion from one format to another, including from Markdown to LaTeX. I can split the document into multiple smaller Markdown files and use my book's config file to specify the order in which they are aggregated. Quarto's Markdown implementation includes some cool features from pandoc, including citation and cross-reference management. It is really worth taking a look at the documentation.

So with Quarto, I can write my dissertation as a series of smaller Markdown files, and end up with a master .md file, a .tex file ready to import into Overleaf, or even a PDF file already compiled with TinyTeX.

Quarto is not a text editor, it is simply a processor that starts with a set of markdown files and a config file, and then builds one or more outputs. To write, I use Visual Studio Code and have a quarto preview command running in the background. For now, it just produces an HTML preview that I see in my browser. When I'm closer to a stable version of the manuscript, I'll start working with PDF output.

The syntax for some of the more specific Markdown features in Quarto is more complex than I am used to, so I still have to look at the documentation from time to time. But I am getting the hang of it, and I use a cheat sheet for the features I use more often.

Pandoc's Markdown lets you apply classes to entire paragraphs or to inline portions of text. This is useful because it allowed me to create CSS transformations for classes like "draft" or "missing-information" to keep track of passages I need to rewrite, or blocks where I need to step away from my text editor and go back to my notes (usually in Zotero). I find it super useful for avoiding (at least as much as possible) the rabbit holes that distract me from actually writing. It is more efficient for my time management to divide my time between actual writing sessions and other sessions where I improve the draft passages or do the research I am missing to illustrate an argument.

Another use of inline classes is to keep track of concepts or specific terms that I could include in a glossary, or at least in a list of acronyms. By keeping track of them directly in the text, I can automate the generation of these sections. Some might say that this is the kind of thing I could do with TEI XML -- I agree, since this is semantic annotation. But as I said, I wanted a lightweight syntax, and I really like Markdown.

EDIT from June 20, 2025: I feel the need to add a clarification a few months after the original post: while I did like my setup with Markdown and Quarto for getting started on my dissertation, I eventually switched to good old LaTeX. Quarto/Markdown simply lacked too many features for what I wanted to do.

Part of the problem came from the fact that custom annotations, which turn into spans with custom classes in a Markdown-to-HTML scenario, were not converted into anything in LaTeX and were therefore lost. For example, I would have had to handle the glossary and the acronyms afterwards, only once I was done with Markdown and had fully switched to LaTeX. Rather than writing my own preprocessing script to work around this (as far as I could see, pandoc does not offer any option to map Markdown spans to custom LaTeX commands), I figured that switching to writing in LaTeX directly made more sense: there was no point in piling on complications.

Also, I really wanted to be able to use the todo package from LaTeX to keep track of feedback, side notes, and questions I had for myself while writing. With this package, they are visible in the PDF output, which is also useful when I share my text with other people.

Lastly, Roch Delanney greatly facilitated this switch by sharing his LaTeX template with me. It was easy to start from the setup he created with Robert Alessi and to add my own configuration and customization. Their template is much leaner than the ones that can be found on Overleaf, on top of being very well documented. It was great to keep things simple: I don't import any package I don't actually need.

022 - McCATMuS #5 - Training models

Last week, I visited Rimouski, in the Bas-Saint-Laurent region of Québec, on the south-eastern bank of the St. Lawrence River. I was invited to contribute to discussions around the Nouvelle-France Numérique project, and I took the opportunity to present HTR-United and CATMuS, as well as preliminary results on training a McCATMuS model. In preparation for this presentation, I ran a series of tests on the first two models I trained. Today, this blog post gives me a space to discuss these tests and their results in more detail.

The Kraken McCATMuS models were not trained directly on the HuggingFace dataset I introduced in my previous post, but rather on ARROW files created from the same ALTO XML files used to build the HuggingFace dataset. At the beginning of September, I wrote a Python script that reproduces the split of the ALTO XML files into train, validation, and test sets, and applies the same line filtering and modifications as previously presented. Instead of generating the PARQUET files for HuggingFace, it simply creates alternative .catmus_arrow.xml files and three listings of these files, ready to be passed to a ketos compile command1.

I used Kraken 4.3.13 to train the models on Inria's computation server, because I have had dependency issues with Kraken 5 and haven't fixed them yet. The first model I trained strictly followed the train/validation split, thanks to the --fixed-splits option. After 60 epochs, the model plateaued at 79.9% character accuracy. When applied to the test set, the accuracy remained at 78.06%, a drop of less than two points.

I trained a second model using the same parameters2 but without the --fixed-splits option, allowing Kraken to shuffle the train and validation sets into a 90/10 split (the test set was left untouched). This time, training lasted 157 epochs before stopping, the best model reaching 92.8% accuracy on the validation set. When applied to the test set, however, the model lost 7 points of accuracy (85.24%).

[Figure: Learning curve (character and word accuracy) for the model trained on the fixed "feature"-based split between train and validation.]
[Figure: Learning curve (character and word accuracy) for the model trained on the random split between train and validation.]

Although disappointing, this was consistent with the observations made when training the CATMuS Medieval model:

As anticipated, the "General" split exhibits lower CER, given the absence of out-of-domain documents, whereas the "Feature"-based split surpasses 10%. This higher score presents an intriguing challenge for developing more domain-specific models that consider factors such as script type and language. (from Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al.. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece. ⟨hal-04453952⟩ p. 15)

So, the drop in accuracy observed on the test set is, as suggested in Clérice et al., 2024, likely due to the fact that with a fixed split, the model is both validated and tested against out-of-domain hands and documents (although the documents differ between the two sets). On the other hand, the model trained with a random split is validated against known hands and documents, but tested on out-of-domain examples.

The test set contains transcriptions of printed, typewritten and handwritten texts, covering all centuries. Limiting ourselves to only one accuracy score obtained on the whole test set would tell us very little about the model's capacity and its limitations. This is why I divided the test set into several smaller test sets based on the century of the documents and/or on the main type of writing present in the documents. For documents spanning over several centuries, I used the most represented century.
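
To give an idea of what this breakdown looks like in practice, here is a rough sketch, assuming the test lines are available as a table with hypothetical "century" and "writing_type" columns:

```python
# Rough sketch of the breakdown of the test set into smaller test sets.
import pandas as pd

test_df = pd.read_csv("mccatmus_test_lines.csv")  # hypothetical export of the test set

subsets = {"All": test_df}
for wtype in ("handwritten", "printed", "typewritten"):
    subsets[wtype.capitalize()] = test_df[test_df["writing_type"] == wtype]
for century in range(17, 22):
    in_century = test_df[test_df["century"] == century]
    subsets[f"cent. {century}"] = in_century
    subsets[f"(HW) cent. {century}"] = in_century[in_century["writing_type"] == "handwritten"]

for name, subset in subsets.items():
    print(name, len(subset), "lines")
```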

I only used the McCATMuS model trained on the random split for these tests, because the accuracy of the other one was too low for the results to be meaningful. Instead of only testing McCATMuS, I also ran Manu McFrench V3 and McFondue on the McCATMuS test set. They are two generic models trained on similar data (although with no normalization, or with different normalization approaches).

| Test set | McCATMuS | Manu McFrench V3 | McFondue |
| --- | --- | --- | --- |
| All | 85.24 | 91.17 | 76.12 |
| Handwritten | 78.72 | 89.40 | 75.17 |
| Print | 96.37 | 94.15 | 78.30 |
| Typewritten | 90.93 | 92.69 | 58.13 |
| 17th cent. | 87.27 | 86.39 | 72.81 |
| 18th cent. | 88.65 | 94.21 | 81.64 |
| 19th cent. | 79.81 | 93.70 | 75.46 |
| 20th cent. | 74.92 | 86.52 | 56.74 |
| 21st cent. | 73.86 | 90.20 | 68.04 |
| (HW) 17th cent. | 58.69 | 64.83 | 64.26 |
| (HW) 18th cent. | 85.38 | 93.35 | 80.47 |
| (HW) 19th cent. | 79.81 | 93.70 | 75.46 |
| (HW) 20th cent. | 63.02 | 82.23 | 55.89 |
| (HW) 21st cent. | 73.86 | 90.20 | 68.04 |

I was initially surprised by the margin Manu McFrench had over McCATMuS on most subsets, considering it was trained on less data (73.9K + 8.8K lines, against 106K + 5.8K lines) which had not been harmonized to follow the same transcription rules. However, these scores are actually biased in favor of Manu McFrench, because several of the documents included in the McCATMuS test set were also used in Manu McFrench's train set. Even though this is not true for all documents, it concerns almost half of the test set. It might also be the case for McFondue, but this model scores higher than McCATMuS in only one instance (handwritten documents from the 17th century). Creating a new test set, with documents that are not present in any of the train sets but follow the CATMuS guidelines, would be a good way to confirm this bias.

Additionally, I detected an issue in one of the datasets used in the test set: FoNDUE_Wolfflin_Fotosammlung contains some lines of faulty transcriptions, resulting from uncorrected automatic text recognition, which most certainly skew the evaluation of all three models.

Here are a couple of examples of these faulty transcriptions, along with the CER they generate when compared to what would be a correct transcription (the CER is computed with CERberus):

| Line image | Faulty transcription | Correct transcription | CER |
| --- | --- | --- | --- |
| (text line reading, in print, "COLLECTION HANFSTAENGL LONDON") | CSTITHER, KIESERMAEAER AogS. | COLLECTION HANFSTAENGL LONDON | 89.29 |
| (text line reading, in print, "NATIONAL GALLERY") | PEcLioL. | NATIONAL GALLERY | 175.0 |
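
To make the link between a faulty line and its score more concrete, here is a minimal, standalone CER computation (a plain case-sensitive Levenshtein distance divided by the length of the reference line; it is only a rough stand-in for what CERberus actually does):

```python
# Minimal CER sketch: edit distance between two strings, divided by the length
# of the reference, here the faulty line stored in the dataset.
def cer(reference: str, prediction: str) -> float:
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != pred_char),  # substitution or match
            ))
        previous = current
    return 100 * previous[-1] / len(reference)

# A correct prediction needs more edits than the short faulty reference has
# characters, which is how the score climbs above 100%.
print(round(cer("PEcLioL.", "NATIONAL GALLERY"), 2))  # 175.0
```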

I plan to manually check this dataset and update the McCATMuS dataset accordingly. I don't know yet how many lines are affected.

The better accuracy of the Manu McFrench model is not just a product of the biases in the test set. I had the opportunity to apply it, alongside McCATMuS, to two documents, one from the 17th century and one from the 20th century. In both cases, Manu McFrench's transcription seemed more likely to be correct than McCATMuS's. This led me to compare the training parameters used for both models and to start a third training experiment using Manu McFrench's parameters. In this case, the batch size is reduced to 16 (instead of 32) and the Unicode normalization follows NFKD instead of NFD.

If the results of this third training are consistent with the previous experiments, it will be interesting to see whether adding more data to the training set improves the results. I also have yet to test the model in a fine-tuning scenario.

As said at the beginning of this post, these results are preliminary, so I hope to have more to share in the coming weeks.


  1. The command looks like this: cat "./list_of_paths.txt" | xargs -d "\n" ketos compile -o "./binary_dataset.arrow" --random-split .0 .0 1.0 -f alto. 

  2. The configuration of Kraken for training these two models relies on the default network architecture, an NFD Unicode normalization, a learning rate of 0.0001 (1e-4), a batch size of 32, a padding of 16 (default value), and applies augmentation (--augment). The --fixed-splits option is only used for the first model. Following Kraken's default behavior, the training stops when the validation loss does not decrease for 10 epochs (early stopping); this prevents the model from overfitting, which is confirmed by the accuracy scores of the intermediary models on the test set (orange line on the graphs). The training is done on a GPU. 

021 - McCATMuS #4 - Cleaning data, collection metadata

Preparing the data for CATMuS would certainly have taken much more time had I not been able to benefit from Thibault Clérice's experience with CATMuS Medieval. Not only was I able to build on the workflow he set up when building it, but I also relied heavily on his scripts to parse and build the final dataset into PARQUET files that were pushed to HuggingFace. Most of these steps are described in Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al., "CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond", 2024 International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece, which will be presented at the conference in a few days.

For McCATMuS, I started by downloading all the datasets (keeping track of the official releases), then I manually reorganized them so that the transcriptions and images were always under {dataset_repo}/data/{sub_folder}, which made later manipulation easier. Based on the notes I took while filtering the datasets, and after generating a character table for each dataset with Chocomufin, I created several conversion tables to harmonize the transcriptions. The conversions are a mix of single- or multiple-character replacements ([ and [[?]]) and more or less sophisticated replacements based on regular expressions (rows prefixed with #r#, like the one targeting « below).1

Here is a sample of the Chocomufin conversion table used for the LECTAUREP datasets. If a character is replaced by itself, it remains unchanged in the dataset, while replacing it with something else makes it possible either to remove it from the dataset (the ¥) or to harmonize its transcription with the CATMuS guidelines (see œ and ° for example).

char,name,replacement,codepoint,mufidecode,order
#r# «,Repl extra space before LEFT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00AB,,0
#r# »,Repl extra space before RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00BB,,0
[[?]],replace [[?]] with ⟦⟧,⟦⟧,,,0
[?],replace [?] with ⟦⟧,⟦⟧,,,0
),RIGHT PARENTHESIS,),0029,),
m,LATIN SMALL LETTER M,m,006D,m,
É,LATIN CAPITAL LETTER E WITH ACUTE,É,00C9,E,
a,LATIN SMALL LETTER A,a,0061,a,
",",COMMA,",",002C,",",
e,LATIN SMALL LETTER E,e,0065,e,
^,CIRCUMFLEX ACCENT,^,005E,^,
œ,LATIN SMALL LIGATURE OE,oe,0153,oe,
̂,COMBINING CIRCUMFLEX ACCENT,̂,0302,,
W,LATIN CAPITAL LETTER W,W,0057,W,
°,DEGREE SIGN,^o,00B0,*,
¥,YEN SIGN,,00A5,,
½,VULGAR FRACTION ONE HALF,1/2,00BD,0.5,
h,LATIN SMALL LETTER H,h,0068,h,
r,LATIN SMALL LETTER R,r,0072,r,
æ,LATIN SMALL LETTER AE,ae,00E6,ae,
ȼ,LATIN SMALL LETTER C WITH STROKE,c,023C,c,
∟,RIGHT ANGLE,,221F,[UNKNOWN],
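
For readers who have never used chocomufin, the logic of such a table can be approximated in a few lines of Python. This is only an illustration of the principle, not the tool itself (it ignores the "order" column, among other things), and the file name is made up:

```python
# Illustration of the principle behind a conversion table (not chocomufin itself):
# rows whose "char" starts with "#r#" are treated as regular expressions, the
# others as literal replacements; identical char/replacement pairs change nothing.
import csv
import re

def convert_line(line: str, table_path: str) -> str:
    with open(table_path, newline="", encoding="utf8") as f:
        for row in csv.DictReader(f):
            pattern, replacement = row["char"], row["replacement"]
            if pattern.startswith("#r#"):
                line = re.sub(pattern[len("#r#"):], replacement, line)
            elif pattern != replacement:
                line = line.replace(pattern, replacement)
    return line

# With the table above: ° becomes ^o, ½ becomes 1/2 and ¥ is removed.
print(convert_line("n° 3 ½ ¥", "lectaurep_conversion_table.csv"))
```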

It wasn't possible to use a single conversion table for all the datasets because some had different transcription approaches. While replacing ¬ with - could, in principle, be applied to every dataset, normalizing the way corrections and uncertainties were transcribed was another story. For example, in some of the CREMMA datasets, >< is used to signal a crossed-out word, while in other datasets <> is used. So replacing > with ⟦ and < with ⟧ in >hello< meant that in some cases we would successfully get ⟦hello⟧, while in other cases we would end up with ⟧hello⟦. There are a few documents where I had to manually intervene in the XML file to fix the transcription. In such cases, I fork the dataset repository to keep track of the corrected version of the ground truth, or I push the correction back into the original dataset to create a new, more consistent version.

In general, the converted dataset is saved as .catmus.xml files, which allows us to keep track of the original ground truth and to easily adjust the conversion table later if necessary.

In the second post of this series, I mentioned that "the CATMuS guidelines can (should?) be used as a reference point" and that "if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented." Providing a Chocomufin conversion table along with a dataset that uses project-specific guidelines would be an excellent practice to ensure that the dataset is indeed compatible with CATMuS.

Once all the .catmus.xml files were ready, I created a new metadata table for McCATMuS listing all the subdirectories under each dataset's "data" folder. This table served as a basis to start collecting additional metadata at the document level rather than at the dataset level, such as the language used in the source or the type of writing (printed, handwritten or typewritten). Working at the document level is important because some datasets contain different types of writing and/or are multilingual. In some cases, when a document mixed different languages and/or different types of writing and the distinction could be made at the image level, I manually sorted the images and created two different subfolders. This is what I did in the "Memorials for Jane Lathrop Stanford" dataset, for example: the subfolder "PageX-LettreX" mixed typewritten and handwritten letters, so I sorted them into "PageX-LettreX-handwritten" and "PageX-LettreX-typewritten" in order to have the most accurate metadata possible.

Other metadata included the assignment of a call number (or shelfmark) to each source represented in the datasets. In some cases a call number may apply to multiple subfolders, but in most cases each subfolder is de facto a different document. Retrieving the call number is useful for several reasons: it allows for an accurate assessment of the diversity of documents in McCATMuS, it makes it possible to associate a document with additional metadata found in its institution's catalog, and the list of call numbers can be used during benchmarking or in production to check whether a document is already known to the models trained on the dataset, thus explaining potentially higher accuracy scores.

In the few cases where the source used to build the ground truth did not have a corresponding call number, I simply made one up, using "nobs_" as a signal that it is a made-up call number. Thus, while "cph_paris_tissage_1858/" in "timeuscorpus" is now associated with its corresponding call number at the Paris archive center (Paris, AD75, D1U10 386), CREMMAWiki's "batch-04", which is composed of documents we created for the project, is associated with a made-up one: "nobs_cremma-wikipedia_b04".
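
This is, for instance, the kind of check the shelfmarks make possible during benchmarking (the two sets below are made up for the example, they are not the actual contents of the train and test sets):

```python
# Illustrative check: flag test documents whose call number also appears in a
# model's training data (both sets are made up for the example).
manu_mcfrench_train_shelfmarks = {"Paris, AD75, D1U10 386", "nobs_cremma-wikipedia_b04"}
mccatmus_test_shelfmarks = {"Paris, AD75, D1U10 386", "nobs_some-other-dataset_b01"}

already_seen = mccatmus_test_shelfmarks & manu_mcfrench_train_shelfmarks
print(f"{len(already_seen)} test document(s) also appear in the train set: {sorted(already_seen)}")
```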

In the end, when the PARQUET files are created, the metadata from the table I just presented is collected, along with information extracted from parsing the contents of the XML files. Each piece of metadata is then attached at the text-line level. If you compare McCATMuS with CATMuS Medieval using HuggingFace's dataset viewer, you can see that they don't use exactly the same metadata.
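
Schematically, the packaging step looks like this; it is a toy sketch with made-up values and column names, not the actual script adapted from Thibault's:

```python
# Toy sketch of the packaging step: each text line becomes one row combining the
# line-level content with the document-level metadata (all values are made up).
import pandas as pd

document_metadata = {
    "shelfmark": "Paris, AD75, D1U10 386",
    "language": "French",
    "writing_type": "handwritten",
    "not_before": 1858,
    "not_after": 1858,
    "gen_split": "train",
}

parsed_lines = [  # in reality, these pairs come from parsing the ALTO XML files
    ("line_0001.png", "a first transcribed line"),
    ("line_0002.png", "a second transcribed line"),
]

rows = [{"im": image, "text": text, **document_metadata} for image, text in parsed_lines]
pd.DataFrame(rows).to_parquet("mccatmus_sample.parquet")
```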

"Language", "region type" and "line type" (which are based on the segmOnto classification), "project" and "gen_split" are common to both datasets, along with "shelfmark" I just described above. They both have a "genre" column with similar values (treatise, epistolary, document of practice, etc.). In the case of CATMuS Medieval, "genre" is complemented by "verse" (prose, verse).

Following Thibault's advice, I defined the creation date of a text line using two numbers ("not_before" and "not_after") instead of a single "century" value. This allows for precise dating when it is possible or, on the contrary, for spreading the dating over several centuries when that cannot be avoided; it is more accurate in both cases.

McCATMuS mixes printed, handwritten and typewritten documents, so it was important to have a "writing type" column to help filter the dataset based on this information, for cases where one does not want to mix them. This metadata also makes it possible to use McCATMuS to train a classifier capable of distinguishing between the different types of writing. CATMuS Medieval, on the other hand, contains only handwritten sources, so such metadata would be useless there; instead, it can rely on paleographic classifications to characterize each text line with a "script type" metadata, which includes values such as "caroline", "textualis", "hybrida", etc.

McCATMuS also has a "color" column that helps sort text lines based on whether the source image is colored (true) or in grayscale (false).
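
Here is how a user might take advantage of these columns with the datasets library; the dataset identifier below is a placeholder, and the exact column names may differ slightly from the ones I used in this post:

```python
# Example use of the line-level metadata (dataset ID and column names are
# placeholders; check the dataset card on HuggingFace for the real ones).
from datasets import load_dataset

dataset = load_dataset("CATMuS/modern-and-contemporary", split="train")  # placeholder ID

# keep only handwritten lines from 19th-century sources digitized in color
subset = dataset.filter(
    lambda row: row["writing_type"] == "handwritten"
    and row["not_before"] >= 1800
    and row["not_after"] <= 1900
    and row["color"]
)
print(len(subset), "lines selected")
```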

Although I reused the scripts developed by Thibault to build this dataset, I had to make several modifications to include the new metadata in the PARQUET files and to add additional filtering to the text lines. This included updating the mapping to the segmOnto vocabulary to match what existed in my datasets, or filtering out some types of lines, such as those identified as signatures.2 I also added an update of "writing_type" at the line level whenever the value of "line_type" made it possible to check it, as shown below.

if ":handwritten" in line_type:
    writing_type = "handwritten"
    line_type = line_type.replace(":handwritten", "")
elif ":print" in line_type:
    writing_type = "printed"
    line_type = line_type.replace(":print", "")
elif ":typewritten" in line_type:
    writing_type = "typewritten"
    line_type = line_type.replace(":typewritten", "")
else:
    writing_type = metadata["writing_type"]

In the end, having built such a dataset (the first version of McCATMuS contains 117K text lines!) with such a variety of metadata is very satisfying, although there is room for improvement. I have already mentioned that it would be interesting to have a greater variety of languages in McCATMuS. I also know that some of the values in "writing_type" are not completely accurate, so adding a check based on a classifier might be interesting. Finally, I've noticed that some transcriptions in the "FoNDUE_Wolfflin_Fotosammlung" dataset are not correct at all, probably due to an automatic transcription that wasn't corrected.

However, before we dive into improving McCATMuS, it's important to first examine the accuracy of the models that can be built on top of it! This will be the topic of the next and last post in this series!


  1. To learn more about how chocomufin convert works, just read the software's short documentation. 

  2. I don't think it makes sense to include signatures in a dataset to train a generic model, since the transcription of such lines can be very context specific. 

020 - McCATMuS #3 - Datasets selection

HTR-United made identifying candidate datasets for McCATMuS a piece of cake. Once the rest of the CATMuS community agreed on the period to be covered by a "modern and contemporary" dataset, I created a simple script to parse the content of the HTR-United catalog and list the existing datasets covering documents written in the Latin alphabet and matching our time criteria.

Actually, here is the script!

import requests
import yaml

import pandas as pd

url_latest_htrunited = "https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"

# get the latest htr-united.yml from the main repository
response = requests.get(url_latest_htrunited)
catalog = yaml.safe_load(response.content)

def in_time_scope(dates):
    century_scope_min = 1600
    century_scope_max = 2100
    # we keep any dataset whose time range intersects with the 1600-2100 period
    if not dates.get("notBefore") or not dates.get("notAfter"):
        # skip entries without usable dating information
        return False
    if int(dates.get("notBefore")) < century_scope_min and int(dates.get("notAfter")) < century_scope_min:
        return False
    elif int(dates.get("notBefore")) > century_scope_max and int(dates.get("notAfter")) > century_scope_max:
        return False
    return True

filtered_by_date = []
for entry in catalog:
    if in_time_scope(entry.get("time", {})):
        filtered_by_date.append(entry)
print(f"Found {len(filtered_by_date)} entries matching the time scope.")

targeted_script = "Latn"
filtered_by_script = []
for entry in filtered_by_date:
    if targeted_script in [s.get("iso") for s in entry.get("script")]:
        filtered_by_script.append(entry)
print(f"Found {len(filtered_by_script)} entries matching the script criteria.")

cols = ["Script Type", "Time Span", "Languages", "Repository", "Project Name", "Dataset Name"]

metadata_df = pd.DataFrame(columns=cols)

selected_entries = filtered_by_script
for entry in selected_entries:
    row = {k:"" for k in cols}
    languages = entry.get("language") or []  # avoid crashing when the field is missing
    if len(languages) == 1:
        row["Languages"] = languages[0]
    elif len(languages) > 1:
        row["Languages"] = ", ".join(languages)
    else:
        print("Couldn't find a field for language in this repository")
        row["Languages"] = "no language"
    # get the time span covered by the dataset
    row["Time Span"] = f'{entry.get("time").get("notBefore")}-{entry.get("time").get("notAfter")}'
    row["Project Name"] = entry.get("project-name", "no project name")
    repository = entry.get("url", "no url found")
    if repository.startswith("https://github.com/"):
        row["Repository"] = repository.split("https://github.com/")[-1]
    elif repository.startswith("https://zenodo.org/"):
        row["Repository"] = repository.replace("https://zenodo.org/", "zenodo:")
    else:
        row["Repository"] = repository
    row["Dataset Name"] = entry.get("title", "no title found")
    script_type = entry.get("script-type")
    if script_type == "only-typed":
        row["Script Type"] = "Print"
    elif script_type == "only-manuscript":
        row["Script Type"] = "Handwritten"
    else:
        row["Script Type"] = "Mixed"
    metadata_df.loc[len(metadata_df)] = row

metadata_df

I saved the output as a CSV and proceeded to go through each of the selected datasets and its metadata. I checked several things:

  • I made sure the datasets were available and easy to download. For example, I excluded those requiring manual image retrieval.
  • I checked the format of the data because I decided to initially focus only on datasets available in ALTO XML and PAGE XML.
  • I controlled the overall compatibility between the transcription guidelines used for the dataset and those designed by CATMuS.
  • I also checked the conformity of the dataset when trying to import it into eScriptorium. This import allowed me to detect when there was a discrepancy between the names of the image files and the value given for the source image in the XML files, which prevented the import from running successfully.1
  • Loading a sample of the dataset in eScriptorium also allowed me to visually spot other incompatibilities with CATMuS that may not have been documented by the producers of the data.2
  • Finally, I considered the structure of the repository and, when necessary, the facility to reorganize it into a single data/ folder containing the images and the XML files, often distributed among sub-folders.

I assigned each dataset a priority number from 1 to 6. The lowest number was for datasets compatible with CATMuS without any modification (no dataset was given a priority rank of 1...), while 6 was for massive datasets that would require a nerve-racking script to be built correctly. My grading system is shown below.

  • 1=ready as is
  • 2=needs to be chocomufin-ed
  • 3=requires manual corrections but the dataset is very small, or the dataset is chocomufin/catmus compatible but requires a script to build it
  • 4=requires manual corrections but the dataset is relatively big, or requires a script to be fixed
  • 5=requires manual corrections but the dataset is really big
  • 6=requires manual corrections but the dataset is really big and requires a personalized script to be built

For example, "Notaires de Paris - Bronod" had to be modified to comply with CATMuS requirements. This included replacing [[ and ]] with and , or also to ignore lines containing ¥, a symbol used in LECTAUREP's datasets to transcribe signatures and paraphs. These were straightforward modifications, thanks to Chocomufin. On the complete opposite, "University of Denver Collections as Data - HTR Train and Validation Set JCRS_2020_5_27" is a massive dataset (2660 XML files), but there are segmentation errors in this dataset, creating erroneous transcriptions given the way the line is drawn, and the annotation of the superscripted text is not compatible with CATMuS. To make it compatible with CATMuS, it would be necessary to control and correct each page one by one.

I chose to focus on datasets with priority 2 for the first version of McCATMuS. It'll be possible to add more datasets to CATMuS in later versions, so there was no need to spend too much time manually cleaning datasets. I had 23 datasets with priority 2 to go through.

Identifying eligible datasets was not as time-consuming as cleaning them and collecting additional metadata turned out to be. However, it gave me a good idea of the challenges I would face when trying to aggregate the datasets. I would have liked to find a greater diversity of languages, but this wasn't possible at this stage, mainly because many non-French datasets require more elaborate corrections than applying Chocomufin and were thus given a priority score higher than 2.

The next post will be covering the tedious phase of data cleaning and aggregation, along with metadata collection!


  1. It was the case in "Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)", where the ALTO XML files are not explicitly linked to their corresponding source images. I believe it can be fixed, but it would require creating a script just for this purpose, and the dataset presented other incompatibilities with CATMuS' guidelines. 

  2. For example, "Argus des Brevets" contains some segmentation errors that will need to be corrected manually. 

019 - McCATMuS #2 - Defining guidelines

Previous experiments have shown that conflicting transcription guidelines in training datasets make it less likely that a model will learn to transcribe correctly. This is particularly relevant when it comes to abbreviations, and it's something to keep in mind when merging existing datasets. We didn't really address this when we trained the Manu McFrench model, because it's difficult to retroactively align datasets to follow the same transcription rules. Unless you can afford to manually check every line, of course. In the case of Manu McFrench, however, we only merged datasets that didn't resolve abbreviations, which ensured a minimum of cohesion.

CATMuS was built on the foundation laid by CREMMALab and the annotation guidelines developed by Ariane Pinche at the end of a seminar organized in 2021. These guidelines are intended to be generic, meaning they should be compatible with most transcription situations and are not project-specific. Following these guidelines will help data producers create ground truth that is compatible with data from other projects. It will also help those projects save time by not having to create transcription rules from scratch. From my experience, it is indeed easy for the members of a project discovering HTR to get caught up in the specifics of one project and forget what is and is not relevant (or even complicating) in the transcription phase.

It's worth mentioning that a project can choose to follow some of the CATMuS guidelines, while maintaining more specific rules for certain cases. If that's the case, the CATMuS guidelines can (should?) be used as a reference point. Ideally, the specific rules defined by a project should be retro-compatible with CATMuS. For example, if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented.

As CREMMALab focused on the transcription of medieval manuscripts, so did the first CATMuS dataset and guidelines. As I said in my previous post, I focused on data covering the modern and contemporary periods, for which there was no equivalent to the CREMMALab guidelines. So, when extending CATMuS to these periods, I started by collecting existing guidelines and comparing them. I used the CREMMA Medieval guidelines, the CREMMA guidelines for modern and contemporary documents, SETAF's guidelines and CATMuS Print's guidelines as a basis to elaborate the transcription rules for McCATMuS.

For each rubric, I compared what each set of rules suggested, when they covered it. It was rare for all guidelines to align, but some cases were easy to solve. For example, all the guidelines recommended not differentiating between the regular s (⟨s⟩) and the long s (⟨ſ⟩), except for the rules I had set for the modern and contemporary sources transcribed by CREMMA in 2021, before the CREMMALab seminar. It was thus decided that for McCATMuS there would be no distinction between the different types of s.

Some rubrics needed to be discussed to figure out why the rule had been chosen in the first place by some of the projects, to decide which one to keep for McCATMuS. In February, I met with Ariane Pinche and Simon Gabay to go over the rubrics that still needed to be set. One example of a rule we discussed is how hyphenations are handled. CATMuS Medieval and the two CREMMA guidelines say to always use the same symbol (⟨-⟩), whereas for the SETAF and CATMuS Print datasets, inline hyphenations (⟨-⟩) are differentiated from hyphenations at the end of a line (⟨¬⟩). Other symbols, like ⟨⸗⟩, were unanimously rejected.

Two factors were considered when making those decisions: the feasibility of a retro-conversion for the existing datasets and the compatibility of the rule with a maximum of projects. In the case of hyphenations, I eventually decided to follow the same rule as CATMuS Medieval and CREMMA. On top of simplifying the compatibility of McCATMuS with CATMuS Medieval, I found that replacing all ⟨¬⟩ with ⟨-⟩ was much more straightforward than retroactively placing ⟨¬⟩ where there was indeed a hyphenation at the end of a line.1
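
The asymmetry is easy to see in code: converting a dataset that uses ⟨¬⟩ is a one-liner, whereas the reverse operation cannot be automated (the line content below is made up):

```python
# Going from the SETAF/CATMuS Print convention to the McCATMuS one is trivial...
setaf_line = "a hyphenated wo¬"
mccatmus_line = setaf_line.replace("¬", "-")  # "a hyphenated wo-"

# ...whereas the reverse would require deciding, line by line, whether a trailing
# "-" marks a real end-of-line hyphenation or a simple typographic decoration.
```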

Once the set of rules was fixed, I used it to sort through the different datasets I had identified (I'll discuss this in the next post) and to decide which ones would be retained for McCATMuS v1. I also defined the transformation scenarios necessary to turn each of these datasets into a CATMuS-compatible version. Then, once McCATMuS v1 was ready, I integrated the modern and contemporary guidelines into the CATMuS website, where the transcription guidelines for CATMuS Medieval were already published.

Now that I am done integrating the rules set for McCATMuS into the website, I am confident that we have successfully designed rules that are overall compatible across the medieval, modern and contemporary periods, despite some unavoidable exceptions. Two good examples of the impossibility of covering a whole millennium of document production with a single rule are abbreviations and punctuation signs.

I've now explained how the transcription guidelines were established for McCATMuS. Next, I'll cover how they were integrated into existing datasets to create the first version of the McCATMuS dataset.


  1. You can't assume that every instance of ⟨-⟩ at the end of a line must be replaced with a ⟨¬⟩. In many cases, this can be a simple typographic decoration marking the end of a paragraph or the end of a title. 

018 - McCATMuS #1 - Overview

Last week, I attended ADHO's annual conference in Washington DC. I presented a short paper, co-authored with Floriane Chiffoleau and Hugo Scheithauer, about the documentation we wrote for eScriptorium (I wrote a post about it last year and you can also find our presentation here). I was also a co-author on a long paper presented by Ariane Pinche on the CATMuS Medieval dataset.

CATMuS, which stands for "Consistent Approach to Transcribing ManuScripts", is a collective initiative and a framework to aggregate ground truth datasets that use compatible transcription guidelines for documents from different periods written in Romance languages. It started with CATMuS Medieval, but since January this year, I have been working on a version of CATMuS for the modern and contemporary periods.

While I should (and will) try to publish a data paper on CATMuS Modern & Contemporary (I'll call it McCatmus from now on), I figured I could start with a series of blog posts here. I want to describe the various steps I followed in order to eventually release a dataset on HuggingFace and hopefully soon the corresponding transcription model.

I started working on McCatmus in January but, because of a major personal event (I moved to Canada!), it took seven months of stop-and-go before the release of the V1. This was particularly challenging due to the scale of the project and its technicality: it was hard to get back into McCatmus after several weeks of interruption, which happened several times.

To add to this complexity, McCatmus was also a multi-front operation. Indeed, to create McCatmus, it was necessary to:

  • define transcription guidelines in collaboration with other data producers,
  • identify datasets compatible with the guidelines and set priorities,
  • actually make all the datasets compatible with each other and clean some of the data,
  • model and collect metadata that made sense for this dataset,
  • release the dataset and fix the issues that came up.

To date, two tasks remain on my to-do list for McCatmus: training a transcription model corresponding to this dataset and comparing it with other existing ones, and making sure to have a publication describing this dataset and its usefulness.

My plan is to dedicate one post to the creation of the guidelines for the dataset, then a post about the identification and collection of the datasets used in McCatmus v1, and then I'll wrap up with a post about the process to create the dataset, the metadata and the release. Stay tuned!

017 - Deploying eScriptorium online: notes on CREMMA's server specifications

eScriptorium is a web application designed to perform automatic text recognition campaigns, by default powered by the OCR/HTR engine Kraken. It comes in a decentralized form, meaning that the application is not distributed by a single organization but can, on the contrary, be deployed by several actors on many different servers. In fact, you can also deploy eScriptorium on your personal machine, simulating a local server.1

As eScriptorium is gaining attention, more institutions are interested in building their own server to host the application and offer it to their associates. At Inria, we deployed eScriptorium for the first time in 2020, specifically for the project called LECTAUREP, which we ran with the French national archives between 2018 and 2021. While the initial server was hosted on a virtual machine, without any GPU, and open to a relatively small number of users, our current eScriptorium application already counts nearly 500 users and will soon be hosted on a very different server infrastructure, funded by the CREMMA project. Between the original LECTAUREP-eScriptorium server and the CREMMA server, we moved to a dedicated server (Traces-6) for which we invested about 20K€.

Since I have been regularly in touch with people from different institutions who were looking into buying the hardware to create their own server for eScriptorium, I thought it was high time to put all the deets in writing!

To write today's post, I'm very happy to welcome a second pair of hands: Thibault Clérice's. His expertise and involvement in designing the CREMMA server are crucial here!

Let's first discuss some technical requirements, then we'll describe how the CREMMA server was designed. We'll finish with some very important remarks on whether or not you actually need to build a server, and on useful alternatives for the community!

Should you buy GPUs?

GPUs (or Graphics Processing Units) are not mandatory at all when you use eScriptorium. This is the reason why it is perfectly acceptable to run eScriptorium locally, on your own computer. Actually, GPUs are not even mandatory to train Kraken models: training can be done on CPUs (your computer's processor), it will simply be much, much slower.

That, however, is true for personal or light use of the training features. If on the contrary you create a server open to dozens of users or more, then connecting eScriptorium to GPUs is very much a good idea: since training a model on a CPU alone can take 2-3 days (or much more), you don't really want 10 users to start a training task at the same time. In the absence of shared GPUs, their training will be queued for days or even weeks and the overload might degrade the experience of other users on the rest of the application. As long as we are building an infrastructure (and hopefully sharing costs), we may as well enhance the experience of everyone, no?

This being said, you shouldn't rush and go buy a GPU right away. Instead, you should first look at options to optimize its usage or at infrastructures that are already available to you. For example, the FONDuE infrastructure, at the University of Geneva, doesn't use the GPUs only for eScriptorium: they connect their application to a cluster which is used by researchers for intense computation tasks outside of eScriptorium (it's an HPC with a university-wide queue controlled by SLURM). This is a very good solution for optimization, because training Kraken models is not a constant activity: if the GPU is dedicated to eScriptorium only, then it will be used for a few hours here and there, not even at 100% of its capacity. Think of it: users of the application will usually need to train a model at the beginning of their transcription campaign, therefore once they have an accurate model, they will focus on using the model for prediction, which doesn't rely on the GPUs (and Kraken isn't really optimized for GPU usage at prediction time anyway).

Other possibilities include connecting the server to a completely physically separate cluster where training jobs are submitted. This is an option several people told me they were exploring, but I don't know if anyone has set it up already. Why would you opt for a solution with an external cluster? To replace a huge investment cost (original funding) with smaller (but much more regular) operating costs: for example, for CREMMA, nearly half of our 40K€ budget was spent, in 2022, on buying two A100 graphic cards from Nvidia. When using someone else's GPUs, not only do you save the money you would spend on the hardware, but on top of that, you contribute to optimizing the use of GPUs already in place. Another reason is that you might not have the human resources to administer the system and the GPUs. There are multiple computation clusters created for academia (off the top of our heads: Jean Zay or Calcul Québec), and you could even consider commercial solutions as well (like AWS, Google Cloud and the like). Then, your money is spent on the actual computation and not on making the computation possible in the first place.

Fair enough, plugging eScriptorium's task manager into an external cluster might not be that simple. However, for smaller groups of users, it is also worth noting that it is perfectly possible to train Kraken models using Kraken directly (through an SSH connection to a (super-)cluster, for example) before uploading them into the application. In such a case, eScriptorium is only used for its ergonomics, not as a simplified interface to train models.

Let's summarize the point here: GPUs are not always a must-have for eScriptorium or Kraken, so you should first and foremost consider your future usage. They currently represent the biggest share of the hardware expenses required to build a computation server. There are options out there where you don't spend 10K€ on a GPU but rather connect to an external, ready-to-use service. Or, if you do decide to spend the money, you should consider ways to maximize its usage for other training tasks, possibly outside of eScriptorium.

Some considerations on storage

Normally, eScriptorium is used as an (assisted) annotation environment to obtain the transcription of documents. You would use eScriptorium:

  1. In a preparatory phase:
    • (1a) to produce training data, and
    • (1b) to elaborate (aka train) performant segmentation or transcription models;
  2. In a production phase, but only for relatively small corpora, to apply segmentation and transcription models and manually correct the results (in which case the size of the corpora must be compatible with the scale of what an individual or your assembled team can process);
  3. In a post-production phase, including for samples of a very large corpus, to easily visualize and control the result of the (large-scale) automatic prediction and potentially correct it (cf. n°2).

On the other hand, large-scale transcription campaigns should probably be led with Kraken directly in the command line (so only n°1 and n°3 require eScriptorium). Thibault has even produced a small Python library to design such campaigns (RTK, for Release the Krakens), which was recently used in a paper2 where a 38.5M-token corpus was produced. In some cases, n°1b even benefits from being performed outside of eScriptorium, since the application offers very limited control over Kraken's training parameters.

This has several consequences on the way you should consider storage on a server dedicated to eScriptorium. Duplicates of the images are created on the server while they are being processed in the application, but they should always be considered as such: temporary duplicates, kept while phase 1, 2 or 3 is in progress. They shouldn't be treated as if eScriptorium were 1) an archiving solution for transcription projects, 2) a querying interface to explore a corpus or even 3) a publication environment for a minimalistic digital edition. eScriptorium is only one brick --an early one even-- in the corresponding pipelines. Instead, the original image files should be stored somewhere else, in an adapted data warehouse (like Zenodo, Nakala, etc.), or published in digital libraries under the responsibility of their owner (like Internet Archive, Gallica, etc.).

What this means when designing a server to host eScriptorium is that its storage capacity should of course be big enough to store the temporary image files,3 while users are working on their annotations, aka the active projects. However, this storage doesn't need to be expanded all the time, and it should also be OK to flush terminated projects: at that point, the images and their annotations should have been archived in more appropriate data warehouses by their creators, and that is their responsibility.

Don't forget the RAM!

Not overlooking the RAM is very important when designing your server! But what is it used for? It's used as cache by the web application: frequently accessed data, like web pages and images but also the content of the database, are temporarily loaded into live memory. The cache thus ensures that the requests sent by the users are served quickly. For example, if you don't have enough RAM (or enough cache), pages will load slowly, and if you have used eScriptorium before reading this post, you know how important it is to be able to load images fast enough.

RAM is also essential for inference and training, because images and annotations are loaded into memory before being passed to the CPU or the GPU. If there is not enough RAM, it will be detrimental to computation and will create a bottleneck. Thus, having invested in GPUs and/or CPUs but not in enough RAM would be like having a horse pull a Ferrari: even if prediction and training could go fast on the processing units, they will be restrained by the available live memory.

Modularity for the CREMMA infrastructure

The CREMMA infrastructure was originally designed by Thibault with a simple but essential principle in mind: modularity. Instead of thinking of an eScriptorium server as a monolithic block of hardware handling the front-end service, storage and intense computation, he suggested breaking each of these blocks into individual servers connected together. CREMMA4 is thus made of at least three servers, as shown in the schema below:

  • CREMMA_FRONTEND, for the front-end, where the application is deployed and where the database is stored.
  • CREMMA_STORAGE, for storage, where all the images and models, as well as the backup of the database are stored on the long term. Currently, CREMMA_STORAGE has a storage capacity of 38Tb5 but we could easily add more disks if we find that it is necessary.
  • CREMMA_COMPUTE, where the two A100 GPUs I mentioned earlier are plugged and where the application task manager "sends" all the jobs, whether they are to be run on CPU (these tasks include segmentation and transcription prediction for example), or on GPU (training for the most part).

A model of the CREMMA infrastructure where three blocks (front-end, storage and compute) are connected together through an intranet 10Gb/s connection. For each block, one or two servers are presented along with their specifications. Credits: Thibault Clérice and Alix Chagué. The full text of the specifications is accessible in a comment in the source code of this page, just after this image.

As you can see in the schema, there will actually be a fourth server involved in the infrastructure: Traces-6, the server we currently use to deploy eScriptorium at Inria. Like CREMMA_COMPUTE, Traces-6 can be called by CREMMA_FRONTEND for computation tasks. In fact, this is where the modularity of the system gets interesting: with such a set-up, it is possible to add more computation servers to the pool of GPUs reachable by CREMMA_FRONTEND without having to redesign the whole infrastructure. On their side, CREMMA_FRONTEND and CREMMA_STORAGE can be upgraded (to add more RAM or more storage) very easily.

This modularity also means that the GPUs remain free for other uses: for example, if we have to run maintenance on CREMMA_COMPUTE, we can simply cut it off from the infrastructure and let CREMMA_FRONTEND interact with Traces-6 only while we work on CREMMA_COMPUTE.

CREMMA_COMPUTE is equipped with two A100 graphic cards, and Traces-6 with two RTX 6000. However, this doesn't mean that only 4 trainings can happen at once. Each of these GPUs offers between 24 and 40 Gb of RAM for intense computation. It's a lot. It's so much, actually, that training a Kraken model at max speed would rarely use more than 40% of that capacity. Virtualization is a nice trick to "break" the GPU down into smaller virtual GPUs (or vGPUs). What is broken down is the RAM capacity. We opted for the following virtualization set-up:

  • Each of the A100 graphic cards and their 40Gb of RAM are turned into 1 10Gb vGPU + 5 5Gb vGPUs (since 10+5x5=35, note that we must leave 5Gb out of the equation for the virtualization).
  • No virtualization is applied to Traces-6's RTX6000s.

How did we decide on these numbers? Thibault ran a series of small tests executing either segtrain or train and playing with two different parameters: the batch size6 and the floating-point precision7. He found that for training a recognition model with a batch size of 8 and a precision of either 32 or 16, less than 5 Gb of GPU RAM is enough. With a batch size of 1 and a precision of 32, it's even less than 1 Gb. To train a segmentation model, less than 10Gb is enough, and this type of training is rarer. Since our goal for the infrastructure is not to maximize the speed of a single training but to maximize the number of parallel training jobs running at a decent speed, we decided that 10 vGPUs with 5Gb of RAM and 2 vGPUs with 10Gb of RAM were a good compromise. If we find that more GPU RAM is occasionally needed, we still have two times 24Gb with the RTX6000s!

Should you build your own server?

We have spent all this time writing about how to build, how to spec out your server or your infrastructure, but let's talk about the elephant in the room: should you do it?

Well, it's all a matter of perspective. We'd say it probably makes sense if:

  1. You are a very big organization, you have a lot of money available to you, a super-cluster (and possibly a well staffed IT services department), and you have a high demand;
  2. You are working on very sensitive data that can't be shared with the outside (e.g. medical reports);
  3. You are geographically far away from any other existing server, and face latency issues when you connect to potential welcoming servers;
  4. The servers that already exist around you are reluctant to onboard you and the teams behind your request, which leaves you little choice but to build a server of your own.

These four points are definitely valid. But we'd say that, if you are in another situation, sharing infrastructural costs probably makes way more sense. In our experience, building a server is long and tedious, requires special (and rare) skills,8 and is costly (in terms of human resources as well!). Setting up a working server can take a really long time. For CREMMA, we ended up outsourcing part of the installation of the new infrastructure because we realized that we had neither the time nor the skills to set everything up ourselves. The cost of this installation by a third party? Between 8 and 12K€, and, again, a bit of time and bandwidth on our end.

Next, you have the maintenance fees. You can outsource them, for a small bill, to a company that will make sure that everything is installed on time, that updates work well, etc. Or you can do the maintenance yourself. But again, this comes with a cost: human time. A worker on the server goes down? You are in for a few hours. Some users crashed a third-party server by importing too many IIIF images into your instance of eScriptorium? Well, then you will not only receive emails from these third parties (and this is completely normal), but you will also have to deal with your user base doing things that eScriptorium allows and that you may not (yet) be able to control or limit.

In the end, we would definitely recommend that, when this is possible, you first consider joining existing servers, including by offering quid pro quo by:

  1. Participating in covering the salary of people maintaining the server (through some kind of yearly fees for example);
  2. Providing some money to expand the existing infrastructure (to increase storage or computation, etc);
  3. In general, helping eScriptorium grow, by discussing with the owners of the server you are joining and/or the eScriptorium team about what kind of new functionality should be added, and whether you can contribute to funding these updates.

This final point is super important: sure, owning your own server sounds appealing, even if it is costly to put in place. However, developing eScriptorium also comes with expenses. Thus, participating in eScriptorium directly -- we think -- is also very beneficial and welcomed by the development team. Open source is free to use, free of charge, but it does not appear out of thin air: development costs money. And the more people participate in infrastructural costs (servers or software), the better the experience will be.


  1. If you don't know anything about local servers and are curious to learn more, you can check this page: https://www.freecodecamp.org/news/what-is-localhost/. Or you can also take a look at the corresponding entry in Wikipedia! 

  2. The full reference is: Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice, et al.. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. Computational Humanities Research (CHR 2023), Dec 2023, Paris, France. ⟨hal-04250657⟩ 

  3. By temporary, we don't mean that the image files are stored for a few hours only; on the contrary, they can stay on the disk for many years. We mean that it should be OK to consider that they can be erased whenever a user is done working on a corpus and has moved away from the transcription phase. 

  4. From now on, "CREMMA" means the server created through the CREMMA project. 

  5. Safety first! We have 38 Tb available, but there is actually a little more physically because we have redundancy and spare. We have 2 series of disks working with redundancy (RaidZ). In each series two disks are entirely dedicated to redundancy only, and one more is completely unused until something fails (it is used as a safety spare disk). While CREMMA_STORAGE, as we said before, is not used as a permanent storage solution, it needs to be a little bit safe for the user base. 

  6. To understand what the batch size corresponds to and why it is important, you can check this entry in the Stack Exchange forum: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

  7. To quote Kraken's documentation: "When using an Nvidia GPU, set the --precision option to 16 to use automatic mixed precision (AMP). This can provide significant speedup without any loss in accuracy." Kraken's default value for precision is 32. 

  8. It can be difficult to justify hiring a full-time or even part-time system administrator for a team, because it is a very specialized profile that is in high demand. For example, public organizations can rarely offer competitive salaries compared to the private sector. In addition, the workload for administrating a web server can be irregular, and it can be difficult to match system administration skills with the other needs of a team, which makes it even harder to offer a meaningful full-time job.