A little disclaimer for once, because I usually prefer to praise when I name people. I do not know Dan Cohen or his work. My criticism of his article is not directed at him personally; rather, it takes his text as one example among many that develop the same type of discourse and contain the same type of flaws.

A second disclaimer: I moved the original French version of this post here: posts/025-fr.md.

Earlier this week, my colleague Louis-Olivier Brassard asked me for my opinion on the latest post by Dan Cohen, which he titled "The Writing Is on the Wall for Handwriting Recognition", adding a subtitle that sets the tone: "One of the hardest problems in digital humanities has finally been solved". I wanted to make my critical reading a bit more public, so I'm turning it into a blog post.

I read this article carefully because the subject is of interest to me (obviously), but I must admit that I usually start this kind of reading with a negative preconception. This is the treatment I reserve for all those posts, whether on blogs or social media, that announce left and right that generative AI has revolutionized this or that -- this and that generally being problems that have occupied researchers and engineers for years, and which have given rise to debates that were sometimes heated or even unresolvable. All these posts help fuel the hype around generative AI and undermine our already quite worn collective ability to think critically about it.

Dan Cohen's post follows the release of version 3 of Gemini, Google's generative AI model, publicized as Google's "most intelligent model yet". As happens every time a new model of this type is released, several users share the results of their "experiments" with it. Dan Cohen is not the only one; for example, Mark Humphries also published a post on the subject on the same day, soberly titled "Gemini 3 Solves Handwriting Recognition and it’s a Bitter Lesson". I saw these two posts widely shared on BlueSky, praised by researchers whom I consider to hold positions of authority in the field of automatic transcription. After reading Dan Cohen's post, I found myself quite annoyed by these shares: I'm not convinced that the text was read closely by those who shared it on BlueSky.

In my opinion, the problem with Dan Cohen's post is twofold: 1) he develops a universal discourse on a tool that he has only tested on a minimal selection of examples that say almost nothing about the problems encountered by users of automatic transcription on old documents, 2) his demonstration relies on fallacious arguments.

A matter of scientific rigor

About the first point: Dan Cohen uses three examples that are not at all representative of the challenges of automatic transcription. Right from the start, this would justify a footnote to his subtitle: he says "one of the hardest problems in digital humanities has finally been solved", I add "as far as it concerns epistolary documents written in English during the first half of the 19th century by personalities whose biographies have been written, or whose correspondence has already been edited"¹ because that's what he tested. That already reduces the scope of his results quite a bit, doesn't it? Moreover, given that the model fails to transcribe the third example, we could even add that this only concerns documents with a simple layout.²

This first point is really problematic because this post is a text published by a person who has scientific authority and should therefore demonstrate scientific rigor, even if we are only talking about a newsletter and not an edited article or book. In keeping with this scientific rigor, I would expect us to limit ourselves to drawing conclusions about what has been successfully demonstrated instead of prophesying doom with flashy (sub)titles. One can be convinced that Gemini is capable of successfully handling many cases other than those presented by Dan Cohen, but that is a matter of belief, not scientific demonstration. I think this is a topic that needs to be discussed more broadly, in a context where AI is served up to us messianically in every possible guise. Marcello Vitali-Rosati talks about it well in his latest post; from another angle, and outside the uses made of it by the academic world, there is also the recent work of Hamilton Mann.

As it happens, on the day Louis-Olivier asked me to read Dan Cohen's text, I had also read a post by Sunayani Bhattacharya, who trained her students at Saint Mary's College of California in text analysis with Voyant Tools and who also mentioned automatic transcription in passing. She explains that, with the aim of opening her students up to the Global South, she had them work on texts in Bengali (even though none of them can speak or read Bengali). I find the exercise interesting and promising as she presents it. After familiarizing her students with what Bengali in properly edited press texts looks like in Voyant Tools, she showed them what you get when you try to run Voyant Tools on texts taken directly from OCR software. These texts contain a lot of noise and sometimes do not even use the correct character sets. This allows her to give her students a very concrete example of the limitations of software infrastructures when it comes to processing texts in Indic languages. She concludes by reiterating the usefulness of giving students a better idea of what on-the-ground anglophone biases look like in technology. In a text like the one I discuss in this post, this anglophone bias (and, I would even add, modernist bias) is blatant.

A shaky demonstration

Now, regarding the second point, it requires taking a closer look at what Dan Cohen tells us and the examples he gives. There are inaccuracies that need to be pointed out, but also excerpts that do not correspond to the statements made in the post.

Let's start with an inaccuracy that actually concerns the question of model accuracy. I have already discussed this in a previous post because it seems to me that this is one of the topics where researchers are laziest: what accuracy are we talking about, and what are the limits of these accuracy measures? Dan Cohen states that "the best HTR software struggles to reach 80% accuracy". As he clarifies that this means 2 wrong words out of 10 words, we already see that he is talking about word error rate and not character error rate. Such an error rate, on its own, says nothing about the readability of the text, since a single wrong character is enough for a whole word to be counted as wrong. In a sentence like "the hardest problem in digtial humaities has finolly beeen sol ved", one word out of two contains a mistake, yet it seems to me that the sentence is perfectly readable.³ To put things into perspective, the character accuracy rate in this same sentence is 90.77% (according to software like KaMI). In addition to this initial inaccuracy, Dan Cohen's statement about the difficulties of traditional software seems false to me. I do not see what source he is relying on. For documents like those he tests, we are well above 80% accuracy, even at the word level, and this holds for several models and several software packages using RCNNs or Transformers.
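To make the gap between the two measures concrete, here is a minimal sketch in Python. It assumes a plain Levenshtein distance with no normalization of case, punctuation or whitespace; it is not KaMI's implementation, and the function names `levenshtein` and `accuracy` are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings, or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """1 - (edit distance / reference length), at whatever unit is passed in."""
    return 1 - levenshtein(ref, hyp) / len(ref)


reference = "the hardest problem in digital humanities has finally been solved"
noisy = "the hardest problem in digtial humaities has finolly beeen sol ved"

# Character level: the strings are compared character by character.
char_acc = accuracy(reference, noisy)
# Word level: the same distance, computed over whitespace-separated tokens.
word_acc = accuracy(reference.split(), noisy.split())

print(f"character accuracy: {char_acc:.2%}")  # 90.77%
print(f"word accuracy: {word_acc:.2%}")       # 40.00%
```

With this strict scoring, the six character edits weigh little against 65 reference characters, whereas at the word level the split "sol ved" counts as two edits (one substitution and one insertion), so the word-level score drops even below the "one word out of two" impression.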

Since this initial statement surprised me, I wanted to look more closely at Transkribus' output to see whether it really made this many errors. Of course, there are errors in Transkribus' transcriptions. Yet, when we look at the source document, we see that some of these errors are understandable in a zero-shot context. When Boole draws two "l"s in a row, his second "l" looks like an "e" with a very small loop. This explains why Transkribus' prediction contains errors on "tell" (read as "tele") on the left page, and "All" (read as "Ale") on the right page. To find out the real extent of Transkribus' errors, I made my own transcription of the double page tested by Dan Cohen, line by line (following the line order taken from the segmentation in Transkribus, and helping myself a bit with the reading proposed by Gemini⁴). When I calculate the accuracy rate on this excerpt, I get a character accuracy of about 95% and a word accuracy of 88%.⁵ So there is plenty of room for improvement, but we are not in the catastrophic situation that the preamble suggests.
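The footnote attached to these figures mentions Cyrillic look-alikes in Transkribus' output. As a minimal sketch of how such characters can be surfaced before computing a score, the following Python snippet relies on the standard `unicodedata` module. The helper name `non_latin_letters` and the sample string are hypothetical; the sample is made up and is not the actual Transkribus output.

```python
import unicodedata


def non_latin_letters(text):
    """List (position, character, Unicode name) for letters outside the Latin script."""
    flagged = []
    for position, char in enumerate(text):
        name = unicodedata.name(char, "UNNAMED")
        if char.isalpha() and not name.startswith("LATIN"):
            flagged.append((position, char, name))
    return flagged


# Made-up sample: U+0432 and U+0440 are the Cyrillic letters that look like
# a small-capital "B" and a Latin "p"; the rest of the string is plain Latin.
sample = "\u0432 \u0440 Boole"

for position, char, name in non_latin_letters(sample):
    print(f"position {position}: {char!r} is U+{ord(char):04X} {name}")

# Visually close, but a different code point, so a metric counts it as an error.
print("\u0432" == "B")
```

A check of this kind only flags the script of each letter; deciding whether a flagged character is a genuine error or a legitimate part of the source remains a human judgment.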

If we now turn to the transcription generated by Gemini, we can see that it actually contains some errors as well, whereas Dan Cohen tells us that "Gemini transcribed the letter perfectly". For example, Gemini transcribes, on the right page, "occasionally by",⁶ adding as a clarification in a notes section that "On the right page (line 8), the handwriting becomes very scribbled. It appears to say 'take a long walk occasionally try & once or twice...' or possibly 'occasionally by & once or twice...'." Gemini fails here to propose a reading of the hyphenation that makes sense and prefers to add a word to its transcription. The problem is of course not that Gemini failed to produce a perfect transcription, but rather that Dan Cohen claims it did without noting this error.

We have the same issue in the second example, where Gemini formats the word "transmitted" to indicate that it is crossed out in the source when it is not. The text generated by Gemini expresses no doubt about what the source looks like, and invents an intention on the part of the author: "In the second line of the body, the word 'transmitted' is crossed out in the original text, but the sentence is grammatically incomplete without it (or a similar verb). It is likely the author meant to replace it to avoid repetition with the word 'transmitting' appearing a few lines later but forgot to insert the new word." Although this error was easier to spot, Dan Cohen once again tells us: "Another perfect job."

Then comes the third example. Gemini does not offer a complete transcription of this one and, after a few lines, generates a message indicating that the text is illegible beyond a certain point. This allows Dan Cohen to conclude: "Gemini does the right thing here: rather than venture a guess like a sycophantic chatbot, it is candid when it can’t interpret a section of the letter." I personally choke reading that, given the errors already noted in the two previous examples. Contrary to what Dan Cohen claims, there is no candor here, but rather a perverse effect of what I imagine is a calibration of the model based on its perplexity. In the first two examples, we can imagine that the model's perplexity regarding certain difficult passages leads to the generation of a note and/or an insert in brackets, but does not prevent the generation of a false transcription. The false transcription goes unnoticed all the more easily because the explanations generated in the notes sound good, even if they are false. We are not dealing with a candid robot, but rather with a scammer chatbot, a presti-generator, which finds an escape route when the situation is too big for a subtle feint. And in my opinion, it is high time that users of these tools took this reality on board and looked even more closely when they check what these tools generate.

I haven't yet read the post by Mark Humphries that I mentioned at the beginning, but I might come back to the subject in the future. To be honest, what I find really unfortunate about these publications, which come from the academic world and help to fuel the hysteria around generative AI, is that they give me the impression that Salvation will decidedly not come from the scientific community. As a citizen and a young researcher, this worries me a lot.

EDIT: 2025-12-01: Minor corrections and addition of another footnote.

EDIT: 2025-12-04: Translated the post to English (with the help of Copilot) and moved the French version to another path: posts/025-fr.md.

¹ I add this detail about published biographies and edited correspondence because it matters: Dan Cohen did not pick documents that we can be sure are unpublished. Given that generative AI models are trained on everything that can be found on the Web, these letters may, in one way or another, have been part of the batches used for training. For example, on the website of the Archives of University College Cork, from which the digitization of Boole's letter is taken, we find the following text in the description field: "Boole in Cork to Maryann. He is in a very depressed mood, life has become monotonous with only his work adding interest to the day. He enjoys playing the piano but 'it would be better with someone else to listen and to be listened to'. He is also very annoyed by [Cropers] dedicating his book to him without first asking for permission - 'I cannot help feeling that he has taken a great liberty' - and speaks in strong terms of [Cropers] 'pretensions to high morality'. He invites and urges Maryann to visit him as soon as their mother's health would allow. He feels the climate would do her good." These are contextual elements that can help a model when transcribing.
² I purposefully use the term "simple layout" rather than "standard layout" because the phenomenon illustrated by the third example, writing again on the same sheet after turning it 90°, corresponds to a practice that can be found at least until the mid-20th century.
³ By readable, I mean that one does not need to know what the original sentence was to understand what we should have read in place of the errors. I admit, however, that this readability may vary depending on familiarity with the text or the language, or on the nature of the errors. If you still find this sentence unreadable, it should be read as follows: "the hardest problem in digital humanities has finally been solved". There is one letter inversion in "digital", one missing letter in "humanities", one letter substituted for another in "finally", one extra letter in "been" and an inappropriate separation in "solved".
⁴ A brief word on the question of layout. In Gemini's transcription, there are additional pieces of information that suggest that the model correctly identified which part of the text corresponds to which page. In Transkribus' transcription, this is not the case, but I think it's because Dan Cohen only used Transkribus' basic web page for testing models. If he had used the full version of Transkribus, I'm sure the software would also have perfectly identified the double-page layout. As for the line-by-line transcription, this information is lost in Gemini's transcription, which generates the text continuously.
⁵ Among the errors made by Transkribus, we can also note the use of a "в" (the Cyrillic v) to transcribe the "B" in the margin of the document, and a "р" (the Cyrillic r) to transcribe the "P" that follows. These are errors that escape us when we do a quick visual check and that do not hinder reading by humans, but that lower the accuracy calculated automatically, since a в is not a B and a р is not a P, nor indeed a p (see what I did here?).
⁶ Transkribus transcribed it as "occasion by".

A research (b)log

025 - A Perfect Job is the New Very Good Job
