
012 - "It did a very good job"

A few weeks ago, I attended the presentation of an automatic transcription software. The majority of the audience was unfamiliar with the concept of handwritten text recognition (HTR) or had little experience using it. The presentation lasted only an hour, so it couldn't delve into much detail. Its main objective was to demonstrate the software's results. The presenter showed several slides, displaying on one side images of manuscripts (often in a language unknown to the audience) and on the other side the transcriptions generated by the software. Throughout the presentation, the presenter repeatedly commented on the HTR software saying that "it did a very good job."

But what does that even mean?

The very first aspect to explore is what distinguishes a good job from a bad one. Normally, such an evaluation relies on measuring the accuracy of the result against an ideal transcription. Accuracy can be expressed positively (an accuracy rate) or negatively (an error rate): a 0% error rate is the same as 100% accuracy.

Measuring the accuracy of a prediction (another name for the output of HTR) is commonly done at the character level. The character accuracy of a model is the proportion of characters in the prediction that match the ideal transcription. Its negative counterpart, the character error rate (CER), is a very common measure of a model's theoretical efficiency.

Some software also reports the word error rate (WER), which is the proportion of words in the prediction that contain errors. A high WER doesn't necessarily mean that the transcription is bad; it may only mean that the errors are spread across many words. I never use WER alone because it is hard to get an accurate impression of the quality of the prediction from that metric alone.
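To make the two metrics concrete, here is a minimal sketch of how CER and WER are typically computed from an edit (Levenshtein) distance, in plain Python with no dependencies; real evaluation tools add text normalization, alignment options and so on. Note how a single wrong character barely moves the CER of a line but immediately inflates its WER.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (of characters or of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(reference, prediction):
    """Character error rate: edits needed per reference character."""
    return levenshtein(reference, prediction) / len(reference)

def wer(reference, prediction):
    """Word error rate: edits needed per reference word."""
    ref_words, pred_words = reference.split(), prediction.split()
    return levenshtein(ref_words, pred_words) / len(ref_words)

reference = "the quick brown fox"
prediction = "the quick brawn fox"
print(f"CER: {cer(reference, prediction):.2%}")  # 1 substitution / 19 characters = 5.26%
print(f"WER: {wer(reference, prediction):.2%}")  # 1 wrong word / 4 words = 25.00%
```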

In a paper by Neudecker et al. (2021), the authors test five different tools used to evaluate predictions. They also develop an interesting reflection on alternative metrics such as the "non-stopword accuracy", the "phrase accuracy", the "flexible character accuracy" (which is useful when the line order isn't always the same), the "figure of merit" (which "aims to quantify the effort required for manual post-correction" (p. 15)) and the "unordered WER".

When your score is a rate, there is an implicit idea that 100% is both the maximum score and the targeted score (for accuracy, of course). But in the case of HTR, 100% accuracy is extremely rare because there are also edge cases where the way a letter was drawn is ambiguous: in such cases the error is not so much caused by the inaccuracy of the HTR engine as by the imperfection of the handwriting in the first place.

In Hodel et al. (2021), the authors provide a grid for interpreting accuracy scores. They suggest the following three thresholds:

  • CER < 10% == good (it allows efficient post-processing)
  • CER < 5% == very good (errors are usually focused on rare or unknown words)
  • CER < 2.5% == excellent (but it is usually only reached when the handwriting is very regular)

Personally, I think this grid should also include 20% and 0%. 20% as a threshold because, at 80% accuracy, the transcription is supposedly good enough for fuzzy search and keyword spotting (I should add a reference here, but I can't find it anymore...); and 0% because it is worth remembering that an accuracy of 100% is virtually impossible.
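As a small illustration, the grid (extended with the extra 20% threshold suggested above) could be turned into a helper function; the cut-offs and labels simply mirror the discussion here and are not a standard of any kind.

```python
def interpret_cer(cer: float) -> str:
    """Map a character error rate (0.0-1.0) to the rough quality labels
    discussed above (Hodel et al. 2021 thresholds plus the 20% one)."""
    if cer < 0.025:
        return "excellent (usually only reached with very regular handwriting)"
    if cer < 0.05:
        return "very good (errors focused on rare or unknown words)"
    if cer < 0.10:
        return "good (efficient post-processing is possible)"
    if cer < 0.20:
        return "usable for fuzzy search and keyword spotting"
    return "below the thresholds discussed here"

print(interpret_cer(0.04))  # -> "very good (errors focused on rare or unknown words)"
```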

To complement this, I would like to mention another possible approach to obtaining an interpretable score: during the DH2023 conference, Thibault Clérice and I presented an experiment where we trained a model using the same data in the train set and the test set. Our model reached an accuracy close to 90%, which we were able to use as a baseline defining the highest accuracy score achievable for the data we had. Thus, we were able to consider that a model approaching 90% accuracy would be an excellent model as far as that dataset was concerned.

Still during the DH2023 conference, Wouter Haverals introduced CERberus 🐶🐶🐶, a web interface which addresses the same type of issue as KaMI: the lack of nuance in a plain CER computation. Indeed, in a CER score, every type of error has the same weight. This means that mistaking an "e" for an "é" costs the same as mistaking an "e" for a "0": in the first case the text is likely still readable or understandable, whereas in the latter case it might not be.

The CER metric is still very useful, but when applied to transcription projects, it is even more valuable when we can filter the types of errors we want to include in the evaluation.
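I have no insight into how CERberus or KaMI implement this internally, but the general idea of weighting or filtering error types can be sketched as a Levenshtein distance with custom substitution costs, for instance counting a diacritic-only confusion as cheaper than a confusion between unrelated characters:

```python
import unicodedata

def strip_diacritics(char: str) -> str:
    """Return the base character without combining marks ('é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    # Hypothetical weighting: a diacritic-only confusion ('e' vs 'é')
    # costs half as much as a confusion between unrelated characters.
    if strip_diacritics(a) == strip_diacritics(b):
        return 0.5
    return 1.0

def weighted_cer(reference: str, prediction: str) -> float:
    """CER where substitutions are weighted by substitution_cost."""
    prev = [float(j) for j in range(len(prediction) + 1)]
    for i, r in enumerate(reference, start=1):
        curr = [float(i)]
        for j, h in enumerate(prediction, start=1):
            curr.append(min(
                prev[j] + 1.0,      # deletion
                curr[j - 1] + 1.0,  # insertion
                prev[j - 1] + substitution_cost(r, h),
            ))
        prev = curr
    return prev[-1] / len(reference)

print(weighted_cer("café", "cafe"))  # 0.125 instead of 0.25 with a plain CER
print(weighted_cer("café", "caf0"))  # 0.25: an unrelated character keeps its full cost
```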

EDIT: I should have noted here that my reflection was focused on the evaluation of an automatic transcription in cases where you already have the expected transcription. When we apply an HTR model to a whole new set of documents, we usually don't have the correct transcription at hand (otherwise we wouldn't use HTR in the first place). This is the reason why many researchers try to find ways to evaluate the quality of the transcription without ground truth. One example can be found in Clérice (2022).

So, to go back to our initial problem, we can see that there are many ways to draw the line between a good job and a bad one. The threshold will depend on the metric used to express the accuracy of the prediction and also (and actually mostly) on the way the generated text will be used down the line. Even though the software presentation I attended was short, I think we should always remind future users of HTR that 100% accuracy is not always what they are seeking.

A short reflection to finish this post: I was bothered by the expression used to qualify the transcription, and I am still trying to figure out how to put that discomfort into words. On top of lacking precision, the expression "it did a good job" also promotes a vision of HTR as a magic tool at the service of researchers and students. But in which other cases do you say that someone did "a good job"? Likely when you delegate a task to a subordinate.

I see a problem here: in their current state, HTR engines are efficient, but not to the point that people can use them without thinking clearly about what they want the engine to produce. It is easy to sell software by pretending that it is a magic servant that will do all the transcription in your place, a tool so smart that you can even consider delegating part of your responsibility to it. But when new users of HTR fail to first reflect on the outcome they can reasonably expect from these engines, I think it creates disappointment, crappy data and crappy workflows.

011 - Working with synthetic data

What we call synthetic data are data generated artificially, as opposed to data taken from real-life samples. In the case of automatic transcription or layout analysis, it corresponds to creating fake documents or samples of text that look more or less like real ones, instead of manually annotating existing documents.

One of the main advantages of using synthetic data rather than real data is the fact that it comes already annotated. For automatic transcription, for example, the annotation (the transcription) is simply the string of text passed to a text-image generator. If you add the fact that you can, in theory, generate an unlimited number of image/transcription pairs, it represents an incredible opportunity to accelerate the production of training datasets. An example: Doush et al. (2018) use this technique to generate PDFs containing contemporary printed Arabic text. The PDFs are printed, then re-scanned and aligned with the transcription that was used to generate them. The result is the Yarmouk dataset. As we will see later, generating fake handwritten text is a bit more difficult.
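For printed text, the principle is easy to sketch: render a known string onto an image and keep the string as the label, so the pair is annotated by construction. The minimal Pillow example below is only an illustration of that idea, not the Yarmouk pipeline (which involves printing and re-scanning); the font path and file names are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

def make_synthetic_line(text: str, font_path: str, out_path: str) -> None:
    """Render `text` on a white background and save it; the transcription
    is simply the text we started from."""
    font = ImageFont.truetype(font_path, size=48)  # font_path is hypothetical
    # Measure the rendered text to size the image around it.
    left, top, right, bottom = font.getbbox(text)
    image = Image.new("L", (right + 40, bottom + 40), color=255)
    draw = ImageDraw.Draw(image)
    draw.text((20, 20), text, font=font, fill=0)
    image.save(out_path)
    # Save the matching "annotation" next to the image.
    with open(out_path.replace(".png", ".txt"), "w", encoding="utf-8") as f:
        f.write(text)

make_synthetic_line("Call me Ishmael.", "SomeHandwritingFont.ttf", "line_0001.png")
```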

Another advantage of this technique is that it offers an efficient way around the limitations posed by sensitive or confidential data (Hu et al. 2014). However, let's note that confidentiality is rarely a problem when it comes to training HTR models on historical documents.

Generating fake data is not specific to computer vision (Raghunathan, 2021), even though it is frequently used there because data for computer vision tasks are costly to produce. In general, it is a fairly common method wherever machine learning techniques are involved, regardless of the field of application (Kataoka et al., 2022). OCR and HTR tasks are no exception, and we can find traces of such experiments rather early (Beusekom et al., 2008).

The first time I was exposed to the notion of synthetic data was during an informal conversation with Tom Monnier in 2019. At that time, he was working on docExtractor, a layout analysis tool that he trained with images of documents generated artificially.1

Then sometime in 2021, while browsing through HuggingFace's Spaces, I found ntt123's application that simulates handwriting. The application takes a text prompt as input and generates an animation in which the letters are traced on the page as if someone were writing them live. It is possible to play with two parameters: a value between 0 and 250 determining the writing style, and a weight determining the likelihood of the traced letters (the lower the weight, the higher the risk of hallucinated letters; the higher the weight, the more standardized the tracing). It made me think back to my conversation with Tom Monnier and I wondered if it could be used to generate pairs of text and images.

At the beginning of the year, I dedicated a good part of my time to testing data generation tools I could find online, to see if they could be used to create a set of fake ground truth that I would use later, in other experiments. I will introduce the latter in a future post, so let's first focus on handwritten data generation.

When I dug a bit more around ntt123's application, I was confronted with two things:

  1. unfortunately, ntt123's application was developed in JavaScript and not documented at all, which made it impossible for me to hack,
  2. but luckily, it wasn't an original idea: instead it was one of many implementations of a proposition introduced by Alex Graves in 2014.

Alex Graves uses online2 data from the IAM database (Liwicki & Bunke, 2005) and an LSTM (Long Short-Term Memory) network to train a model capable of generating series of coordinates that trace letters and words. Initially, the model simply generates random series of letters and words, but it is then improved to take into account a text prompt, which forces the model to generate a specific series of letters. As described before, the model also takes a weight (or bias) which normalizes the likelihood of the letters' shapes, and it can take a "priming line": the image of a handwritten line whose writing style the model will try to copy. Once the coordinates are generated (including key information such as end-of-stroke points), it is easy to place them in an SVG file and visualize the result, with or without animation.
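To give an idea of what "placing the coordinates in an SVG file" means, here is a small sketch that turns a sequence of (x, y, end_of_stroke) points, the kind of output such a model produces, into an SVG path; the sample points are made up, and real generations contain many more points per letter.

```python
def strokes_to_svg(points, width=800, height=200):
    """Turn a list of (x, y, end_of_stroke) tuples into an SVG string.
    A new sub-path ("M x y") is started after each end-of-stroke flag,
    so the pen is lifted between strokes."""
    path_data, pen_down = [], False
    for x, y, end_of_stroke in points:
        command = "L" if pen_down else "M"
        path_data.append(f"{command} {x:.1f} {y:.1f}")
        pen_down = not end_of_stroke  # lift the pen after an end-of-stroke point
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<path d="{" ".join(path_data)}" fill="none" stroke="black" stroke-width="2"/>'
        "</svg>"
    )

# Made-up points tracing two short strokes.
sample = [(10, 50, 0), (30, 20, 0), (50, 50, 1), (70, 50, 0), (90, 20, 1)]
with open("line.svg", "w", encoding="utf-8") as f:
    f.write(strokes_to_svg(sample))
```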

There are many, many implementations of Alex Graves's experiment, because it was such an important publication for demonstrating the usefulness of LSTM models. Several can be found on Github if you search for "Alex Graves". For my experiment, I didn't want to develop my own adaptation of such a model, but rather to use programs that were ready to be used. This is the reason why I didn't look for papers but instead for recent (or recently updated) repositories on Github. I focused on Python programs because I wanted to be able to understand how they were developed.

One very promising implementation of Alex Graves's proposition was Evgenii Dolotov's pytorch-handwriting-synthesis-toolkit. It came with pre-trained models and utility scripts to feed the program a text prompt and generate an image. I customized3 the program a bit to fix a few bugs and to try to make it generate several lines while maintaining the same handwriting.

Lines stating (or supposed to state) 'did a computer write this', generated by Evgenii Dolotov's program. The fourth line is a failed attempt in which several letters such as y, n and m can be distinguished. The fifth line states 'determined to act upon the assumptions' but contains several garbled letters.

Even though the generated images were sometimes impressively realistic, the program also produced a lot of bad output. As Alex Graves himself notes, his solution tends to generate what he calls "garbled letters", letters that no human would likely trace. In other cases, it would randomly skip some letters and was completely incapable of tracing some numbers or punctuation signs. Sometimes, the model would simply draw more or less flat lines. Since I wanted to generate fake gold data that I could trust and since the results were not reliable enough, I played with the bias and the priming lines before trying to train new models using Evgenii Dolotov's utility scripts. I failed to get better results than the pre-trained models, and failed to find the correct parameters to make sure I would always obtain realistic output.

several flat lines that at one point successfully write 'is fin'. This is a failed generated image.

At this point I started exploring GANs (Generative Adversarial Networks), which are models based on game theory. They are capable of generating realistic fake images by learning from samples of real images (see Goodfellow et al., 2020). This is the kind of model used to generate photos of people who don't exist. There are Github repositories offering source code to train such models to generate fake handwriting, such as GANwriting (described in Kang et al., 2020) or Amazon's ScrabbleGAN (introduced in Fogel et al., 2020), but they only gave instructions to reproduce the corresponding papers and train the models oneself. Since GANs are costly to train, I left this option aside for the moment, even though I do think they can become an interesting solution in the future.

Eventually, I settled for a solution based on a Diffusion model. This type of model can be found behind applications like OpenAI's DALL-E. Luhman & Luhman (2020), who created the Diffusion Handwriting Generation (later called DHG), explain very well how diffusion models work.

"Diffusion probabilistic models [...] convert a known distribution (e.g. Gaussian) into a more complex data distribution. A diffusion process converts the data distribution into a simple distribution by iteratively adding Gaussian noise to the data, and the generative model learns to reverse this diffusion process." (Luhman & Luhman, 2020, p. 1)

A great advantage of DHG compared to the LSTM approach was that it was possible to easily fix the priming line and almost always obtain a convincing output. This was essential to create a dataset with consistent handwriting over hundreds of lines. As visible in the following image, even if the diffusion model is not capable of perfectly imitating the handwriting contained in the priming line, it usually captures elements of style such as the slant or the cursive nature of the text.

five pairs of priming lines with the resulting generated lines.

After several tests, I found that the third priming line gave the best results when associated with different text prompts, so I decided to use it along with excerpts from Moby Dick to create a completely artificially generated dataset. In a few days, I created more than 8,000 images (PNG) associated with a text file (TXT) containing the prompts used to generate them.

These pairs could have been used "as is" to produce a silver synthetic dataset but, like I said before, I needed a gold dataset where the text and the images would be exact matches. Unfortunately, more than a third of the images did not qualify as gold. After manually reviewing about 2,500 of the lines (with the help of my colleague Hugo Scheithauer), we published a set of 1,280 pairs of lines and text under the name "Spinnerbait".

Even though I was able to produce a dataset meeting my main criteria, I was actually disappointed with my results: I wanted a sort of magic button which would allow me to generate, at any time and without having to review it, a perfect set of training data. Instead, in the future, if I want to add more lines to Spinnerbait, I will have to spend a few hours going through each line to filter the bad ones out.

On the other hand, I decided to take a few hours to manually copy a text taken from Guillaume Apollinaire's poems. I copied the text following a TXT file that I edited every time I started a new line. I then scanned the pages, segmented them with eScriptorium, copied and pasted the lines from the TXT file, and exported the result as a series of ALTO XML files and images. It gave birth to the Moonshines dataset, a set of 1,186 lines (including 170 dedicated to a fixed test subset) in a single hand, thus comparable in size to Spinnerbait.
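For reference, pulling the line transcriptions back out of ALTO files such as the ones eScriptorium exports only takes a few lines of Python; here is a sketch using the standard library (the file name is made up, and the wildcard namespaces keep it independent of the exact ALTO version).

```python
import xml.etree.ElementTree as ET

def alto_lines(alto_path: str) -> list[str]:
    """Return the text of each line in an ALTO file, concatenating the
    CONTENT attributes of the String elements inside every TextLine."""
    root = ET.parse(alto_path).getroot()
    lines = []
    for text_line in root.iterfind(".//{*}TextLine"):
        words = [s.attrib.get("CONTENT", "") for s in text_line.iterfind(".//{*}String")]
        lines.append(" ".join(w for w in words if w))
    return lines

for line in alto_lines("moonshines_page_01.xml"):  # hypothetical file name
    print(line)
```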

I think generating both datasets took about the same amount of time, if I take into account on the one hand reviewing the generated lines and on the other hand copying the text and passing it through eScriptorium. Moonshines used fewer computing resources and produced a richer dataset in terms of the appearance of the text. Also, the length of the lines is more varied in Moonshines, whereas it is more homogeneous (max 5 words) in Spinnerbait, because the generator tended to make more errors on longer prompts.

a line taken from the Spinnerbait dataset and a line taken from the Moonshines dataset

Another important limitation that I have barely addressed at this point is that not only do these tools fail to draw non-ASCII characters, but they also tend to have a greater chance of producing garbled letters when prompted with rare or non-English words4. This is true of all the systems I have tested. Of course, we could imagine training new models on data containing a greater diversity of scripts and languages.
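As mentioned in note 3 below, a simple workaround on the prompt side is to filter it against the character set a given model supports; a minimal sketch (the charset shown is just an example, not the one shipped with any particular model):

```python
def filter_prompt(prompt: str, charset: set[str], replacement: str = "") -> str:
    """Drop (or replace) any character the model's charset does not cover."""
    return "".join(c if c in charset else replacement for c in prompt)

# Example charset: lowercase ASCII letters, digits and a little punctuation.
charset = set("abcdefghijklmnopqrstuvwxyz0123456789 .,'-")
print(filter_prompt("café & naïveté!", charset))  # -> "caf  navet"
```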

By way of conclusion, I would say that even though I was disappointed with what I obtained in the end, this exploratory adventure was very interesting. I learned a lot and I am convinced that if I had more time and resources (and if it were more crucial for me), I would have found a way to get better results. I know of some upcoming publications that used GANs to create artificial data that look like lines taken from historical documents and I really look forward to reading them.


  1. It is possible to find here a recording of the talk given on this tool at ICFHR 2020. 

  2. In the context of handwritten text recognition, a distinction is made between "online" data and "offline" data. Offline data are based on a matrix of pixels containing the image of a text (they are static), whereas online data are vectors containing information about the speed, the points through which a line passes to form a letter, end of stroke points, etc. Online HTR uses data generated with an e-pen and a screen while offline HTR uses images created with a scanner or a camera. 

  3. One of the customizations consisted in removing non-ASCII characters or characters not supported by the model. It was easy to apply this transformation because in pytorch-handwriting-synthesis-toolkit, each model comes with a little metadata file which contains the character set handled by the model. 

  4. All the models were trained using the IAM database, more often the "online" database, but sometimes also with the "offline" version. 

010 - Make and Read the docs

During my last contract as a research engineer at Inria, I spent a lot of my time working on the project called LECTAUREP, in collaboration with the National Archives in France. The goal of this project was to explore new ways to index the content of thousands upon thousands of notary registers which, put together, form one of the most used collections of the National Archives. I joined the project at the end of 2019, during its second phase, almost at the same time as eScriptorium was initiated. LECTAUREP had worked with Transkribus during the first phase (in 2018) but, given the connections between my research team and the team behind eScriptorium, we quickly switched to the newer software and contributed to its development.

One of my most important contributions was the writing of a tutorial for the software, which was initially only intended as an internal resource for our team of annotators. This is the reason why the tutorial was published on LECTAUREP's blog. OpenITI, and in particular Jonathan Allen, rapidly offered an English translation which, eventually, was also published on LECTAUREP's blog. Since its publication, this translation has been listed on eScriptorium's home page as the official tutorial.

Unfortunately, the tutorial hasn't been updated in a long time whereas major updates and new features have been added on eScriptorium's side.

LECTAUREP's blog is not a good solution. It is built with WordPress and hosted by Hypotheses, which is very convenient for allowing a small, well-defined group of people to work collaboratively on a research blog, but it is too heavy and not suited to publishing the documentation of software like eScriptorium. The documentation needs to be updated frequently to keep up with the software, and, in general, a blog is not the place to publish extensive software documentation. To top it all off, it is not even that easy for me to update, so can you imagine someone outside of LECTAUREP trying to offer an update?

I have been thinking of finding a better solution since at least 2020, but it was never so urgent that I was able to put it at the top of my to-do lists. Last summer, I took advantage of a rather slow couple of weeks in August, when everyone but me seemed to have gone on vacation, to put something different in place.

Readthedocs quickly appeared to me as an ideal solution: the platform is designed for publishing software documentation, and it handles software versions and multilingual content. Last but not least, it relies on static website generators. This is fundamental because it allows the source of the documentation to be published on a platform like Github, which Readthedocs then uses to build the website.

Github is a platform designed for sharing code and opening it to external contributors. Relying on it solves a major issue with the current tutorial: if anyone can suggest corrections, edits or translations of eScriptorium's documentation, then it is more likely to keep up with the evolution of the application!

In August, I created a new Github repository called escriptorium-documentation. I set up a basic configuration and connected it to Readthedocs. As soon as this was done, the website became available online at a URL based on the following structure: {gh_repo_name}.readthedocs.io. Then, I started rewriting the content of the tutorial... following Sphinx's syntax.

It was so painful that I never got back to it after I came back from my own vacations.

Why painful? Well, I had discovered Markdown in 2017 and have used it ever since. It's so powerful and yet so light! In comparison, Sphinx's syntax (reStructuredText) felt complicated and heavy. Not as heavy as HTML, but less intuitive nonetheless. I had to go through the documentation every time I wanted to add something as simple as a hyperlink or an image!
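To give an idea of what I mean, here is the same hyperlink and image written in the reStructuredText syntax used by Sphinx and then in Markdown; the URL and the image path are placeholders.

```
reStructuredText (Sphinx):
    `eScriptorium tutorial <https://example.org/tutorial>`_

    .. image:: images/segmentation.png
       :alt: Segmentation step

Markdown:
    [eScriptorium tutorial](https://example.org/tutorial)

    ![Segmentation step](images/segmentation.png)
```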

In January, when I gathered enough motivation1 to go back to working on eScriptorium's tutorial, I decided to look for an alternative to the Sphinx-based compilers.

The only non-Sphinx-based option available with Readthedocs is Mkdocs. As its name hints, Mkdocs is a Markdown compiler capable of quickly building websites. The setup is really quick, it's well documented and fairly easy to customize, and it's possible to add a lot of cool Python-based extensions. It was the bomb!
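For context, the whole configuration of such a site is a single YAML file at the root of the repository. The sketch below is only indicative (the page names are invented), not the actual configuration of escriptorium-documentation.

```yaml
# mkdocs.yml -- indicative example, not the project's real configuration
site_name: eScriptorium documentation
theme:
  name: readthedocs
nav:
  - Home: index.md
  - Import documents: import.md
  - Segment and transcribe: transcribe.md
markdown_extensions:
  - admonition
  - toc:
      permalink: true
```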

I liked Mkdocs so much that I also used it to rebuild my personal website!2

Over the past month, I have spent a lot of time working on this new tutorial for eScriptorium. I designed a basic structure, breaking down the features into different categories. The pages are now progressively being filled, and I am very happy to have been joined in my efforts by my colleagues Hugo Scheithauer and Floriane Chiffoleau. As we progressively merge the content of new pages into the main branch, the escriptorium-documentation website expands. It will soon be ready for an official release!

I really hope that the transparency and simplicity brought by Mkdocs and Markdown will allow many people to add their contributions to the documentation of eScriptorium! Who knows, maybe you will too!

EDIT: we changed the name of the repository to escriptorium-documentation instead of escriptorium-tutorial (all links and mentions have been changed in this post). The decision was motivated by the fact that "tutorial" felt like an inexact description of the actual scope/ambition of the project.


  1. Also when I got more free time after my classes were over

  2. It is not necessary to use readthedocs to deploy a website built with Mkdocs. In the case of the tutorial, it simply allows us to have a domain name more meaningful than ".github.io".