Skip to main content

017 - Deploying eScriptorium online: notes on CREMMA's server specifications

eScriptorium is a web application designed to perform automatic text recognition campaigns, by default powered by the OCR/HTR engine Kraken. It comes in a decentralized form, meaning that the application is not distributed by a single organization but can, on the contrary, be deployed by several actors on many different servers. In fact, you can also deploy eScriptorium on your personal machine, simulating a local server.1

As eScriptorium is gaining attention, more institutions are interested in building their own server to host the application and offer it to their associates. At Inria, we deployed eScriptorium for the first time in 2020, specifically for the project called LECTAUREP which we ran with the French national archives between 2018 and 2021. While the initial server was hosted on a virtual machine, without any GPU, and open to a relatively small amount of users, our current eScriptorium application already counts nearly 500 users and will soon be hosted on a much different server infrastructure, funded by the CREMMA project. Between the original LECTAUREP-eScriptorium server and the CREMMA server, we moved to a dedicated server (Traces-6) for which we invested about 20K€.

Since I have been regularly in touch with people from different institutions who were looking into buying the hardware to create their own server for eScriptorium, I thought it was largely time to put all the deets in writing!

To write today's post, I'm very happy to welcome a second pair of hands: Thibault Clérice's. His expertise and involvement in designing CREMMA server are crucial here!

Let's first discuss some technical requirements, then we'll describe how the CREMMA server was designed. We finish with some very important remarks on the necessity (or not) to build a server and on useful alternatives for the community!

Should you buy GPUs?

GPUs (or Graphics Processing Units) are not mandatory at all when you use eScriptorium. This is the reason why it is perfectly acceptable to run eScriptorium locally, on your own computer. Actually GPUs are not even mandatory to train Kraken models: training can be done on CPUs (your computer's processor), they will simply go much much much slower.

That, however, is true for personal or light use of the training features. If on the contrary you create a server open to dozens of users or more, then connecting eScriptorium to GPUs is very much a good idea: since training a model on a CPU alone can take 2-3 days (or much more), you don't really want 10 users to start a training task at the same time. In the absence of shared GPUs, their training will be queued for days or even weeks and the overload might degrade the experience of other users on the rest of the application. As long as we are building an infrastructure (and hopefully sharing costs), we may as well enhance the experience of everyone, no?

This being said, you shouldn't rush and go buy a GPU right away. Instead, you should first look at options to optimize its usage or at infrastructures that are already available to you. For example, the FONDuE infrastructure, at the University of Geneva, doesn't use the GPUs only for eScriptorium: they connect their application to a cluster which is used by researchers for intense computation tasks outside of eScriptorium (it's an HPC with a university-wide queue controlled by SLURM). This is a very good solution for optimization, because training Kraken models is not a constant activity: if the GPU is dedicated to eScriptorium only, then it will be used for a few hours here and there, not even at 100% of its capacity. Think of it: users of the application will usually need to train a model at the beginning of their transcription campaign, therefore once they have an accurate model, they will focus on using the model for prediction, which doesn't rely on the GPUs (and Kraken isn't really optimized for GPU usage at prediction time anyway).

Other possibilities include connecting the server to a completely physically separate cluster where training jobs are submitted. This is a possibility that several people told me they were exploring, but I don't know if anyone has set it already. Why would you opt for a solution with an external cluster? To replace some huge investment costs (original funding) with some smaller (but much more regular) functioning costs: for example, for CREMMA, nearly half of our 40K€ budget was spent, in 2022, on buying two A100 graphic cards from Nvidia. When using someone else's GPUs, not only you save the money you would spend on the hardware, but on top of that, you contribute to optimizing the use of other GPUs already in place. Another reason is because you might not have the human resources to administer the system and the GPUs. There are multiple calculation clusters created for Academia (of the top of our head: Jean Zay or Calcul Québec), and you could even consider using commercial solutions as well (like AWS, Google Cloud and the like). Then, your money is spent on the actual computation and not on making the computation possible in the first place.

Fair enough, plugging eScriptorium's task manager to an external server might not be that simple. However, for smaller groups of users, it is also worth taking into account that it is perfectly possible to train Kraken models using Kraken directly (through an SSH connection to a (super-)cluster, for example) before uploading them into the application. In such a case, eScriptorium is only used for its ergonomics, not as a simplified interface to train models.

Let's summarize the point here: GPUs are not always a must-have for eScriptorium or Kraken, so you should definitely consider first and foremost your future usage. They currently represent the biggest share in the hardware expenses to build a calculation server. There are options out there where you don't spend 10K€ to buy a GPU but rather connect to an external, ready-to-use service. Or, if you do decide to spend the money, you should consider ways to maximize its usage for other training tasks, possibly outside of eScriptorium.

Some considerations on storage

Normally, eScriptorium is used as an (assisted) annotation environment to obtain the transcription of documents. You would use eScriptorium:

  1. In a preparatory phase:
    • (1a) to produce training data, and
    • (1b) to elaborate (aka train) performant segmentation or transcription models;
  2. In a production phase, but only for relatively small corpora, to apply segmentation and transcription models and manually correct the results (in which case the size of the corpora must be compatible with the scale of what an individual or your assembled team can process);
  3. In a post-production phase, including for samples of a very large corpus, to easily visualize and control the result of the (large-scale) automatic prediction and potentially correct it (cf. n°2).

On the other hand, large scale transcription campaigns should probably be led with Kraken in the command line directly (so only n°1 and n°3 necessitate eScriptorium). Thibault has even produced a small python library to design such campaigns (RTK, for Release the Krakens) which was recently used in a paper2 where a 38.5M token corpus was produced. In some cases, n°1b even benefits from being performed outside of eScriptorium, since the application offers a very limited control over Kraken's training parameters.

This has several consequences on the way you should consider storage on a server dedicated to eScriptorium. Duplicates of images are created on the server while they are being processed in the application, but they should always be considered as such: temporary duplicates while phase 1, 2 or 3 are under progress. They shouldn't be considered as if eScriptorium was 1) an archiving solution for transcription projects, 2) a querying interface to explore a corpus or even 3) a publication environment for a minimalistic digital edition. eScriptorium is only one brick --an early one even-- in the corresponding pipelines. Instead, the original image files should be stored somewhere else, in an adapted data warehouse (like Zenodo, Nakala, etc.), or published in digital libraries under the responsibility of their owner (like Internet Archive, Gallica, etc.).

What this means when designing a server to host eScriptorium is that its storage capacity should of course be big enough to store the temporary image files,3 while users are working on their annotation, aka the active projects. However, this storage doesn't need to be expended all the time and it should also be ok to flush the terminated projects: at that point the images and their annotations should have been archived on more appropriate data warehouses by their creators, and it should be their responsibility.

Don't forget the RAM!

Not overlooking the RAM is very important when designing your server! But what is it used for? It's used for cache by the web application: it means that frequently accessed data, like web pages and images but also the content of the database, are temporarily loaded in live memory. Cache thus ensures that the requests sent by the users are served quickly. For example, if you don't have enough RAM (or enough cache), pages will load slowly, and if you have used eScriptorium before reading this post, you know how important it is to be able to load images fast enough.

RAM is also essential for inference and training because images and annotations are loaded in memory before being passed to the CPU or the GPU. If the RAM is not powerful enough, it will be detrimental to computation and will cause a bottleneck situation. Thus having invested in GPUs and/or CPUs but not in enough RAM would be like having a horse to pull a Ferrari: even if prediction and training could go fast on the processing units, it will be restrained by the available live memory.

Modularity for the CREMMA infrastructure

The CREMMA infrastructure was originally designed by Thibault with a simple but essential principle in mind: modularity. Instead of thinking of an eScriptorium server as a monolithic block of hardware designed for front-end service, storage and intense computation, he suggested to break each of these blocks into individual servers connected together. CREMMA4 is thus made of at least three servers, as shown in the schema below:

  • CREMMA_FRONTEND, for the front-end, where the application is deployed and where the database is stored.
  • CREMMA_STORAGE, for storage, where all the images and models, as well as the backup of the database are stored on the long term. Currently, CREMMA_STORAGE has a storage capacity of 38Tb5 but we could easily add more disks if we find that it is necessary.
  • CREMMA_COMPUTE, where the two A100 GPUs I mentioned earlier are plugged and where the application task manager "sends" all the jobs, whether they are to be run on CPU (these tasks include segmentation and transcription prediction for example), or on GPU (training for the most part).

A model of the CREMMA infrastructure where three blocks (front-end, storage and compute) are connected together through an intranet 10Gb/s connection. For each block, one or two server(s) is presented along with their specification. Credits: Thibault Clérice and Alix Chagué. The full text of the specifications is accessible in a commentary in the source code of this page, just after this image.

As you can see on the schema, there will actually be a fourth server involved in the infrastructure: Traces-6, the server we currently use to deploy eScriptorium at Inria. Like CREMMA_COMPUTE, Traces-6 can be called by CREMMA_FRONTEND for computation tasks. In fact, this is where the modularity of the system is interesting: with such a set-up, it is possible to add more computation servers to the pool of GPUs reachable by CREMMA_FRONTEND without having to redesign the whole infrastructure. On their side, CREMMA_FRONTEND and CREMMA_STORAGE can be upgraded (to add more RAM or more storage) very easily.

This modularity also means that the GPUs remain free for other uses: for example if we were to have to run maintenances on CREMMA_COMPUTE, we can simply cut it from the infrastructure, and let CREMMA_FRONTEND interact with Traces-6 only while we work on CREMMA_COMPUTE.

CREMMA_COMPUTE is equipped with two A100 graphic cards, and Traces-6 with two RTX 6000. Actually, it doesn't mean that only 4 training can be happening at once. Each of these GPUs offer between 24 and 40 Gb of RAM for intense computation. It's a lot. It's so much actually that training a Kraken model at max speed would rarely use more than 40% of this processing power. Virtualization is a nice trick to "break" the GPU down into smaller virtual GPUs (or vGPUs). What is broken down is the RAM capacity. We opted for the following virtualization set up:

  • Each of the A100 graphic cards and their 40Gb of RAM are turned into 1 10Gb vGPU + 5 5Gb vGPUs (since 10+5x5=35, note that we must leave 5Gb out of the equation for the virtualization).
  • No virtualization is applied to Traces-6's RTX6000s.

How did we decide on these numbers? Thibault ran a series of small tests executing either segtrain or train and playing with two different parameters: the batch size6 and the single point precision7. He found that for training a recognition model with a batch size of 8 and either 32 or 16 of precision, less than 5 Gb of RAM on the GPU is enough. With a batch size of 1 and a precision of 32, it's even less than 1 Gb. To train a segmentation model, less than 10Gb is enough, and this type of training is more rare. Since our goal for the infrastructure is not to maximize the speed of the training but to maximize the amount of possible parallel training jobs at decent speed, we decided that 10 vGPUs with 5Gb of RAM and 2 vGPUs with 10Gb of RAM were a good compromise. If we find that more GPU RAM is occasionally needed, we still have two times 24Gb with the RTX6000!

Should you build your own server?

We have spent all this time writing about how to build, how to spec out your server or your infrastructure, but let's talk about the elephant in the room: should you do it?

Well, it's all a matter of perspectives. We'd say it probably makes sense if:

  1. You are a very big organization, you have a lot of money available to you, a super-cluster (and possibly a well staffed IT services department), and you have a high demand;
  2. You are working on very sensitive data that can't be shared with the outside (e.g. medical reports);
  3. You are geographically far away from any other existing server, and face latency issues when you connect to potential welcoming servers;
  4. Servers that exist around you are reluctant to onboard you and the teams behind the request for a server of your own.

These four points are definitely valid. But we'd say that, if you are in another situation, sharing infrastructural costs probably makes way more sense. In our experience, building a server is long, tedious, require special (and rare) skills8 and costly (in terms of human resources as well!). Setting up a working server can take a really long time. For CREMMA, we ended up outsourcing part of the installation of the new infrastructure because we realized that we did not have the time nor the skills to set everything up ourselves. The cost of this installation by a third-party? Between 8 and 12K€, and again, a little time and bandwidth on our end.

Next you have the maintenance fees. You can outsource them, for a little bill from a company which would make sure that everything is installed on time, that updates work well, etc. Or you can do the maintenance yourself. But again, this comes with a cost: human time. A worker on the server goes down? You are in for a few hours. Some people crashed a third-party server by uploading too much IIIF images on your instance of eScriptorium? Well, then you will not only receive emails from these third parties (and this is completely normal), but also have to deal with your user base doing things that eScriptorium allows and that you may not (yet) be able to control/limit.

In the end, we would definitely recommend that, when this is possible, you first consider joining existing servers, including by offering quid pro quo by:

  1. Participating in covering the salary of people maintaining the server (through some kind of yearly fees for example);
  2. Providing some money to expand the existing infrastructure (to increase storage or computation, etc);
  3. In general, helping eScriptorium grow, discussing with the owners of the server you are joining and/or the eScriptorium team about what kind of new functionality should be added, and if you can contribute to fund these updates.

This final point is super important: sure, owning your own server sounds appealing, even if it is costly to put in place. However, developing eScriptorium also comes with expenses. Thus, participating in eScriptorium directly -- we think -- is also very beneficial and welcome by the developing team. Open-source is free to use, free of charge but is not appearing out of thin air: developing costs money. And the more people participate in infrastructural costs (servers or software), the better the experience will be.

  1. If you don't know anything about local servers and are curious to learn more, you can check this page: Or you can also take a look at the corresponding entry in Wikipedia! 

  2. The full reference is: Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice, et al.. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. Computational Humanities Research (CHR 2023), Dec 2023, Paris, France. ⟨hal-04250657⟩ 

  3. By temporary, we don't mean that the image file are stored for a few hours only, on the contrary, they can stay on the disk for many years. We mean that it should be ok to consider that they can be erased whenever a user is done working on a corpus and has moved away from the transcription phase. 

  4. From now on, "CREMMA" means the server created through the CREMMA project. 

  5. Safety first! We have 38 Tb available, but there is actually a little more physically because we have redundancy and spare. We have 2 series of disks working with redundancy (RaidZ). In each series two disks are entirely dedicated to redundancy only, and one more is completely unused until something fails (it is used as a safety spare disk). While CREMMA_STORAGE, as we said before, is not used as a permanent storage solution, it needs to be a little bit safe for the user base. 

  6. To understand what the batch size corresponds to and why it is important, you can check this entry in the Stack Exchange forum:

  7. To quote Kraken's documentation: "When using an Nvidia GPU, set the --precision option to 16 to use automatic mixed precision (AMP). This can provide significant speedup without any loss in accuracy." Kraken's default value for precision is 32. 

  8. It can be difficult to justify hiring a full-time or even part-time system administrator for a team because it is a very specialized and highly demanded type of profile. For example, public organizations can rarely offer competitive salaries compared to the private sector. In addition, the workload for administrating a web server can be irregular, and it can be difficult to make the skills for system administration meet with other needs faced by a team, complicating even more offering a meaningful full-time job. 

016 - Text Recognition, Large Models and Expectations

Since the boom around ChatGPT almost a year ago, I've heard several people wondering if "tools like ChatGPT" were more efficient than HTR models trained with Kraken and the like. The glimmer of hope in their eyes was most likely lit by their own struggle to set successful and/or efficient HTR campaigns with more traditional tools. The capacity of Large Language Models (LLMs) to reformulate a text1 or, more specifically, of Large Multimodal Models (LMMs) to generate text based on a visual input may indeed lead people to believe that HTR technologies built on CNNs are on the verge of being flipped upside-down.2

Annika Rockenberger recently conducted a series of small experiments on the matter and wrote an interesting blog post about it. Let's summarize it!

She signed up for a premium subscription (25$/mo) to be able to chat with GPT4, which allows users to upload images. Then she submitted printed or handwritten documents she would normally transcribe with Transkribus and assessed the results. She found that GPT4 was fairly good on ancient print (German Fraktur) and that it was even able to follow transcription guidelines if provided with an example. However on a letter bearing handwritten cursive, the model completely hallucinated the content and attempted a transcription in the wrong language. This didn't change when she provided more context on the document. Rockenberger concludes that there is a potential for using ChatGPT for HTR but that the capacity of scaling it up is completely unsure and that learning how to provide good prompts to get the appropriate results is a challenge. I would also add that in the end, Rockenberger paid 25$ to get 10 lines of raw text, whereas with software like Transkribus or eScriptorium, she would also get a standard structured output.

So, in other words, after reading Rockenberger's post, one can conclude that GPT4 (or, better, similar free and open source models) does have a potential for "quick and dirty-ish" OCR. However, I would argue that users tempted by this strategy might still miss an important point: even LMM-based tools will requires a little bit of organization and precision from the users. This, I find, often lacks in unsuccessful HTR campaigns. LMMs could generate a good output, but you will likely have to pay a counterpart one way or the other(s): with lower text recognition quality, with hallucinated text content, with impoverished non-structured output, with premium fees, etc.

Earlier this year, an article proposed by Liu et al. (2023), "On the Hidden Mystery of OCR in Large Multimodal Models", explored almost exactly the same topic but in a more comprehensive way. Their article presents an extensive survey of how well several Large Multimodal Models (LMMs) performed on "zero-shot" tasks.

Zero-shot refers to the act of requesting an output from an LLM or a LMM without training it for this task in particular. It is very similar to Rockenberger's first attempt with GPT4, when she uploaded the image of a printed document and asked for its transcription. In such a case, she relied on the capacity of the model to transfer its knowledge to the specific tasks of Text Recognition, on a specific type of documents (historical printed text).

Other terms are often associated with "zero-shot:" "one-shot" and "few-shot". One-shot is equivalent to Rockenberger's second attempt: when she showed GPT4 an example of the output she expected on the 10 first lines of the documents, and requested that the model copied her strategy to generate the transcription of the 10 next lines. Few-shot would mean showing several pages and several expected output to the model before asking for the transcription of a new document.3

The paper focused on currently available LMMs representing five different approaches for training LMMs:

They evaluated the models on 4 tasks: text recognition, text-based visual question answering, key information extraction and handwritten mathematical expression recognition. Here are a few examples of what these tasks entail, as illustrated in the original article (on the images, P stands for Prediction and GT for Ground Truth):

Task Example
Text Recognition Examples of failed Text Recognition
Visual Question Answering Examples of failed Visual Question Answering
*Key Information Extraction Examples of failed Key Information Extraction
Handwritten Mathematical Expression Recognition Examples of failed Handwritten Mathematical Expression Recognition

For each task, they used several datasets presenting different challenges. For each of these datasets and tasks, they retrieved the scores of the state-of-the-art (sota) for supervised methods and used them as a baseline. For example, for text recognition on the IAM dataset, the sota method of AttentionHTR4 reaches a word accuracy of 91.24%.5 In comparison, Liu et al provide the following scores for the tested LMM on this dataset:

test LMM Score on IAM
BLIP-2 OPT6.7b 38.00
BLIP-2 FlanT5XXL 40.50
OpenFlamingo 45.53
LLaVa 50.40
MiniGPT4 28.90
mPLUG-Owl 42.53
--------------- -----
Supervised SOTA 91.24

The illustrations provided by the article are all of failed attempts, but it corresponds to the overall impression conveyed by the results of the experiments. Indeed, compared to the state-of-the-art supervised methods, zero-shot tasks prompted to LMMs yield results largely outperformed, similar to what is visible in the case of text recognition on the IAM dataset. The only exception is BLIP-2 on a Text Recognition task on a dataset of artistic text (WordArt) which is more challenging. The authors consider that this is a sign that LMMs have a promising potential for visually complex texts.

A very important section of their paper is their remarks on the relationship between LMMs and semantics. Submitting non-word images to the LMMs, they find that the LMMs systematically over-correct the prediction and suggest real-words as an answer. Traditional text recognition approaches, on the other hand, are much less sensitive to the notion of likelihood for the words to recognize. Similarly, the need for semantics interferes with the LMMs' output, and they tend to more easily recognize common words and make up additional letters ("choco" is read as "chocolate"). Lastly, LMMs are insensitive to word length: they are unable to count how many letters are in the image of a word. These results are similar to what Rockenberger experienced with the handwritten letter: the model hallucinated words to compose a semantically plausible letter. But using the wrong date, the wrong names, and the wrong language.

Liu et al conclude their paper reminding us that they experimented with the capacities of the models in the context of zero-shot prompts, whereas there are already successful attempts at fine-tuning LLMs and LMMs on specialized tasks, such as medical prediction. In fact, I think there already exist such attempts in the context of HTR as well: it seems to be the ambition of a model like Transkribus' Text Titan, released at the beginning of the Summer. It is based on a Transformer coupled with an LLM. Unfortunately, I wasn't able to find more information on this model, aside from the community-oriented communications released by Transkribus on their website (here and here).

  1. In stead of a multimodal approach, Salvatore Spina explored the possibility to use a LLM-based tool like ChatGPT3 to post-process the result of HTR and correct the text. See: Spina, S. (2023). Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT (arXiv:2308.02044). arXiv. arXiv.2308.02044

  2. Multimodality is presented by some researchers of the Digital Humanities community as a real epistemological turn for the field. See for example: Smits, T., & Wevers, M. (2023). A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections. Digital Scholarship in the Humanities, fqad008. doi: 10.1093/llc/fqad008 ; or Impett, L., & Offert, F. (2023). There Is a Digital Art History (arXiv:2308.07464). arXiv. arXiv.2308.07464

  3. There are a few videos offering more or less detailed explanations on these expressions in the context of prompting an LLM. However, this is not specific to LLM, it is often used in the context of classification or NLP tasks for example. 

  4. Kass, D., & Vats, E. (2022). AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks (arXiv:2201.09390). arXiv. arXiv.2201.09390

  5. In this case, the WER is used as a baseline to compare different approaches. However, in general, it is not a good idea to only take into account Word accuracy to understand a model's performance in real life. This is something I discussed in this post. 

015 - Block post and comprehensive Exam

When I created this blog last year, I wanted to post regularly on it. Something like once a month or once every other month. I didn't want to put pressure on myself for writing, but I also wanted to make sure that this blog would be alive. I often have ideas for topics for a post. But then, when comes the time to write, I blank out. It's not exactly that I don't know where to start, it's just that I sometimes can't figure out what is the message I want to convey. Like, if I have to summarize my blog post in 2 lines, what's the take-away? I get stuck when I cannot find an answer,but maybe I shouldn't worry that much about it. It's my blog after all, and maybe the message will come by the time I'm done writing.

So, without further ado, let's dive in: I was super excited this Summer after passing my comprehensive exam. I really wanted to write a post about it. I had a really packed Spring and beginning of Summer between going back to Montreal, teaching a class there, attending a Summer school, going to the DH2023 conference in Austria where I presented a short paper, a long paper and organized a workshop (big up to Thibault who was by my sides through all these Austrian adventures). And all of it culminated with that comprehensive exam in the middle of August. I really wanted to share how that went.

But then, vacations, working on new deadlines, more vacations, more deadlines... And now it's already November and I don't know anymore what it was that I wanted to share about that exam. Aside from the fact that I passed it and that it's a pretty big milestone.

The comprehensive examination, which is called "Examen de synthèse" in French, is not something common in France. In France, we now have a sort of yearly evaluation called the "Comité de Suivi Individuel" (or CSI), which is not a scholar evaluation but more of a check-up with your supervisors and a committee1 in charge of making sure that everything is alright. The reason I bring it alongside the Examen de Synthèse is because I also had my first CSI this Summer (at the very end of June). In France, you have to have a positive evaluation from the CSI in order to enroll in a new year of doctoral studies. Each year. But, actually the CSI and the Examen de Synthèse are not really that comparable.

The Examen de Synthèse is a "real" examination and it happens only once during your doctoral curriculum. In my program at the University of Montréal, in 2023, it consisted in several phases.

First of all, there is a phase dedicated to the composition of the jury. I had the pleasure to be examined not only by my three supervisors (Laurent Romary, Emmanuel Chateau-Dutier and Michael Sinatra), but also by Marcello Vitali Rosati, from the University of Montréal, who acted as president, and Maxime Gohier from the University of Quebec in Rimouski. I must signal that my only regret is not to have been able to have a better gender parity in my jury. This is something I really hope to fix for my defense, but I will probably have other occasions to discuss this topic in the future.

So, once the jury is composed, and once a calendar has been agreed on (I think that was actually the most stressful part for me because of all the other things I had this Summer), a count down begins. First, I had to turn in three documents:

  • a 12-15 page-long essay on my research project;
  • a 30-reference long bibliography on the field of the Digital Humanities; and
  • a short presentation of a proposed "practical" analysis.

Then a week later, the jury sent a question.2 I was given 1 week (168h exactly) to think about this question and write a response in the form of a 10-15 page-long essay. The jury had between a week and two weeks to read the response before an oral examination took place (on Zoom).

The oral examination has some similarities with a PhD defense. It started with a 20 minute long presentation that I gave where I summarized my research project (10 minutes) and presented a technical analysis (10 minutes). I chose to focus my technical presentation on an experiment I have been conducting and on which I hope to communicate more in the near future. Then, after my presentation, there were two rounds of questions about my research project, my experiment or about the answer I formulated in my essay.3

I am very happy that such an examination exists in the North American program. It may seem like a lot of stress (and it is), but I found that it is also a very good milestone to progress a lot towards the formalization of a research project. The oral examination is a great opportunity to present a project to people who don't necessarily know what you have been up to before, and it's a really really great occasion to get feedback.

For example, the question that is sent by the jury, in the case of my program, is thought as a way to get you to think about a topic or a question that is either not tackled enough by your research proposal, or it's an invitation to consider new angles. You're not expected to turn in the perfect answer, of course, with barely a week to write it. But it forces you to form an opinion, explore possible hypotheses and may turn later into a whole chapter for your thesis.

The comprehensive exam is a pass/no pass type of examination. There is no grade and if you fail, you can take it a second time. Like I said before at the beginning of this post, I passed. Therefore, starting from Fall 2023, I am now able to enroll as a "en rédaction" student (writing status) which has several consequences. Some seem very symbolic: for example, in English, I can now call myself a PhD candidate instead of a PhD student. But others not so much: tuitions for this new status are much lower than when enrolling as a full-time student, dropping from 1,440$CA/trimester to 512$CA/trimester, and I believe this officially gives me the right to teach at graduate level.

The comprehensive exam also marks the end of the phase during which I had to take courses. Now, with this new status, I am invited to focus solely on the redaction of my thesis, which opens up a whole new chapter for my PhD curriculum.4

  1. I want to take this occasion to also thank Ariane Pinche and Joana Casenave, who were willing to be the members of my committee for the CSI, for their precious feedback! :) 

  2. The question was the following: "Dans votre projet de recherche apparaît une tension importante: celle entre la spécificité des besoins particuliers de chaque projet et la volonté -- et la nécessité -- de produire des approches généralisables, qui puissent être employées dans le cadre de plusieurs projets. En vous appuyant sur votre bibliographie, et en vous concentrant notamment sur le cas du HTR, pourriez-vous analyser cette tension en soulevant en particulier la question de la littératie demandée (notamment dans la gestion des données) pour pouvoir personnaliser des approches computationnelles aussi complexes que les technologies HTR?

  3. I want publish on my blog the documents I created for the comprehensive exam, but I need to find the best way to do it. I'll post an announcement when it will be available. 

  4. Thank you Jennifer for this wonderful pun! ;) 

014 - RT(F)M for the Peraire Experiment

Turns out, there is more to say on last week's experiments on the Peraire dataset! And I found out while I was working on a completely different dataset. Let me explain!

This morning, I helped my colleague train a Kraken transcription model for Greek manuscripts. They gave me the ground truth and I set and executed the training from the command line. It gave me an opportunity to try fine-tuning a model like CREMMA Medieval, in stead of only training from scratch. CREMMA Medieval was trained on manuscripts written in Latin, whereas the Greek manuscripts were written only, well, in Ancient Greek. I didn't want the resulting model to add Latin letters in the transcription when applied to other Greek documents, so I used Kraken's option to allow the model to forget previously learned characters and to force it to only remember the characters contained in the new training data. This option is called --resize (check the documentation here).

When I fine-tune a model, I usually follow Kraken's recommendations and keep both the previously learned characters and the new ones coming from the new set of ground truth. When this morning I checked what is the keyword to use to keep only the characters from the new dataset, I realized that I didn't correctly set the training on Peraire last week. I had set it to only keep the new characters!

Up until Kraken v. 4.3.10, --resize can take the keywords both or add. The ambiguity of these keywords has been discussed in the past, which is the reason why starting from Kraken v. 4.3.10, the keywords respectively become new or union.

Let's quote the manual:

There are two modes dealing with mismatching alphabets, add and both. add resizes the output layer and codec of the loaded model to include all characters in the new training set without removing any characters. both will make the resulting model an exact match with the new training set by both removing unused characters from the model and adding new ones.

I fell for this trap of ambiguity and used both instead of add, thinking both meant I was keep both character sets. (Again this is the very reason why the keywords were recently changed).

Side note: you should really read last week's post to fully understand the rest of this post!

At the end of my post last week, I wrote:

peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

But how could I be sure that it was not actually due to the fact that the model had unlearned some precious characters?

The only way to know, I thought, was to re-train the models! I used this opportunity to also train the models from scratch because I was curious to see how much noise/improvement was brought by the base model.

I tried 4 types of models and, like last week, used CERberus 🐶🐶🐶 to measure the character error rates on the predictions made on the test sets:

  1. Models trained "from scratch"
  2. A model not trained on any data coming from the Peraire dataset (aka Manu McFrench)
  3. Models obtained from finetuning Manu McFrench using the add resize mode
  4. Models obtained from finetuning Manu McFrench using the both resize mode

For each model trained on the Peraire dataset, I used 3 compositions:

  1. the full dataset ("ALL")
  2. only data coming from the B series ("B")
  3. only data coming from the D series ("D")

I used the same composition system for the test sets.

Here are my results in the form of a table:

a table of the scored obtained on the different train set, test set and resize configurations

Fortunately, it seems that my previous interpretation is not fully contradicted by the results I obtain with this second series of training. Let's focus on two observations:

  1. Whenever a model is trained only on the D series, and tested only on the B series, it appears to be completely incapable of predicting anything but gibberish, losing between 32 and 35 points of accuracy. It confirms that the aspect of the documents from the two series are too different. On the other hand, when the model is fine-tuned on the B series only, it maintains a fairly good accuracy when applied to the D series, whichever resize mode is used. I think it confirms that the B series is enough for the model to learn some sort of formal features from Peraire's handwriting, which the models can transfer to documents written with a different writing instrument.

  2. What is very interesting is the difference between the models trained on the whole datasets and tested on the B series: when we use the both resize mode (meaning we only keep the characters from the new dataset), the model is very good. On the contrary, the performance of the model trained with the add resize mode (meaning we keep the output layer and the codec from the base model and add the new characters) is as bad as with a model trained only on the D series.

In my previous post, I wrote:

peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.

However, in the light of my experiment with the resize option, I think this is not correct. Instead, it appears that resetting the output layer by using both (or new) on accident, allowed the model to better take into account the data from the B series (pencil). Contrary to what I observed last week, the model trained on the whole dataset but this time with the add resize mode (or union) doesn't benefit from seeing more data compared to the model trained only on the B series.

My understanding is that keeping the output layer from the base model with add (or union) probably drowns the specificity of the pencil-written documents into a base knowledge tailored to handle documents with a high contrast (like the ones in the D series and in Manu McFrench's training set). Or, to put it differently, when we use both (or new), more attention is given to the pencil written documents, meaning that the model actually gets better at handling this category of data.

I am extremely curious to see how I can investigate this further, or if any of you, readers, would understand these results differently!

013 - The Peraire experiment

WARNING: in my next post, I nuance the conclusions drawn in this post, because of a parameter I didn't correctly set during the training of the models described below. You should really read it after reading this post, to get the full picture!

As a small side project during my phD, I have been sharing my expertise (and a bit of my workforce) with the members of the DIM SPE-VLP project. The acronym stands for "Sauver le patrimoine espérantiste : le voyage de Lucien Péraire (1928-1932)." The project revolves around the digitization, transcription and edition/valorization of Lucien Peraire's archives. He was a French citizen who, in the late 1920s, travelled across the European and the Asian continents, mostly by bike and using Esperanto to communicate. He kept a diary during his journey (and later published a book about his adventures). His notes are written both in French and in Esperanto and in some documents, he also used stenography.

My contribution to the project has mostly consisted in helping developing transcription models for the French diaries (although I'm also interested in the shorthand and the esperanto). This meant both helping with the production of ground truth and training Kraken models. This post will briefly explain how the ground truth was created and published, as well as present the models that were trained with it.

Peraire's notebooks are organized in different series, and each series is divided in ensembles regrouping the pages of a notebook. Each ensemble is named after the countries visited while the notebook was used. For example, notebook 11 in the B series forms one ensemble and covers a part of Peraire's travels in Japan. There are 31 notebooks in the B series. The notebooks of this series are written with a blue pencil on (low quality) school papers. On some pages, the pencil is very faded which makes it hard to read the text, let alone to run a successful segmentation task on the image. On the other hand, the D series gathers notes and comments on the diaries, written at the end of the 1960s. This time the handwriting is much easier to read because Peraire mostly used a blue or black ball-point pen. There are 9 ensembles in this series.

two extracts of Peraire's notebooks side by side, on the left the image is taken from the B series, on the right the image is taken from the D series.

One aspect that I find particularly interesting with this dataset is that we have a case where the handwriting is similar but the writing tool is different. It means that it is possible to explore how the writing tools and/or writing supports affect the efficiency of a transcription model. On top of that, all the documents were digitized under the same (good) conditions and by the same people.

Segmenting, transcribing, aligning and publishing

The first version of the dataset was solely focused on the B series. I selected 1 random page from each ensemble (avoiding to take the first page each time) to compose a train set of 33 files1. On top of that, I selected 4 additional pages from B3, B5, B12 and B18 to compose a fixed test set which would never be used as training data.

I pre-segmented the images with Kraken's default model before correcting the result manually. At this point, I also applied the segmOnto ontology for the lines and regions2. Because of the fading ink, some words could not be transcribed. In order to avoid complicating the transcription rules, I decided to simply segment out the passages that couldn't be read. On the one hand it simplifies the transcription, but on the other hand, it means that a small portion of my segmented documents cannot be re-used by others to train a segmentation model. Since we were not training a segmentation model, it was an easy decision.

screenshot showing the segmentation and the transcription panels from eScriptorium where we can see that some lines are broken down into several segments and that some segments were left blank

More recently, it was decided to augment the dataset with examples from the D series because the model trained on the B series was not good enough. This time, Gilles Pérez, a member of the project, took charge of the transcription. I recommended to create a new sample of 30 to 40 images, so he randomly selected series of 4 continuous pages from each ensemble. The transcription of the corresponding 36 pages was sent to me as a Word document. Therefore, on top of taking care of the segmentation of the images, I also went through an alignment phase during which I verified the order of the lines and copy-pasted the transcription. It took longer than I expected but it allowed me to align the transcription with the rules I had followed when creating the first set. I also picked 4 of the 36 pages to add to the test set.

The dataset is versioned and published applying the principles and tools we developed withing the frame of HTR-United. I also added illustrated segmentation and transcription guidelines.

Testing different dataset configurations to train transcription models

As I mentioned before, the goal of these datasets was to create transcription models. Taking the opportunity of the recent update of the dataset, I tried different scenarios.

I never trained the model from scratch because the dataset is too small to get any sort of usable model. Instead, I used Manu McFrench as a base model, fine-tuned with the Peraire dataset. (We were actually able to use Peraire as an example during the DH2023 conference3 earlier this month to show the usefulness of having this kind of base model). I tested fine-tuning only on the B series, only on the D series or on both the B and the D series. Then I used a B-series-only test set, a D-series-only test set and the full test set to see how the models performed.

Since I wanted to try it after discovering it during DH2023, I used CERberus 🐶🐶🐶 (I talked about it in my last post) to measure the accuracy of the models on the test sets listed above.

Like KaMI, CERberus takes 2 categories of text input: the reference (aka the ground truth) and the prediction (or the hypothesis made by the model). In order to get the prediction, I loaded my models on eScriptorium, as well as the images and transcription of the test set before applying each model to the documents. This way, all the transcription are predicted with the same segmentation, which comes from the ground truth.

Here are the results:

  • Manu McFrench, before fine-tuning, gets a CER of 26.16% when tested on the whole test set, and a score of 27.19% on the documents from the B series, 25.29% on the D series.
  • peraire_both, trained on the B and the D series, gets a CER of 4.63% when tested on the whole test set, but a score of 6.41% on the documents from the B series and 3.54% on the D series.
  • peraire_B, trained only on the B series, gets a CER of 8.72% on the whole test set, but a score of 7.12% on test-B and 9.67% on test-D.
  • peraire_D, trained only on the D series, gets an CER of 16.38% on the whole test set, but this is because of the enormous descripancy between its score on each sub test set. It skyrockets to a CER of 38,53% on test-B while going as low as 3.65% on test-D.

All of this makes sense, though.

  1. ManuMcFrench could not be used without fine-tuning, its error rate on both documents is too high.
  2. peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.
  3. peraire_B which was trained on the more difficult dataset seems to use the knowledge inherited from Manu McFrench and to have learned some formal features from Peraire's handwriting since it is able to maintain a fairly low CER on the D series (it gains 16 points of accuracy compared to Manu McFrench).
  4. peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

What happens with peraire_D is very interesting because it confirms that it is useful to compose a train set with examples of more difficult documents instead of only showing the ones that are easy to read! Now, the nice thing is that I will soon be working on a little experiment with my colleague Hugo Scheithauer where we will be able to measure the impact of the contrast between the ink and the paper. Stay tuned!

EDIT #1: I added the scores obtained by Manu McFrench alone.

EDIT #2: I added a disclaimer at the beginning of the post.

  1. I used 2 images from B2 because one of them was extremely faded and I wanted to include some of these extreme cases in the dataset, and 2 images from B30 because it consisted of shorter lines (table of contents) which I found was interesting to include. 

  2. As described in the documents, I only used the "InterlinearLine" and "DefaultLine" for the lines, and the "MainZone" and "NumberingZone" for the regions. 

  3. See the submission and the slides on HAL:

012 - "It did a very good job"

A few weeks ago, I attended the presentation of an automatic transcription software. The majority of the audience was unfamiliar with the concept of handwritten text recognition (HTR) or had little experience using it. The presentation lasted only an hour, so it couldn't delve into much detail. Its main objective was to demonstrate the software's results. The presenter showed several slides, displaying on one side images of manuscripts (often in a language unknown to the audience) and on the other side the transcriptions generated by the software. Throughout the presentation, the presenter repeatedly commented on the HTR software saying that "it did a very good job."

But what does it even mean?

The very first aspect to explore is what distinguishes a good job from a bad one. Normally, such an evaluation relies on the measurement of the accuracy of the result compared to the ideal transcription. The accuracy can be expressed positively or negatively using the error rates (a 0% error rate is the same as a 100% accuracy).

Measuring the accuracy of a prediction (another way to call the result of HTR) is commonly done at character level. The character accuracy of a model is equal to the number of matches between the prediction and the ideal transcription. The character error rate (CER) is a very common measure to express a model's theoretical efficiency.

Sometimes softwares also consider the word error rate (WER), which is the proportion of words in the prediction containing errors. A high score at WER doesn't actually mean that the transcription is bad. It only means that the errors are distributed on all the words. I never use WER alone because it is hard to get an exact impression of the quality of the prediction based on that metric alone.

There is a paper from Neudecker et al. (2021) where they test 5 different software used for evaluating the prediction. They also develop an interesting reflection on alternative metrics such as the "non-stopword accuracy", the "phrase accuracy", the "flexible character accuracy" (which is useful when the line order isn't always the same), the "figure of merit" (which "aims to quantify the effort required for manual post-correction" (p. 15)) or else the "unordered WER".

When your score is a rate, there is an implicit idea that 100% is both the maximum score and the targeted score (for accuracy of course). But in the case of HTR, 100% accuracy is extremely rare because there are also edge cases where the way a letter was drawn is ambiguous: in such cases the error is not particularly caused by the inaccuracy of the HTR engine but rather by the imperfection of the handwriting in the first place.

In Hodel et al., (2021), the authors provided a grid to interpret accuracy scores. They suggest the following three thresholds:

  • CER < 10% == good (it allows efficient post-processing)
  • CER < 5% == very good (errors are usually focused on rare or unknown words)
  • CER < 2.5% == excellent (but it is usually only reached when the handwriting is very regular)

Personally, I think this grid should also include 20% and 0%. 20% as a threshold, because at 80% of accuracy, the transcription is supposedly good enough for fuzzy search and keyword spotting (I should add a reference here, but I can't find it anymore...); and 0% because it should be reminded that an accuracy of 100% is virtually impossible.

To complement this, I would like to mention another possible approach to get an interpretable score: during the DH2023 conference, Thibault Clérice and I presented an experiment where we trained a model using the same data in the train set and the test set. Our model reached an accuracy close to 90%, which we were able to use as a baseline to define the highest accuracy score possible for the data we had. Thus we were able to consider that a model approaching 90% of accuracy would be an excellent model, as far as that dataset was concerned.

Still during the DH2023 conference, Wouter Haverals introduced CERberus 🐶🐶🐶, a web interface which addresses the same type of issues as KaMI: the lack of nuance in a plain CER computation. Indeed, in a CER score, every type of error has the same weight. This means that mistaking an "e" for a "é" costs the same as mistaking a "e" for a "0": in the first case the text is likely still readable or understandable, whereas in the latter, it might not be the case.

The CER metric is still very useful, but when applied to transcription projects, it is even more valuable when we can filter the types of errors we want to include in the evaluation.

EDIT: I should have noted here that my reflection was focused on the evaluation of an automatic transcription in cases where you already have the expected transcription. When we apply an HTR model to a whole new set of documents, we usually don't have the correct transcription at hand (otherwise we wouldn't use HTR in the first place). This is the reason why many researchers try to find ways to evaluate the quality of the transcription without ground truth. One example can be found in Clérice (2022).

So, to go back to our initial problem, we can see that there are many ways to draw the line between a good job and a bad one. The threshold will depend on the metric used to express the accuracy of the prediction and also (and actually mostly) on the way the generated text will be used down the line. Even though the software presentation I attended was short, I think we should always remind future users of HTR that 100% of accuracy is not always what they are seeking.

A short reflection to finish this post: I was bothered by the expression used to qualify the transcription. I am still trying to figure out a way to put it into words. On top of lacking accuracy, the expression "it did a good job" was also calling for a vision of HTR as a magic tool at the service of the searchers and students. But, in which other cases do you say that someone did "a good job?" Likely when you delegate a task to a subaltern.

I see a problem here: in their current state, HTR engines are efficient but not to the point that people can use them without thinking clearly about what they want the engine to produce. It is easy to sell a software pretending that it is a magic servant that will do all the transcription in your place, a tool so smart that you can even consider delegating a part of your responsibility to it. But I think when new users of HTR fail to first reflect on the outcome they can reasonably expect from these engines, it creates disappointment and crappy data and workflows.

011 - Working with synthetic data

What we call synthetic data are data generated artificially, as opposed to data taken from real-life samples. In the case of automatic transcription or layout analysis, it corresponds to creating fake documents or samples of text that look more or less like real ones, in stead of manually annotating existing documents.

One of the main advantages of using synthetic data rather than real data is the fact that it comes already annotated. For automatic transcription for example, the annotation (transcription) is the same as the string of text passed to a text image generator. If you add to that the fact you can, in theory, generate an unlimited amount of pairs of text image and transcription, it represents an incredible opportunity to accelerate the production of training datasets. An example: Doush et al., 2018 use this technique to generate PDF containing contemporary printed Arabic texts. The PDFs are printed, then re-scanned and aligned with the transcription that was used to generate the PDFs. The result is the Yarmouk dataset. As we will see later, generating fake handwritten text is a bit more difficult.

Another advantage of this technique is that it offers an efficient way around the limitations posed by sensitive or confidential data (Hu et al. 2014). However, let's note that confidentiality is rarely a problem when it comes to training HTR models on historical documents.

Generating fake data is not specific to computer vision (Raghunathan, 2021), even though it is frequently used in this case because data for computer vision tasks are costly to produce. In general, it is a fairly frequent method when machine learning techniques are involved, disregarding the field of application (Kataoka et al., 2022). OCR and HTR tasks are not an exception and we can find traces of such experiments rather early (Beusekom et al., 2008).

The first time I was exposed to the notion of synthetic data was during a informal conversation with Tom Monnier in 2019. At that time, he was working on docExtractor, a layout analysis tool that he trained with images of documents generated artificially.1

Then sometimes in 2021, while browsing through HuggingFace's spaces, I found ntt123's application that simulates handwriting. The application takes a text prompt as an input and generates an animation where the letters are traced on the page as if someone was writing them live. It's possible to play with two parameters: a value between 0 and 250 determining the writing style, and a weight determining the likelihood of the traced letters (the lower the weight, the higher the risk of hallucinated letters; the higher the weight, the more standardized the tracing). It made me think back to my conversation with Tom Monnier and I wondered if it could be used to generate pairs of text and images.

At the beginning of the year, I dedicated a good part of my time to testing data generation tools I could find online, to see if they could be used to create a set of fake ground truth that I would use later, in other experiments. I will introduce the latter in a future post, so let's first focus on handwritten data generation.

When I dug a bit more around ntt123's application, I was confronted with two things:

  1. unfortunately, ntt123's application was developed in javascript and not documented at all which made it impossible for me to hack,
  2. but luckily, it wasn't an original idea: instead it was one of many implementations of a proposition introduced by Alex Graves in 2014.

Alex Graves uses online2 data from the IAM database (Liwicki & Bunke, 2005) and an LSTM (Long Short-Term Memory) to train a model capable of generating series of coordinates that trace letters and words. Initially, the model simply generates random series of letters and words, but it is then improved to take into account a text prompt which forces the models to generate a specific series of letters. As described before, the model also takes a weight (or bias) which normalizes the likelihood of the letters' shape, and can take a "priming line": the image of a handwritten line, whose writing style the model will try to copy. Once the coordinates are generated (including key information such as "ends-of-stroke"), it is easy to place them in an SVG file and visualize the result, with or without animation.

There are many many implementations of Alex Graves's experiment because it was such an important publication to demonstrate the usefulness of LSTM models. Several can be found on Github if you search "Alex Graves". For my experiment, I didn't want to develop my own adaptation of such a model, but rather to use programs that were ready to be used. This is the reason why I didn't look for papers but instead for recent (or recently updated) repositories on Github. I focused on Python programs because I wanted to be able to understand how they were developed.

One very promising implementation of Alex Graves' proposition was Evgenii Dolotov's pytorch-handwriting-synthesis-toolkit. It came with pre-trained models, and a utility scripts to feed the program a text prompt and generate an image. I customized3 the program a bit to fix a few bugs and try to make it generate several lines maintaining the same handwriting.

4 lines stating (or supposed to state) 'did a computer write this' generated by Evgenii Dolotov's program. The fourth line is a failed attempt where several letters like y, n, m can be dinstiguished. The fifth line states 'determined to act upon the assumptions' but contains several garbled letters.

Even though the generated images were sometimes impressively realistic, it created a lot of bad output. As suggested by Alex Graves, his solution tends to generate what he calls "garbled letters", letters that no human would likely trace. In other cases, it would randomly skip some letters and be completely incapable of tracing some numbers or punctuation signs. Sometimes, the model would simply draw more or less flat lines. Since I wanted to generate fake gold data that I could trust and since the results were not reliable enough, I played with the bias and the priming lines before trying to train new models using Evgennii Dolotov's utility scripts. I failed to get better results than the pre-trained models, and failed to find the correct parameters to make sure I would obtain always realistic output.

several flat lines that at one point successfully write 'is fin'. This is a failed generated image.

At this point I started exploring GANs (Generative Adversarial Networks) which are models based on game theory. They are capable of generating realistic fake images learning from samples of real images (see Goodfellow et al., 2020). They are this kind of models used to generate photos of people who don't exist. There are Github repositories offering source code to train such models to generate fake handwriting, such as GANwriting (described in Kang et al., 2020) or Amazon's ScrabbleGAN (introduced in Fogel et al., 2020) but they were only giving instructions to reproduce the corresponding papers and train the models ourselves. Since GANs are costly to train, I left this option out for the moment, even though I do think they can become an interesting solution in the future.

Eventually, I settled for a solution based on a Diffusion model. This type of model can be found behind applications like OpenAI's DALL-E. Luhman & Luhman (2020), who created the Diffusion Handwriting Generation (later called DHG), explain very well how diffusion models work.

"Diffusion probabilistic models [...] convert a known distribution (e.g. Gaussian) into a more complex data distribution. A diffusion process converts the data distribution into a simple distribution by iteratively adding Gaussian noise to the data, and the generative model learns to reverse this diffusion process." (Luhman & Luhman, 2020, p. 1)

A great advantage with DHG compared to the LSTM approach was that it was possible to easily fix the priming line and almost always obtain a convincing output. This was essential to create a dataset with a consistent handwriting over hundred of lines. As visible in the following image, even if the diffusion model is not capable of perfectly imitating the handwriting contained in the priming line, it usually successfully captures elements of style such as the slant, or the cursive nature of the text.

five pairs of priming lines with the resulting generated lines.

After several tests, I found that the third priming line gave the best results when associated with different text prompts, so I decided to use it along with excerpts from Moby Dick to create a completely artificially generated dataset. In a few days, I created more than 8,000 images (PNG) associated with a text file (TXT) containing the prompts used to generate them.

These pairs could have been used "as is" to produce a silver synthetic dataset but, like I said before, I needed a gold dataset where the text and the images would be exact matches. Unfortunately, more than a third of the images did not qualify as gold. After manually reviewing about 2,500 of the lines (with the help of my colleague Hugo Scheithauer), we published a set of 1,280 pairs of lines and text under the name "Spinnerbait".

Even though I was able to produce a dataset meeting my main criteria, I was actually disappointed with my results: I wanted a sort of magic button which would allow me to generate, at any time and without having to review it, a perfect set of training data. Instead, in the future, if I want to add more lines to Spinnerbait, I will have to spend a few hours going through each line to filter the bad ones out.

On the other hand, I decided to take a few hours to manually copy a text taken from Guillaume Apollinaire's poems. I copied the text following a txt file that I would edit every time I would start a new line, I scanned it, segmented it with eScriptorium before copying and pasting the lines from the txt file and exported the result as a series of XML ALTO and images. It gave birth to the Moonshines dataset, a set of 1,186 lines (including 170 dedicated to a fixed test subset) of a single hand, thus comparable in size to Spinnerbaits.

I think generating both datasets took about the same amount of time, if I take into account on the one hand reviewing the generated lines and on the other hand copying the text and passing it through eScriptorium. Moonshines used less computing resources and produced a richer dataset if we consider the aspect of the text. Also, the length of the lines is more varied in Moonshines whereas it is more homogenous (max 5 words) in Spinnerbait, because the generator tended to make more errors on longer prompts.

a line taken from the Spinnerbait dataset and a line taken from the Moonshines dataset

Another important limitation that I have barely addressed at this point it that not only do these tools fail to draw non-ASCII characters, but they also tend to have a greater chance of producing garbled letters when prompted with rare or non-english words4. This is true of all the systems I have tested. Of course, we could imagine training new models on data containing a greater diversity of languages, or simply other scripts or languages.

As way of a conclusion, I would say that even though I was disappointed with what I obtained down the line, this exploratory adventure was very interesting. I learned a lot and I am convinced that if I had more time and resources (and if it were more crucial for me), I would have found a way to get better results. I know of some coming publications that used GANs to create artificial data that look like lines taken from historical documents and I really look forward reading them.

  1. It is possible to find here a recording of the talk given on this tool at ICFHR 2020. 

  2. In the context of handwritten text recognition, a distinction is made between "online" data and "offline" data. Offline data are based on a matrix of pixels containing the image of a text (they are static), whereas online data are vectors containing information about the speed, the points through which a line passes to form a letter, end of stroke points, etc. Online HTR uses data generated with an e-pen and a screen while offline HTR uses images created with a scanner or a camera. 

  3. One of the customizations consisted in removing non-ASCII characters or characters not supported by the model. It was easy to apply this transformation because in pytorch-handwriting-synthesis-toolkit, each model comes with a little metadata file which contains the character set handled by the model. 

  4. All the models were trained using the IAM database, more often the "online" database, but sometimes also with the "offline" version. 

010 - Make and Read the docs

During my last contract as a research engineer at Inria, I spent a lot of my time working on the project called LECTAUREP, in collaboration with the National Archives in France. The goal of this project was to explore new ways to index the content of thousands of thousands of notary registries which, put together, form one of the most used collections of the National Archives. I joined the project at the end of 2019, during its second phase, almost at the same time as eScriptorium was initiated. LECTAUREP had worked with Transkribus during the first phase (in 2018) but, given the connections between my research team and the team behind eScriptorium, we quickly switched to the newer software and contributed to its development.

One of my most important contribution is the redaction of a tutorial for the software, which was initially only intended as an internal resource for our team of annotators. This is the reason why the tutorial was published on LECTAUREP's blog. OpenITI, and in particular Jonathan Allen rapidly offered an English translation which, eventually, was also published on LECTAUREP's blog. Since the publication of this translation, it is listed on eScriptorium's home page as its official tutorial.

Unfortunately, the tutorial hasn't been updated in a long time whereas major updates and new features have been added on eScriptorium's side.

LECTAUREP's blog is not a good solution. It is built with Wordpress and hosted by Hypotheses which is very convenient to allow a small, well defined, group of people to collaboratively work on a research blog, but it's too heavy and not adapted to publish the documentation of a software like eScriptorium. The documentation needs to be updated frequently to keep up with the software and, in general, a blog is not a place to publish the extensive documentation of a software. To top it all, it is not even that easy to update for me, so can you imagine someone outside of LECTAUREP trying to offer an update?

I have been thinking of finding a better solution since at least 2020, but it was never so urgent that I was able to put it at the top of my to-do lists. Last Summer, I took the advantage of a rather slow couple of weeks in August, when every one but me seemed to have gone on vacations, to put something different in place.

Readthedocs quickly appeared to me as an ideal solution: the platform is designed for publishing software documentations, it handles software versions and multi-lingual contents. Last but not least, it uses static website generators. This is fundamental because it allows for the publication of the source code on a platforms like Github and will actually use this public source code to build the website.

Github is a platform designed for sharing and opening codes to external contributors. Relying on it solves a major issue with the current tutorial: if anyone can suggest the correction, edition or translation of eScriptorium's documentation, then it is more likely to keep up with the evolutions of the application!

In August, I created a new Github repository called escriptorium-documentation. I set a basic configuration and connected it to Readthedocs. As soon as this was done, the website became available at online with a URL based on the following structure: {gh_repo_name} Then, I started rewriting the content of the tutorial... following Sphinx' syntax.

It was so painful that I never got back to it after I came back from my own vacations.

Why painful? Well, I had discovered Markdown in 2017 and I have used it since. It's so powerful and yet so light! In comparison, Sphinx felt like such a complicated and heavy syntax. Not as heavy as HTML, but less intuitive nonetheless. I had to go through the documentation every time I wanted to add something as simple as a hyperlink or an image!

In January, when I gathered enough motivation1 to go back to working on eScriptorium's tutorial, I decided to look for an alternative to Sphinx compilers.

The only non-sphinx-based option available with readthedocs is Mkdocs. Like its name hints at, Mkdocs is a Markdown compiler, capable to quickly build websites. The set-up is really quick, it's well documented, fairly easy to customize and it's possible to add a lot of cool extensions which are based on Python. It was the bomb!

I liked Mkdocs so much that I also used it to rebuild my personal website!2

Over the past month, I have spent a lot of time working on this new tutorial for eScriptorium. I designed a basic structure, breaking down the features into different categories. Now the pages are progressively being filled and I am very happy to have been joined in my efforts by my colleagues Hugo Scheithauer and Floriane Chiffoleau. As we progressively merge the content of new pages to the main branch, the escriptorium-tutorial website expands. It will be ready soon for an official release!

I really hope that the transparency and simplicity brought by Mkdocs and Markdown will allow many people to add their contributions to the documentation of eScriptorium! Who knows, maybe you will too!

EDIT: we changed the name of the repository to escriptorium-documentation instead of escriptorium-tutorial (all links and mentions were changed in this post). The decision was motivated by the fact the "tutorial" felt like an inexact description of the actual scope/ambition of the project.

  1. Also when I got more free time after my classes were over

  2. It is not necessary to use readthedocs to deploy a website built with Mkdocs. In the case of the tutorial, it simply allows us to have a domain name more meaningful than "". 

009 - Looking back to 2022

Over the past decade, I've started a tradition of taking a sort-of-day-long hike on the 1st of January. There is always a chance that the weather won't be on my side, but I like the idea of starting the year peacefully and embracing nature.

Before starting 2023 though, and without much originality, I wanted to dedicate a last post for 2022 to looking back to my first full year of phD and focus on its biggest highlights as far as my phD project is concerned.

Of course, the year sure didn't look like anything I had expected! I don't think I had envisioned any of what my trips to Montréal have brought me... but also how much they consume of the time available to actually work on my research project. As you will see, it is not necessarily a bad thing - I actually want to focus on the good parts - but it doesn't mean that it was easy to wrap my head around it.

I spent two thirds of the 2022 in Montréal: 8 months out of 12, with what looks right now like a super short 4 months of Summer in France in the middle. I specify that it was Summer because, you know, ... July and August, they're not usual months.

Overall, 2022 passed in the blink of an eye.

I spent a lot of time in class or preparing for class or working on final projects. Indeed, I took 4 classes out of the 5 required by the Université de Montréal, and attended 1 as an auditor.

  • SCI6304 was a bibliometry class taught by Vincent Larivière;
  • MSL6523 was a Digital Museology class conducted by Emmanuel Chateau-Dutier;
  • LCO6525 was a Compared Chinese Literature class given by Victoria-Oana Lupascu;
  • HNU7000 was a class focused on the Epistemology of the Digital Humanities lead by Michael E. Sinatra;
  • and SCI6203, the extra class, was about AI and textual data, it was given by Dominic Forest.

As you can see, they were very diverse but I think I learned quite a few things from them that I will be able to use more or less directly for my research.

  1. In HNU7000, we often had to write or present critical summaries of articles or conferences and in LCO6525, we were asked to turn in a "annotated bibliography" (at least 400 words for 6 references related to the topic of our final paper) halfway through the semester. With these exercises, I think I found a better way to summarize articles or books I read. I can still improve the "critical" aspect of the exercise, but I think I got better at summarizing what was actually relevant to me when I read. Now, I hope to publish some of these future summaries here.

  2. In SCI6304, my final project consisted in a bibliometric investigation on the publications on HTR over the past 40 years. I would like to publish the result as an article, which is a project for 2023. But I can already see that doing so, I gained a much better understanding of the field(s) related to my research project, I found some keywords that I need to investigate and I added a bunch of references to my potentially-to-read list.

  3. In MSL6523, the final poject consisted in a blog post for Museonum. Mine was focused on the current use of automatic transcription and crowdsourcing by patrimonial institutions. I have already been able to reuse this post as a reading suggestion for a class where I was invited to present eScriptorium.

I had a lot of other activities throughout the year (teaching, presenting at conferences, etc), but, as far as my thesis project is concerned, two of my biggest concerns were 1) reading and 2) refining my research project.

I tried different strategies to read more, but I am a slow reader and my schedule is often fragmented. I have started finding the beginning of a solution in keeping one day of the week totally without meetings, and by always keeping it the same day. I started implementing this during the past semester and it seemed to work: Wednesday was the day I was going to the library or simply dedicating to my homework. On Wednesday, I never scheduled meetings (unless absolutely necessary). Now that I am back in France for 4 months, I need to make sure I am able to keep this routine. On the other hand, I am trying to solve the speed problem by better sorting the references I end up reading in order not to waste time on a text that was actually not a priority.

One of the things I learned in 2022 and that affects my reading is the usefulness of expanding the types of publications I read. Let me explain. Working at ALMAnaCH, I saw most of my colleagues focus on reading articles and pre-prints. The main objective there is to keep up to date with a state of the art that evolves quickly and for a discipline that is fairly recent. On the other hand, at the Université de Montréal, more emphasis is placed on the conceptual frame of the research. Therefore, many more books and chapters, of disciplines that are not always the same as the envisioned project, make their way into the bibliographies. It might seem obvious to some of you, but it wasn't for me at the start of the year and this is exactly the type of enrichment I was looking for when I went for a cotutelle.

Now, as far as refining my research project is concerned, this is something that I didn't start solving before December. At the Université de Montréal, when I apply for a grant, I often have to introduce my research project, including my methodology and the conceptual frame. Writing these texts, it often made me feel like I was not progressing in terms of making my project more precise. I had a general idea of the topic I wanted to work on and which issues I wanted to address in general, but no clear strategy.

I often thought about it during the year, looking for a way to solve this problem. It wasn't a question of narrowing down my topic or use-case, I think mine is/are already well defined. It is only at the occasion of a 20 minutes presentation assigned during the HNU7000 seminar that I sat at a table and put together my thoughts and the results of my discussions with my supervisors. I will soon dedicate a post to the current state of my research project, but what I think I was lacking the most was an angle to efficiently narrow down the scope of my project and justify future choices.

Even though I started focusing on how different 2022 looked compared to what I had imagined, I am actually very happy with this year. I met a ton of people who have broadened my horizons on research and academia in general. And I start 2023 full of ideas for my phD and this blog! Keep an eye out for them in future posts!

008 - A Canadian Grail

June 2022 witnessed a pretty cool planet alignment, but it seems this omen wasn't about my registration to the RAMQ for my second semester in Canada. I didn't mean to post again about Health Insurance in Canada, but I can't resist telling this story.

Building from my experience (and mistakes) during Winter 2022, I thought that the process would be a piece of cake from then on. Uh-uh, not so fast.

"At least, I started right from the beginning with the correct form!"1

True. However, I got too confident in the speed of the process and forgot to take into account the French Summer vacations. I didn't send the signature request to my university before the 1st of August (a little over a month before my departure). The people in charge of signing were already on vacation. While this step had been very quick in Winter, this time it took almost a month: I only got the form with its signature back on the 30th of August, less than a week before my departure.

I immediately printed the document and rushed to the Post office to send it to the Assurance Maladie. I still had one month and two weeks to finish the full process in order to remove some 300$CA fees from the bill at the University of Montréal. It seemed feasible. In December, my initial envoy to the Assurance Maladie had seemingly never made it to the Assurance Maladie, so this time I decided to play it safe: I printed the document myself2 and sent it as a registered letter, with a tracking number and a return receipt request.

How glad (and despairing) was I! Thanks to this tracking number, I was able to see that my letter... never left that Post office. After a week and a half, I sent a claim to La Poste, but since I couldn't wait for them to lead an investigation3, my partner had to print the form a second time and send it again (I was already in Canada at that time).4

Phew! The letter finally made it to the Assurance Maladie. But the time frame was narrowing down quickly! I had to contact the Assurance Maladie via the mobile app to try to politely urge then to send it back to me quickly. Eventually, I received it duly signed on September 27. I had two weeks left to send it to the RAMQ and get an acceptation letter to be able to reduce my bill.

Wasting no time, I praised the RAMQ for allowing me to upload the file(s) on their website, but cursed them when I received a confirmation notice telling me that they would take up to 60 days to process it. 60 days! I only had 17 left!

At this point, as far as my bill was concerned, I had two options:

  1. Pay the 300$CA and hope for the process to be over by the 15th of November (to have it refunded to me as a credit from UdeM on my next tuition5);

  2. Or refuse to pay that part of the bill and begin having a debt of 300$CA towards the University, potentially pay interests on it and hope for the process to be over by the 15th of November (to have it retroacctively removed from the bill).

This semester, I was only finally getting my 600$CA from Winter, because in February I had chosen the first and safest option. This time, I took a gamble and went for the second scenario. I wasn't sure of the actual consequences of contracting a debt with the University, but I was holding on the hope that the RAMQ wouldn't take as long as they announced.

October seemed to go by really slowly.

The first sign of life from the RAMQ reached me on the 26th of October. Unfortunately, in a letter dated from the 14th, they listed all the items they expected to find in the form and rejected my request until I sent a complete form ASAP.

"Wait, what? My form, not complete?!"

Was it possible I used the wrong form again? No, it wasn't that... Did I forget to check a box somewhere? No... Did the Assurance Maladie not correctly fill their part? No, still not that. What was wrong then? The only way to find out at that point was to call the RAMQ. I did so the following day.

I'll skip the hour-long wait on the phone with (or without) music and, boom! Here is the answer: somehow, they only received the last page of the PDF I uploaded on their website. But twice. Twice the last page. The first page, nowhere to be seen!

I refused to let a bug cost me 300$CA, so I begged the operator to let me have a way to send the missing page faster than the Post Office, and safer than their glitching upload platform. This is how I found myself (with my visiting partner) at the nearest Uniprix, playing with their public printer and sending my form... by fax!

The waiting resumed.

I had no idea how long it would take. Did that fax even reach the operator?

I was ready to give up on my 300$CA. I was losing hope that I would ever get an answer on time. I was cursing the University of Montréal for making it so stupidly complicated to remove an illegitimate amount from my bill, cursing the RAMQ and the Assurance Maladie for being so slow and not having automatized such a process with a shared plateform, cursing La Poste for being unreliable and losing my letters, cursing myself for not starting the process sooner, had I known! Raaaah!

But on November 4, suddenly, unexpectedly, there it was. My Canadian Grail. The letter of acceptation into the RAMQ.

With trembling hands, unable to erase a smile from my face, blinking once or twice, I captured it with my phone, and connected to the Task Manager on the University of Montréal website.


Press "Confirm"

And, a few days later poof! gone, my 300$CA debt.

Only remained, the 2,56$CA of interests generated during these three weeks, on a debt that should have never existed.

But I surrender, I won't fight any longer.

I'll pay this time.

  1. Aka, I didn't use SE-401-Q-104, nor used the outdated version of SE-401-Q-106. 

  2. La Poste offers a service consisting in allowinf you to upoload a PDF which they will print and send for you. I used in December due to the difficulty to find a open Post office during the Christmas period. 

  3. The investigation took less than 2 weeks if I remember correctly. But I was right to make another envoy in parallel because they never found the letter and eventually sent me a refund. 

  4. Yes! It is the second time La Poste loses my SE-401-Q-106... How unlucky is that? 

  5. As I explained in my the post n°006, this amount is not directly refunded to you. It only appears as a debt to you from the University and will only be deducted from your next bill. It's unconvenient for two reasons: 1) it's a credit you end up giving the University and given the amount, you could really use that money for something else (like buying food for a month or paying rent), 2) when you don't pay tuition every semester (as is my case), you have to wait a really long time for that money to make it back to you!