
012 - "It did a very good job"

A few weeks ago, I attended the presentation of an automatic transcription software. The majority of the audience was unfamiliar with the concept of handwritten text recognition (HTR) or had little experience using it. The presentation lasted only an hour, so it couldn't delve into much detail. Its main objective was to demonstrate the software's results. The presenter showed several slides, displaying on one side images of manuscripts (often in a language unknown to the audience) and on the other side the transcriptions generated by the software. Throughout the presentation, the presenter repeatedly commented on the HTR software saying that "it did a very good job."

But what does that even mean?

The very first aspect to explore is what distinguishes a good job from a bad one. Normally, such an evaluation relies on measuring the accuracy of the result against an ideal transcription. Accuracy can be expressed positively (an accuracy rate) or negatively (an error rate): a 0% error rate is the same as 100% accuracy.

Measuring the accuracy of a prediction (another name for the output of HTR) is commonly done at the character level. The character accuracy of a model is the proportion of characters in the prediction that match the ideal transcription. Its negative counterpart, the character error rate (CER), is a very common measure of a model's theoretical efficiency.

Some software also reports the word error rate (WER), which is the proportion of words in the prediction that contain errors. A high WER doesn't necessarily mean that the transcription is bad; it may only mean that the errors are spread across many words. I never use WER alone because it is hard to get an accurate impression of the quality of the prediction from that metric alone.
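To make the two metrics concrete, here is a minimal sketch of how CER and WER are typically computed from an edit (Levenshtein) distance, in plain Python with no dependencies; real evaluation tools add text normalization, alignment options and so on. Note how a single wrong character barely moves the CER of a line but immediately inflates its WER.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (of characters or of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(reference, prediction):
    """Character error rate: edits needed per reference character."""
    return levenshtein(reference, prediction) / len(reference)

def wer(reference, prediction):
    """Word error rate: edits needed per reference word."""
    ref_words, pred_words = reference.split(), prediction.split()
    return levenshtein(ref_words, pred_words) / len(ref_words)

reference = "the quick brown fox"
prediction = "the quick brawn fox"
print(f"CER: {cer(reference, prediction):.2%}")  # 1 substitution / 19 characters = 5.26%
print(f"WER: {wer(reference, prediction):.2%}")  # 1 wrong word / 4 words = 25.00%
```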

In a paper by Neudecker et al. (2021), the authors test five different tools used to evaluate predictions. They also develop an interesting reflection on alternative metrics such as the "non-stopword accuracy", the "phrase accuracy", the "flexible character accuracy" (which is useful when the line order isn't always the same), the "figure of merit" (which "aims to quantify the effort required for manual post-correction" (p. 15)) and the "unordered WER".

When your score is a rate, there is an implicit idea that 100% is both the maximum score and the targeted score (for accuracy, of course). But in the case of HTR, 100% accuracy is extremely rare because there are also edge cases where the way a letter was drawn is ambiguous: in such cases the error is not so much caused by the inaccuracy of the HTR engine as by the imperfection of the handwriting in the first place.

In Hodel et al. (2021), the authors provide a grid for interpreting accuracy scores. They suggest the following three thresholds:

  • CER < 10% == good (it allows efficient post-processing)
  • CER < 5% == very good (errors are usually focused on rare or unknown words)
  • CER < 2.5% == excellent (but it is usually only reached when the handwriting is very regular)

Personally, I think this grid should also include 20% and 0%. 20% as a threshold because, at 80% accuracy, the transcription is supposedly good enough for fuzzy search and keyword spotting (I should add a reference here, but I can't find it anymore...); and 0% because it is worth remembering that an accuracy of 100% is virtually impossible.
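As a small illustration, the grid (extended with the extra 20% threshold suggested above) could be turned into a helper function; the cut-offs and labels simply mirror the discussion here and are not a standard of any kind.

```python
def interpret_cer(cer: float) -> str:
    """Map a character error rate (0.0-1.0) to the rough quality labels
    discussed above (Hodel et al. 2021 thresholds plus the 20% one)."""
    if cer < 0.025:
        return "excellent (usually only reached with very regular handwriting)"
    if cer < 0.05:
        return "very good (errors focused on rare or unknown words)"
    if cer < 0.10:
        return "good (efficient post-processing is possible)"
    if cer < 0.20:
        return "usable for fuzzy search and keyword spotting"
    return "below the thresholds discussed here"

print(interpret_cer(0.04))  # -> "very good (errors focused on rare or unknown words)"
```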

To complement this, I would like to mention another possible approach to obtaining an interpretable score: during the DH2023 conference, Thibault Clérice and I presented an experiment where we trained a model using the same data in the train set and the test set. Our model reached an accuracy close to 90%, which we were able to use as a baseline defining the highest accuracy score achievable for the data we had. Thus, we were able to consider that a model approaching 90% accuracy would be an excellent model as far as that dataset was concerned.

Still during the DH2023 conference, Wouter Haverals introduced CERberus 🐶🐶🐶, a web interface which addresses the same type of issue as KaMI: the lack of nuance in a plain CER computation. Indeed, in a CER score, every type of error has the same weight. This means that mistaking an "e" for an "é" costs the same as mistaking an "e" for a "0": in the first case the text is likely still readable or understandable, whereas in the latter case it might not be.

The CER metric is still very useful, but when applied to transcription projects, it is even more valuable when we can filter the types of errors we want to include in the evaluation.
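I have no insight into how CERberus or KaMI implement this internally, but the general idea of weighting or filtering error types can be sketched as a Levenshtein distance with custom substitution costs, for instance counting a diacritic-only confusion as cheaper than a confusion between unrelated characters:

```python
import unicodedata

def strip_diacritics(char: str) -> str:
    """Return the base character without combining marks ('é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    # Hypothetical weighting: a diacritic-only confusion ('e' vs 'é')
    # costs half as much as a confusion between unrelated characters.
    if strip_diacritics(a) == strip_diacritics(b):
        return 0.5
    return 1.0

def weighted_cer(reference: str, prediction: str) -> float:
    """CER where substitutions are weighted by substitution_cost."""
    prev = [float(j) for j in range(len(prediction) + 1)]
    for i, r in enumerate(reference, start=1):
        curr = [float(i)]
        for j, h in enumerate(prediction, start=1):
            curr.append(min(
                prev[j] + 1.0,      # deletion
                curr[j - 1] + 1.0,  # insertion
                prev[j - 1] + substitution_cost(r, h),
            ))
        prev = curr
    return prev[-1] / len(reference)

print(weighted_cer("café", "cafe"))  # 0.125 instead of 0.25 with a plain CER
print(weighted_cer("café", "caf0"))  # 0.25: an unrelated character keeps its full cost
```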

EDIT: I should have noted here that my reflection was focused on the evaluation of an automatic transcription in cases where you already have the expected transcription. When we apply an HTR model to a whole new set of documents, we usually don't have the correct transcription at hand (otherwise we wouldn't use HTR in the first place). This is the reason why many researchers try to find ways to evaluate the quality of the transcription without ground truth. One example can be found in Clérice (2022).

So, to go back to our initial problem, we can see that there are many ways to draw the line between a good job and a bad one. The threshold will depend on the metric used to express the accuracy of the prediction and also (and actually mostly) on the way the generated text will be used down the line. Even though the software presentation I attended was short, I think we should always remind future users of HTR that 100% accuracy is not always what they are seeking.

A short reflection to finish this post: I was bothered by the expression used to qualify the transcription, and I am still trying to figure out how to put that discomfort into words. On top of lacking precision, the expression "it did a good job" also promotes a vision of HTR as a magic tool at the service of researchers and students. But in which other cases do you say that someone did "a good job"? Likely when you delegate a task to a subordinate.

I see a problem here: in their current state, HTR engines are efficient, but not to the point that people can use them without thinking clearly about what they want the engine to produce. It is easy to sell software by pretending that it is a magic servant that will do all the transcription in your place, a tool so smart that you can even consider delegating part of your responsibility to it. But when new users of HTR fail to first reflect on the outcome they can reasonably expect from these engines, I think it creates disappointment, crappy data and crappy workflows.

011 - Working with synthetic data

What we call synthetic data are data generated artificially, as opposed to data taken from real-life samples. In the case of automatic transcription or layout analysis, it corresponds to creating fake documents or samples of text that look more or less like real ones, instead of manually annotating existing documents.

One of the main advantages of using synthetic data rather than real data is the fact that it comes already annotated. For automatic transcription, for example, the annotation (the transcription) is simply the string of text passed to a text-image generator. If you add the fact that you can, in theory, generate an unlimited number of image/transcription pairs, it represents an incredible opportunity to accelerate the production of training datasets. An example: Doush et al. (2018) use this technique to generate PDFs containing contemporary printed Arabic text. The PDFs are printed, then re-scanned and aligned with the transcription that was used to generate them. The result is the Yarmouk dataset. As we will see later, generating fake handwritten text is a bit more difficult.
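For printed text, the principle is easy to sketch: render a known string onto an image and keep the string as the label, so the pair is annotated by construction. The minimal Pillow example below is only an illustration of that idea, not the Yarmouk pipeline (which involves printing and re-scanning); the font path and file names are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

def make_synthetic_line(text: str, font_path: str, out_path: str) -> None:
    """Render `text` on a white background and save it; the transcription
    is simply the text we started from."""
    font = ImageFont.truetype(font_path, size=48)  # font_path is hypothetical
    # Measure the rendered text to size the image around it.
    left, top, right, bottom = font.getbbox(text)
    image = Image.new("L", (right + 40, bottom + 40), color=255)
    draw = ImageDraw.Draw(image)
    draw.text((20, 20), text, font=font, fill=0)
    image.save(out_path)
    # Save the matching "annotation" next to the image.
    with open(out_path.replace(".png", ".txt"), "w", encoding="utf-8") as f:
        f.write(text)

make_synthetic_line("Call me Ishmael.", "SomeHandwritingFont.ttf", "line_0001.png")
```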

Another advantage of this technique is that it offers an efficient way around the limitations posed by sensitive or confidential data (Hu et al. 2014). However, let's note that confidentiality is rarely a problem when it comes to training HTR models on historical documents.

Generating fake data is not specific to computer vision (Raghunathan, 2021), even though it is frequently used there because data for computer vision tasks are costly to produce. In general, it is a fairly common method wherever machine learning techniques are involved, regardless of the field of application (Kataoka et al., 2022). OCR and HTR tasks are no exception, and we can find traces of such experiments rather early (Beusekom et al., 2008).

The first time I was exposed to the notion of synthetic data was during an informal conversation with Tom Monnier in 2019. At that time, he was working on docExtractor, a layout analysis tool that he trained with images of documents generated artificially.1

Then sometime in 2021, while browsing through HuggingFace's Spaces, I found ntt123's application that simulates handwriting. The application takes a text prompt as input and generates an animation in which the letters are traced on the page as if someone were writing them live. It is possible to play with two parameters: a value between 0 and 250 determining the writing style, and a weight determining the likelihood of the traced letters (the lower the weight, the higher the risk of hallucinated letters; the higher the weight, the more standardized the tracing). It made me think back to my conversation with Tom Monnier and I wondered if it could be used to generate pairs of text and images.

At the beginning of the year, I dedicated a good part of my time to testing data generation tools I could find online, to see if they could be used to create a set of fake ground truth that I would use later, in other experiments. I will introduce the latter in a future post, so let's first focus on handwritten data generation.

When I dug a bit more around ntt123's application, I was confronted with two things:

  1. unfortunately, ntt123's application was developed in JavaScript and not documented at all, which made it impossible for me to hack,
  2. but luckily, it wasn't an original idea: instead it was one of many implementations of a proposition introduced by Alex Graves in 2014.

Alex Graves uses online2 data from the IAM database (Liwicki & Bunke, 2005) and an LSTM (Long Short-Term Memory) network to train a model capable of generating series of coordinates that trace letters and words. Initially, the model simply generates random series of letters and words, but it is then improved to take into account a text prompt, which forces the model to generate a specific series of letters. As described before, the model also takes a weight (or bias) which normalizes the likelihood of the letters' shapes, and it can take a "priming line": the image of a handwritten line whose writing style the model will try to copy. Once the coordinates are generated (including key information such as end-of-stroke points), it is easy to place them in an SVG file and visualize the result, with or without animation.
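To give an idea of what "placing the coordinates in an SVG file" means, here is a small sketch that turns a sequence of (x, y, end_of_stroke) points, the kind of output such a model produces, into an SVG path; the sample points are made up, and real generations contain many more points per letter.

```python
def strokes_to_svg(points, width=800, height=200):
    """Turn a list of (x, y, end_of_stroke) tuples into an SVG string.
    A new sub-path ("M x y") is started after each end-of-stroke flag,
    so the pen is lifted between strokes."""
    path_data, pen_down = [], False
    for x, y, end_of_stroke in points:
        command = "L" if pen_down else "M"
        path_data.append(f"{command} {x:.1f} {y:.1f}")
        pen_down = not end_of_stroke  # lift the pen after an end-of-stroke point
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<path d="{" ".join(path_data)}" fill="none" stroke="black" stroke-width="2"/>'
        "</svg>"
    )

# Made-up points tracing two short strokes.
sample = [(10, 50, 0), (30, 20, 0), (50, 50, 1), (70, 50, 0), (90, 20, 1)]
with open("line.svg", "w", encoding="utf-8") as f:
    f.write(strokes_to_svg(sample))
```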

There are many, many implementations of Alex Graves's experiment, because it was such an important publication for demonstrating the usefulness of LSTM models. Several can be found on Github if you search for "Alex Graves". For my experiment, I didn't want to develop my own adaptation of such a model, but rather to use programs that were ready to be used. This is the reason why I didn't look for papers but instead for recent (or recently updated) repositories on Github. I focused on Python programs because I wanted to be able to understand how they were developed.

One very promising implementation of Alex Graves's proposition was Evgenii Dolotov's pytorch-handwriting-synthesis-toolkit. It came with pre-trained models and utility scripts to feed the program a text prompt and generate an image. I customized3 the program a bit to fix a few bugs and to try to make it generate several lines while maintaining the same handwriting.

Lines stating (or supposed to state) 'did a computer write this', generated by Evgenii Dolotov's program. The fourth line is a failed attempt in which several letters such as y, n and m can be distinguished. The fifth line states 'determined to act upon the assumptions' but contains several garbled letters.

Even though the generated images were sometimes impressively realistic, the program also produced a lot of bad output. As Alex Graves himself notes, his solution tends to generate what he calls "garbled letters", letters that no human would likely trace. In other cases, it would randomly skip some letters and was completely incapable of tracing some numbers or punctuation signs. Sometimes, the model would simply draw more or less flat lines. Since I wanted to generate fake gold data that I could trust and since the results were not reliable enough, I played with the bias and the priming lines before trying to train new models using Evgenii Dolotov's utility scripts. I failed to get better results than the pre-trained models, and failed to find the correct parameters to make sure I would always obtain realistic output.

several flat lines that at one point successfully write 'is fin'. This is a failed generated image.

At this point I started exploring GANs (Generative Adversarial Networks), which are models based on game theory. They are capable of generating realistic fake images by learning from samples of real images (see Goodfellow et al., 2020). This is the kind of model used to generate photos of people who don't exist. There are Github repositories offering source code to train such models to generate fake handwriting, such as GANwriting (described in Kang et al., 2020) or Amazon's ScrabbleGAN (introduced in Fogel et al., 2020), but they only gave instructions to reproduce the corresponding papers and train the models oneself. Since GANs are costly to train, I left this option aside for the moment, even though I do think they can become an interesting solution in the future.

Eventually, I settled for a solution based on a Diffusion model. This type of model can be found behind applications like OpenAI's DALL-E. Luhman & Luhman (2020), who created the Diffusion Handwriting Generation (later called DHG), explain very well how diffusion models work.

"Diffusion probabilistic models [...] convert a known distribution (e.g. Gaussian) into a more complex data distribution. A diffusion process converts the data distribution into a simple distribution by iteratively adding Gaussian noise to the data, and the generative model learns to reverse this diffusion process." (Luhman & Luhman, 2020, p. 1)

A great advantage of DHG compared to the LSTM approach was that it was possible to easily fix the priming line and almost always obtain a convincing output. This was essential to create a dataset with consistent handwriting over hundreds of lines. As visible in the following image, even if the diffusion model is not capable of perfectly imitating the handwriting contained in the priming line, it usually captures elements of style such as the slant or the cursive nature of the text.

five pairs of priming lines with the resulting generated lines.

After several tests, I found that the third priming line gave the best results when associated with different text prompts, so I decided to use it along with excerpts from Moby Dick to create a completely artificially generated dataset. In a few days, I created more than 8,000 images (PNG) associated with a text file (TXT) containing the prompts used to generate them.

These pairs could have been used "as is" to produce a silver synthetic dataset but, like I said before, I needed a gold dataset where the text and the images would be exact matches. Unfortunately, more than a third of the images did not qualify as gold. After manually reviewing about 2,500 of the lines (with the help of my colleague Hugo Scheithauer), we published a set of 1,280 pairs of lines and text under the name "Spinnerbait".

Even though I was able to produce a dataset meeting my main criteria, I was actually disappointed with my results: I wanted a sort of magic button which would allow me to generate, at any time and without having to review it, a perfect set of training data. Instead, in the future, if I want to add more lines to Spinnerbait, I will have to spend a few hours going through each line to filter the bad ones out.

On the other hand, I decided to take a few hours to manually copy a text taken from Guillaume Apollinaire's poems. I copied the text following a TXT file that I edited every time I started a new line. I then scanned the pages, segmented them with eScriptorium, copied and pasted the lines from the TXT file, and exported the result as a series of ALTO XML files and images. It gave birth to the Moonshines dataset, a set of 1,186 lines (including 170 dedicated to a fixed test subset) in a single hand, thus comparable in size to Spinnerbait.
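For reference, pulling the line transcriptions back out of ALTO files such as the ones eScriptorium exports only takes a few lines of Python; here is a sketch using the standard library (the file name is made up, and the wildcard namespaces keep it independent of the exact ALTO version).

```python
import xml.etree.ElementTree as ET

def alto_lines(alto_path: str) -> list[str]:
    """Return the text of each line in an ALTO file, concatenating the
    CONTENT attributes of the String elements inside every TextLine."""
    root = ET.parse(alto_path).getroot()
    lines = []
    for text_line in root.iterfind(".//{*}TextLine"):
        words = [s.attrib.get("CONTENT", "") for s in text_line.iterfind(".//{*}String")]
        lines.append(" ".join(w for w in words if w))
    return lines

for line in alto_lines("moonshines_page_01.xml"):  # hypothetical file name
    print(line)
```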

I think generating both datasets took about the same amount of time, if I take into account on the one hand reviewing the generated lines and on the other hand copying the text and passing it through eScriptorium. Moonshines used fewer computing resources and produced a richer dataset in terms of the appearance of the text. Also, the length of the lines is more varied in Moonshines, whereas it is more homogeneous (max 5 words) in Spinnerbait, because the generator tended to make more errors on longer prompts.

a line taken from the Spinnerbait dataset and a line taken from the Moonshines dataset

Another important limitation that I have barely addressed at this point is that not only do these tools fail to draw non-ASCII characters, but they also tend to have a greater chance of producing garbled letters when prompted with rare or non-English words4. This is true of all the systems I have tested. Of course, we could imagine training new models on data containing a greater diversity of scripts and languages.
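As mentioned in note 3 below, a simple workaround on the prompt side is to filter it against the character set a given model supports; a minimal sketch (the charset shown is just an example, not the one shipped with any particular model):

```python
def filter_prompt(prompt: str, charset: set[str], replacement: str = "") -> str:
    """Drop (or replace) any character the model's charset does not cover."""
    return "".join(c if c in charset else replacement for c in prompt)

# Example charset: lowercase ASCII letters, digits and a little punctuation.
charset = set("abcdefghijklmnopqrstuvwxyz0123456789 .,'-")
print(filter_prompt("café & naïveté!", charset))  # -> "caf  navet"
```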

By way of conclusion, I would say that even though I was disappointed with what I obtained in the end, this exploratory adventure was very interesting. I learned a lot and I am convinced that if I had more time and resources (and if it were more crucial for me), I would have found a way to get better results. I know of some upcoming publications that used GANs to create artificial data that look like lines taken from historical documents and I really look forward to reading them.


  1. It is possible to find here a recording of the talk given on this tool at ICFHR 2020. 

  2. In the context of handwritten text recognition, a distinction is made between "online" data and "offline" data. Offline data are based on a matrix of pixels containing the image of a text (they are static), whereas online data are vectors containing information about the speed, the points through which a line passes to form a letter, end of stroke points, etc. Online HTR uses data generated with an e-pen and a screen while offline HTR uses images created with a scanner or a camera. 

  3. One of the customizations consisted in removing non-ASCII characters or characters not supported by the model. It was easy to apply this transformation because in pytorch-handwriting-synthesis-toolkit, each model comes with a little metadata file which contains the character set handled by the model. 

  4. All the models were trained using the IAM database, more often the "online" database, but sometimes also with the "offline" version. 

010 - Make and Read the docs

During my last contract as a research engineer at Inria, I spent a lot of my time working on the project called LECTAUREP, in collaboration with the National Archives in France. The goal of this project was to explore new ways to index the content of thousands upon thousands of notary registers which, put together, form one of the most used collections of the National Archives. I joined the project at the end of 2019, during its second phase, almost at the same time as eScriptorium was initiated. LECTAUREP had worked with Transkribus during the first phase (in 2018) but, given the connections between my research team and the team behind eScriptorium, we quickly switched to the newer software and contributed to its development.

One of my most important contributions was the writing of a tutorial for the software, which was initially only intended as an internal resource for our team of annotators. This is the reason why the tutorial was published on LECTAUREP's blog. OpenITI, and in particular Jonathan Allen, rapidly offered an English translation which, eventually, was also published on LECTAUREP's blog. Since its publication, this translation has been listed on eScriptorium's home page as the official tutorial.

Unfortunately, the tutorial hasn't been updated in a long time whereas major updates and new features have been added on eScriptorium's side.

LECTAUREP's blog is not a good solution. It is built with WordPress and hosted by Hypotheses, which is very convenient for allowing a small, well-defined group of people to work collaboratively on a research blog, but it is too heavy and not suited to publishing the documentation of software like eScriptorium. The documentation needs to be updated frequently to keep up with the software, and, in general, a blog is not the place to publish extensive software documentation. To top it all off, it is not even that easy for me to update, so can you imagine someone outside of LECTAUREP trying to offer an update?

I have been thinking of finding a better solution since at least 2020, but it was never so urgent that I was able to put it at the top of my to-do lists. Last summer, I took advantage of a rather slow couple of weeks in August, when everyone but me seemed to have gone on vacation, to put something different in place.

Readthedocs quickly appeared to me as an ideal solution: the platform is designed for publishing software documentation, and it handles software versions and multilingual content. Last but not least, it relies on static website generators. This is fundamental because it allows the source of the documentation to be published on a platform like Github, which Readthedocs then uses to build the website.

Github is a platform designed for sharing code and opening it to external contributors. Relying on it solves a major issue with the current tutorial: if anyone can suggest corrections, edits or translations of eScriptorium's documentation, then it is more likely to keep up with the evolution of the application!

In August, I created a new Github repository called escriptorium-documentation. I set up a basic configuration and connected it to Readthedocs. As soon as this was done, the website became available online at a URL based on the following structure: {gh_repo_name}.readthedocs.io. Then, I started rewriting the content of the tutorial... following Sphinx's syntax.

It was so painful that I never got back to it after I came back from my own vacations.

Why painful? Well, I had discovered Markdown in 2017 and have used it ever since. It's so powerful and yet so light! In comparison, Sphinx's syntax (reStructuredText) felt complicated and heavy. Not as heavy as HTML, but less intuitive nonetheless. I had to go through the documentation every time I wanted to add something as simple as a hyperlink or an image!
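To give an idea of what I mean, here is the same hyperlink and image written in the reStructuredText syntax used by Sphinx and then in Markdown; the URL and the image path are placeholders.

```
reStructuredText (Sphinx):
    `eScriptorium tutorial <https://example.org/tutorial>`_

    .. image:: images/segmentation.png
       :alt: Segmentation step

Markdown:
    [eScriptorium tutorial](https://example.org/tutorial)

    ![Segmentation step](images/segmentation.png)
```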

In January, when I gathered enough motivation1 to go back to working on eScriptorium's tutorial, I decided to look for an alternative to the Sphinx-based compilers.

The only non-Sphinx-based option available with Readthedocs is Mkdocs. As its name hints, Mkdocs is a Markdown compiler capable of quickly building websites. The setup is really quick, it's well documented and fairly easy to customize, and it's possible to add a lot of cool Python-based extensions. It was the bomb!
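For context, the whole configuration of such a site is a single YAML file at the root of the repository. The sketch below is only indicative (the page names are invented), not the actual configuration of escriptorium-documentation.

```yaml
# mkdocs.yml -- indicative example, not the project's real configuration
site_name: eScriptorium documentation
theme:
  name: readthedocs
nav:
  - Home: index.md
  - Import documents: import.md
  - Segment and transcribe: transcribe.md
markdown_extensions:
  - admonition
  - toc:
      permalink: true
```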

I liked Mkdocs so much that I also used it to rebuild my personal website!2

Over the past month, I have spent a lot of time working on this new tutorial for eScriptorium. I designed a basic structure, breaking down the features into different categories. The pages are now progressively being filled, and I am very happy to have been joined in my efforts by my colleagues Hugo Scheithauer and Floriane Chiffoleau. As we progressively merge the content of new pages into the main branch, the escriptorium-documentation website expands. It will soon be ready for an official release!

I really hope that the transparency and simplicity brought by Mkdocs and Markdown will allow many people to add their contributions to the documentation of eScriptorium! Who knows, maybe you will too!

EDIT: we changed the name of the repository to escriptorium-documentation instead of escriptorium-tutorial (all links and mentions have been changed in this post). The decision was motivated by the fact that "tutorial" felt like an inexact description of the actual scope/ambition of the project.


  1. Also when I got more free time after my classes were over

  2. It is not necessary to use readthedocs to deploy a website built with Mkdocs. In the case of the tutorial, it simply allows us to have a domain name more meaningful than ".github.io".