
022 - McCATMuS #5 - Training models

Last week, I visited Rimouski in the Bas-Saint-Laurent region of Québec, along the south-eastern bank of the St. Lawrence River. I was invited to contribute to discussions around the Nouvelle-France Numérique project, and I took this opportunity to present HTR-United and CATMuS, as well as preliminary results on training a McCATMuS model. In preparation for this presentation, I ran a series of tests on the first two models I trained. Today, this blog post gives me a space to discuss these tests and their results in more detail.

The Kraken McCATMuS models were not trained directly on the HuggingFace dataset I introduced in my previous post, but rather on ARROW files created from the same ALTO XML files used to build the HuggingFace dataset. At the beginning of September, I wrote a Python script which reproduces the split of the ALTO XML files into the train, validation and test sets, and which applies the same line filtering and modifications I previously presented. Instead of generating the PARQUET files for HuggingFace, it simply creates alternative .catmus_arrow.xml files and three listings of these files, ready to be served to a ketos compile command1.
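To give an idea of what this listing step looks like, here is a minimal sketch; the paths and the split function are hypothetical stand-ins, the actual script reproduces the exact HuggingFace split and line filtering.

from pathlib import Path
import hashlib

def assign_split(path: Path) -> str:
    # stand-in for the real logic: a deterministic, hash-based 90/5/5 partition;
    # the actual script reproduces the HuggingFace train/validation/test split
    bucket = int(hashlib.md5(path.name.encode()).hexdigest(), 16) % 100
    return "train" if bucket < 90 else "val" if bucket < 95 else "test"

splits = {"train": [], "val": [], "test": []}
for xml_file in sorted(Path("./datasets").rglob("*.catmus_arrow.xml")):
    splits[assign_split(xml_file)].append(str(xml_file))

# one listing per split, each ready to be piped to ketos compile (see note 1)
for split_name, paths in splits.items():
    Path(f"list_of_paths_{split_name}.txt").write_text("\n".join(paths) + "\n")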

I used Kraken 4.3.13 to train the models on Inria's computation server, because I've had dependency issues with Kraken 5 and haven't fixed them yet. The first model I trained strictly followed the train/validation split thanks to the --fixed-splits option. After 60 epochs, the model plateaued at a character accuracy of 79.9%. When applied to the test set, this accuracy remained at 78.06%, a drop of barely two points.

I trained a second model using the same parameters2 but without the --fixed-splits option, allowing Kraken to shuffle the train and validation sets into a 90/10 split (the test set was left untouched, however). This time, the training lasted 157 epochs before stopping, with the best model reaching an accuracy of 92.8% on the validation set. When applied to the test set, however, the model lost 7 points of accuracy (85.24%).

Learning curve (character and word accuracies) for the model trained on the fixed, "feature"-based split between train and validation.
Learning curve (character and word accuracies) for the model trained on the random split between train and validation.

Although disappointing, this was consistent with the observations made when training the CATMuS Medieval model:

As anticipated, the "General" split exhibits lower CER, given the absence of out-of-domain documents, whereas the "Feature"-based split surpasses 10%. This higher score presents an intriguing challenge for developing more domain-specific models that consider factors such as script type and language. (from Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al.. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece. ⟨hal-04453952⟩ p. 15)

So the drop in accuracy observed on the test set is, as suggested in Clérice et al. (2024), likely due to the fact that with a fixed split, the model is both validated and tested against out-of-domain hands and documents (although the documents differ between the two sets). The model trained with a random split, on the other hand, is validated against known hands and documents, but tested on out-of-domain examples.

The test set contains transcriptions of printed, typewritten and handwritten texts, covering all centuries. Limiting ourselves to a single accuracy score computed on the whole test set would tell us very little about the model's capacities and limitations. This is why I divided the test set into several smaller test sets based on the century of the documents and/or on the main type of writing they contain. For documents spanning several centuries, I used the most represented century.
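Concretely, the sub-test sets can be derived from a simple metadata table. Here is a sketch of the idea, assuming a hypothetical test_set_metadata.csv with one row per test document and century and writing_type columns:

import pandas as pd

test_meta = pd.read_csv("test_set_metadata.csv")  # columns: path, century, writing_type

subsets = {}
for writing_type, group in test_meta.groupby("writing_type"):
    subsets[writing_type] = group["path"].tolist()
for century, group in test_meta.groupby("century"):
    subsets[f"century {century}"] = group["path"].tolist()
handwritten = test_meta[test_meta["writing_type"] == "handwritten"]
for century, group in handwritten.groupby("century"):
    subsets[f"(HW) century {century}"] = group["path"].tolist()

# each listing can then be evaluated separately (with ketos test, for example)
for name, paths in subsets.items():
    print(name, len(paths))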

I only used the McCATMuS model trained on the random split for these tests, because the accuracy of the other one was too low for the results to be meaningful. Rather than testing McCATMuS alone, I also ran Manu McFrench V3 and McFondue on the McCATMuS test set: they are two generic models trained on similar data (although with no, or different, normalization approaches).

| Test set | McCATMuS | Manu McFrench V3 | McFondue |
| --- | --- | --- | --- |
| All | 85.24 | 91.17 | 76.12 |
| Handwritten | 78.72 | 89.40 | 75.17 |
| Print | 96.37 | 94.15 | 78.30 |
| Typewritten | 90.93 | 92.69 | 58.13 |
| 17th cent. | 87.27 | 86.39 | 72.81 |
| 18th cent. | 88.65 | 94.21 | 81.64 |
| 19th cent. | 79.81 | 93.70 | 75.46 |
| 20th cent. | 74.92 | 86.52 | 56.74 |
| 21st cent. | 73.86 | 90.20 | 68.04 |
| (HW) 17th cent. | 58.69 | 64.83 | 64.26 |
| (HW) 18th cent. | 85.38 | 93.35 | 80.47 |
| (HW) 19th cent. | 79.81 | 93.70 | 75.46 |
| (HW) 20th cent. | 63.02 | 82.23 | 55.89 |
| (HW) 21st cent. | 73.86 | 90.20 | 68.04 |

I was initially surprised by the consistent margin Manu McFrench had over McCATMuS, considering it was trained on less data (73.9K + 8.8K lines, against 106K + 5.8K lines) which had not been harmonized to follow the same transcription rules. However, these scores are actually biased in favor of Manu McFrench, because several of the documents included in the McCATMuS test set were also used in Manu McFrench's train set. Even though this is not true for all documents, it concerns almost half of the test set. The same might be true for McFondue, but this model scores higher than McCATMuS in only one instance (handwritten documents from the 17th century). Creating a new test set, with documents that are not present in any of the train sets but follow the CATMuS guidelines, would be a good way to confirm this bias.

Additionally, I detected an issue in one of the datasets used in the test set: FoNDUE_Wolfflin_Fotosammlung contains some lines with faulty transcriptions, resulting from uncorrected automatic text recognition, which most certainly skew the evaluation of all three models.

A couple of examples of the faulty transcriptions, along with the CER they generate when compared to what a correct transcription would be (the CER is computed with CERberus):

| Line image | Faulty transcription | Correct transcription | Resulting CER |
| --- | --- | --- | --- |
| text line image reading, in print, "COLLECTION HANFSTAENGL LONDON" | "CSTITHER, KIESERMAEAER AogS." | "COLLECTION HANFSTAENGL LONDON" | 89.29 |
| text line image reading, in print, "NATIONAL GALLERY" | "PEcLioL." | "NATIONAL GALLERY" | 175.0 |

I plan to manually check this dataset and update the McCATMuS dataset accordingly. I don't know yet how many lines are affected.

The better accuracy of the Manu McFrench model is not just a product of the biases in the test set. I had the opportunity to apply both models to two documents, one from the 17th century and one from the 20th century. In both cases, Manu McFrench's transcription seemed more likely to be correct than McCATMuS's. This led me to compare the training parameters used for both models and to start a third training experiment using Manu McFrench's parameters. In this case, the batch size is reduced to 16 (as opposed to 32) and the Unicode normalization follows NFKD instead of NFD.
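To make the normalization difference concrete, here is a quick illustration (standard library only) of what NFD and NFKD do to a few characters: NFD only applies canonical decomposition, while NFKD also applies compatibility decomposition, which changes the set of characters the model has to learn.

import unicodedata

for sample in ["é", "ﬁ", "½"]:
    nfd = unicodedata.normalize("NFD", sample)
    nfkd = unicodedata.normalize("NFKD", sample)
    print(repr(sample), "NFD:", repr(nfd), "NFKD:", repr(nfkd))

# "é" is decomposed by both ("e" + combining acute accent), but "ﬁ" -> "fi"
# and "½" -> "1⁄2" only under NFKD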

If the results of this third training are consistent with the previous experiments, it will be interesting to see whether adding more data to the training set improves the results. I also have yet to test the model in a fine-tuning scenario.

As said at the beginning of this post, these results are preliminary, so I hope to have more to share in the coming weeks.


  1. The command looks like this: cat "./list_of_paths.txt" | xargs -d "\n" ketos compile -o "./binary_dataset.arrow" --random-split .0 .0 1.0 -f alto. 

  2. The configuration of Kraken for training these two models relies on the default network architecture, NFD Unicode normalization, a learning rate of 0.0001 (1e-4), a batch size of 32, a padding of 16 (default value), and applies augmentation (--augment). The --fixed-splits option is used for the first model. Following Kraken's default behavior, the training stops when the validation loss does not decrease for 10 epochs (early stopping); this prevents the model from overfitting, which is confirmed by the accuracy scores of the intermediary models on the test set (orange line on the graphs). The training is done on a GPU. 

021 - McCATMuS #4 - Cleaning data, collection metadata

Preparing the data for CATMuS would certainly have taken much more time had I not been able to benefit from Thibault Clérice's experience with CATMuS Medieval. Not only was I able to build on the workflow he set up, but I also relied heavily on his scripts to parse and build the final dataset into PARQUET files that were pushed to HuggingFace. Most of these steps are described in Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece, which will be presented at the ICDAR conference in Athens in a few days.

For McCATMuS, I started by downloading all the datasets (keeping track of the official releases), then I manually reorganized them so that the transcriptions and images were always under {dataset_repo}/data/{sub_folder}, which made later manipulation easier. Based on the notes I took while filtering the datasets, and after generating a character table for each dataset with Chocomufin, I created several conversion tables to harmonize the transcriptions. The conversions are a mix of single-character or multi-character replacements ([ and [[?]]) and more or less sophisticated replacements based on regular expressions (#r#«).1

Here is a sample of the Chocomufin conversion table used for the LECTAUREP datasets. If a character is replaced by itself, it remains unchanged in the dataset; replacing it makes it possible either to remove a character from the dataset (the ¥) or to harmonize its transcription with the CATMuS guidelines (see œ and ° for example). A rough sketch of how such a table is applied follows the sample.

char,name,replacement,codepoint,mufidecode,order
#r# «,Repl extra space before LEFT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00AB,,0
#r# »,Repl extra space before RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00BB,,0
[[?]],replace [[?]] with ⟦⟧,⟦⟧,,,0
[?],replace [?] with ⟦⟧,⟦⟧,,,0
),RIGHT PARENTHESIS,),0029,),
m,LATIN SMALL LETTER M,m,006D,m,
É,LATIN CAPITAL LETTER E WITH ACUTE,É,00C9,E,
a,LATIN SMALL LETTER A,a,0061,a,
",",COMMA,",",002C,",",
e,LATIN SMALL LETTER E,e,0065,e,
^,CIRCUMFLEX ACCENT,^,005E,^,
œ,LATIN SMALL LIGATURE OE,oe,0153,oe,
̂,COMBINING CIRCUMFLEX ACCENT,̂,0302,,
W,LATIN CAPITAL LETTER W,W,0057,W,
°,DEGREE SIGN,^o,00B0,*,
¥,YEN SIGN,,00A5,,
½,VULGAR FRACTION ONE HALF,1/2,00BD,0.5,
h,LATIN SMALL LETTER H,h,0068,h,
r,LATIN SMALL LETTER R,r,0072,r,
æ,LATIN SMALL LETTER AE,ae,00E6,ae,
ȼ,LATIN SMALL LETTER C WITH STROKE,c,023C,c,
∟,RIGHT ANGLE,,221F,[UNKNOWN],
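As announced above, here is a rough, simplified sketch of how such a table can be applied to a line of text. It is only meant to illustrate the principle (literal replacements, removals, and #r#-prefixed regular expressions), not to reproduce what chocomufin convert actually does:

import csv
import re

def apply_table(text: str, table_path: str) -> str:
    # illustrative only: apply each rule of the conversion table in file order
    with open(table_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            char, replacement = row["char"], row["replacement"]
            if char == replacement:
                continue  # the character is kept as is
            if char.startswith("#r#"):
                text = re.sub(char[3:], replacement, text)  # regex rule
            else:
                text = text.replace(char, replacement)      # literal rule (or removal)
    return text

# e.g., with the LECTAUREP table above saved as "table_lectaurep.csv":
# apply_table("N° 3 : une sœur", "table_lectaurep.csv") would return "N^o 3 : une soeur"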

It wasn't possible to use a single conversion table for all the datasets because some had different transcription approaches. While replacing ¬ with - could, in principle, be applied to every dataset, normalizing the way corrections and uncertainties were transcribed was another story. For example, in some of the CREMMA datasets, >< is used to signal a crossed-out word, while in other datasets <> is used. So replacing > with ⟦ and < with ⟧ in >hello< meant that in some cases we would successfully get ⟦hello⟧, while in other cases we would end up with ⟧hello⟦. There are a few documents where I had to manually intervene in the XML file to fix the transcription. In such cases, I either fork the dataset repository to keep track of the corrected version of the ground truth, or I push the correction back into the original dataset to create a new, more consistent version.

In general, the converted dataset is saved as .catmus.xml files, which allows us to keep track of the original ground truth and to easily adjust the conversion table later if necessary.

In the second post of this series, I mentioned that "the CATMuS guidelines can (should?) be used as a reference point" and that "if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented." Providing a Chocomufin conversion table along with a dataset that uses project-specific guidelines would be an excellent practice to ensure that the dataset is indeed compatible with CATMuS.

Once all the .catmus.xml files were ready, I created a new metadata table for McCATMuS listing all the subdirectories under each dataset's "data" folder. This table was used as a basis to start collecting additional metadata at the document level rather than at the dataset level, such as the language used in the source or the type of writing (printed, handwritten or typewritten). Working at the document level is important because some datasets contain different types of writing and/or are multilingual. In some cases, when a document mixed different languages and/or different types of writing and the distinction could be made at the image level, I manually sorted the images and created two different subfolders. This is what I did in the "Memorials for Jane Lathrop Stanford" dataset, for example: the subfolder "PageX-LettreX" mixed typewritten and handwritten letters, so I sorted them into "PageX-LettreX-handwritten" and "PageX-LettreX-typewritten" in order to have the most accurate metadata possible.

Other metadata included the assignment of a call number (or shelfmark) to each source represented in the datasets. In some cases a call number may apply to multiple subfolders, but in most cases each subfolder is de facto a different document. Retrieving the call number is useful for several reasons: it allows for an accurate assessment of the diversity of documents in McCATMuS; it allows a document to be associated with additional metadata found in its institution's catalog; and the list of call numbers can be used during benchmarking or production to check whether a document is known to the models trained on the dataset, thus explaining potentially higher accuracy scores.
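Here is what such a check could look like, as a small sketch; the two CSV files and their names are hypothetical, the only assumption being that each one exposes a shelfmark column:

import pandas as pd

test_shelfmarks = set(pd.read_csv("mccatmus_test_metadata.csv")["shelfmark"])
train_shelfmarks = set(pd.read_csv("manu_mcfrench_train_metadata.csv")["shelfmark"])

# documents of the test set that a model trained on the other set already "knows"
overlap = test_shelfmarks & train_shelfmarks
print(f"{len(overlap)} of {len(test_shelfmarks)} test shelfmarks also appear in the train set")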

In the few cases where the source used to build the ground truth did not have a corresponding call number, I simply made one up, keeping "nobs_" as a signal that it is a made-up call number. Thus, while "cph_paris_tissage_1858/" in "timeuscorpus" is now associated with its corresponding call number at the Paris archive center (Paris, AD75, D1U10 386), CREMMAWiki's "batch-04", which is composed of documents we created for the project, is associated with a made-up call number: "nobs_cremma-wikipedia_b04".

In the end, when the PARQUET files are created, the metadata from the table I just presented is collected, along with information extracted from parsing the contents of the XML files. Each piece of metadata is then represented at the text-line level. If you compare McCATMuS with CATMuS Medieval using HuggingFace's dataset viewer, you can see that they don't use exactly the same metadata.

"Language", "region type" and "line type" (which are based on the segmOnto classification), "project" and "gen_split" are common to both datasets, along with "shelfmark" I just described above. They both have a "genre" column with similar values (treatise, epistolary, document of practice, etc.). In the case of CATMuS Medieval, "genre" is complemented by "verse" (prose, verse).

Following Thibault's advice, I defined the creation date of a text line using two numbers ("not_before" and "not_after") instead of a single "century" value. This allows for a precise dating when it is possible or, on the contrary, for spreading the dating over several centuries when that cannot be avoided, which is more accurate in both cases.

McCATMuS mixes printed, handwritten and typewritten documents, so it was important to have a "writing type" column to help filter the dataset on this information, in cases where one does not want to mix them. This metadata also makes it possible to use McCATMuS to train a classifier capable of distinguishing between the different types of writing. CATMuS Medieval, on the other hand, contains only handwritten sources, so such metadata would be useless; instead, it can rely on paleographic classifications to characterize each text line with a "script type" metadata, which includes values such as "caroline", "textualis", "hybrida", etc.

McCATMuS also has a "color" column that helps sort text lines based on whether the source image is colored (true) or in grayscale (false).
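In practice, these columns make it easy to carve out sub-corpora directly from the published dataset. Here is a sketch using the HuggingFace datasets library; the dataset identifier and the exact column names are given as an assumption, to be checked against the dataset card:

from datasets import load_dataset

dataset = load_dataset("CATMuS/modern", split="train")  # hypothetical identifier

# keep only handwritten lines, on colour images, dated within the 19th century
subset = dataset.filter(
    lambda row: row["writing_type"] == "handwritten"
    and row["color"]
    and row["not_before"] >= 1800
    and row["not_after"] <= 1900
)
print(len(subset))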

Although I reused the scripts developed by Thibault to build this dataset, I had to make several modifications to include the new metadata in the PARQUET files and to apply additional filtering to the text lines. This included updating the mapping to the segmOnto vocabulary to match what existed in my datasets, and filtering out some types of lines, such as those identified as signatures.2 I also updated "writing_type" at the line level whenever the value of "line_type" made it possible to check it, as in the snippet below.

if ":handwritten" in line_type:
    writing_type = "handwritten"
    line_type = line_type.replace(":handwritten", "")
elif ":print" in line_type:
    writing_type = "printed"
    line_type = line_type.replace(":print", "")
elif ":typewritten" in line_type:
    writing_type = "typewritten"
    line_type = line_type.replace(":typewritten", "")
else:
    writing_type = metadata["writing_type"]

In the end, having built such a dataset (the first version of McCATMuS contains some 117K text lines!) with such a variety of metadata is very satisfying, although there is room for improvement. I have already mentioned that it would be interesting to have a greater variety of languages in McCATMuS. I also know that some of the values in "writing_type" are not completely accurate, so adding a control based on a classifier might be interesting. Finally, I've noticed that some transcriptions in the "FoNDUE_Wolfflin_Fotosammlung" dataset are not correct at all, probably due to an automatic transcription that wasn't corrected.

However, before we dive into improving McCATMuS, it's important to first examine the accuracy of the models that can be built on top of it! This will be the topic of the next and last post in this series!


  1. To learn more about how chocomufin convert works, just read the software's short documentation. 

  2. I don't think it makes sense to include signatures in a dataset to train a generic model, since the transcription of such lines can be very context specific. 

020 - McCATMuS #3 - Datasets selection

HTR-United made identifying candidate datasets for McCATMuS a piece of cake. Once the rest of the CATMuS community agreed on the period to be covered by a "modern and contemporary" dataset, I created a simple script to parse the content of the HTR-United catalog and list the existing datasets covering documents written in the Latin alphabet and matching our time criteria.

Actually, here is the script!

import requests
import yaml

import pandas as pd

# get latest htr-united.yml from main repository
url_latest_htrunited = "https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
response = requests.get(url_latest_htrunited)
catalog = yaml.safe_load(response.content)

def in_time_scope(dates):
    century_scope_min = 1600
    century_scope_max = 2100
    # this means that we allow datasets that intersect with the period
    if int(dates.get("notBefore")) < century_scope_min and int(dates.get("notAfter")) < century_scope_min:
        return False
    elif int(dates.get("notBefore")) > century_scope_max and int(dates.get("notAfter")) > century_scope_max:
        return False
    return True

filtered_by_date = []
for entry in catalog:
    if in_time_scope(entry.get("time", {})):
        filtered_by_date.append(entry)
print(f"Found {len(filtered_by_date)} entries matching the time scope.")

targeted_script = "Latn"
filtered_by_script = []
for entry in filtered_by_date:
    if targeted_script in [s.get("iso") for s in entry.get("script")]:
        filtered_by_script.append(entry)
print(f"Found {len(filtered_by_script)} entries matching the script criteria.")

cols = ["Script Type", "Time Span", "Languages", "Repository", "Project Name", "Dataset Name"]

metadata_df = pd.DataFrame(columns=cols)

selected_entries = filtered_by_script
for entry in selected_entries:
    row = {k:"" for k in cols}
    languages = [l for l in entry.get("language", [])]
    if len(languages) == 1:
        row["Languages"] = languages[0]
    elif len(languages) > 1:
        row["Languages"] = ", ".join(languages)
    else:
        print("Couldn't find a field for language in this repository")
        row["Languages"] = "no language"
    # build the time span string (notBefore-notAfter)
    row["Time Span"] = f'{entry.get("time").get("notBefore")}-{entry.get("time").get("notAfter")}'
    row["Project Name"] = entry.get("project-name", "no project name")
    repository = entry.get("url", "no url found")
    if repository.startswith("https://github.com/"):
        row["Repository"] = repository.split("https://github.com/")[-1]
    elif repository.startswith("https://zenodo.org/"):
        row["Repository"] = repository.replace("https://zenodo.org/", "zenodo:")
    else:
        row["Repository"] = repository
    row["Dataset Name"] = entry.get("title", "no title found")
    script_type = entry.get("script-type")
    if script_type == "only-typed":
        row["Script Type"] = "Print"
    elif script_type == "only-manuscript":
        row["Script Type"] = "Handwritten"
    else:
        row["Script Type"] = "Mixed"
    metadata_df.loc[len(metadata_df)] = row

metadata_df
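The last step, saving the table for the manual review described below, is a one-liner appended to the script above (the filename is mine):

metadata_df.to_csv("htr_united_candidates.csv", index=False)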

I saved the output as a CSV and proceeded to go through each of the selected datasets and its metadata. I checked several things:

  • I made sure the datasets were available and easy to download. For example, I excluded those requiring manual image retrieval.
  • I checked the format of the data because I decided to initially focus only on datasets available in ALTO XML and PAGE XML.
  • I controlled the overall compatibility between the transcription guidelines used for the dataset and those designed by CATMuS.
  • I also checked the conformity of the dataset by trying to import it into eScriptorium. This import allowed me to detect discrepancies between the names of the image files and the value given for the source image in the XML file, which prevented the import from running successfully.1
  • Loading a sample of the dataset in eScriptorium also allowed me to visually control other incompatibilities with CATMuS that may not have been documented by the producers of the data.2
  • Finally, I considered the structure of the repository and, when necessary, the facility to reorganize it into a single data/ folder containing the images and the XML files, often distributed among sub-folders.

I assigned each dataset a priority number from 1 to 6. The lowest number was for datasets compatible with CATMuS without any modification (no dataset was given a priority rank of 1...) and 6 was for massive datasets that would require a nerve-racking script to be built correctly. My grading system is shown below.

  • 1=ready as is
  • 2=need to be chocomufin-ed
  • 3=require manual corrections but the dataset is very small, or the dataset is chocomufin/catmus compatible but requires a script to build it
  • 4=require manual corrections but the dataset is relatively big, or require a script to be fixed
  • 5=require manual corrections but the dataset is really big
  • 6=require manual corrections but the dataset is really big and require a personalized script to be built

For example, "Notaires de Paris - Bronod" had to be modified to comply with CATMuS requirements. This included replacing [[ and ]] with and , or also to ignore lines containing ¥, a symbol used in LECTAUREP's datasets to transcribe signatures and paraphs. These were straightforward modifications, thanks to Chocomufin. On the complete opposite, "University of Denver Collections as Data - HTR Train and Validation Set JCRS_2020_5_27" is a massive dataset (2660 XML files), but there are segmentation errors in this dataset, creating erroneous transcriptions given the way the line is drawn, and the annotation of the superscripted text is not compatible with CATMuS. To make it compatible with CATMuS, it would be necessary to control and correct each page one by one.

I chose to focus on datasets with priority 2 for the first version of McCATMuS. Indeed, it will be possible to add more datasets to CATMuS in later versions, so there was no need to spend too much time on manually cleaning datasets. I had 23 priority-2 datasets to go through.

Identifying eligible datasets was not as time-consuming as cleaning them and collecting additional metadata turned out to be. However, it gave me a good idea of the challenges I would face when trying to aggregate the datasets. I would have liked to find a greater diversity of languages, but this wasn't possible at this stage, mainly because many non-French datasets require more elaborate corrections than applying Chocomufin and were thus given a priority score higher than 2.

The next post will be covering the tedious phase of data cleaning and aggregation, along with metadata collection!


  1. It was the case for "Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)", where the ALTO XML files are not explicitly linked to their corresponding source images. I believe it can be fixed, but it would require creating a script just for this purpose, and the dataset presented other incompatibilities with CATMuS' guidelines. 

  2. For example, "Argus des Brevets" contains some segmentation errors that will need to be corrected manually. 

019 - McCATMuS #2 - Defining guidelines

Previous experiments have shown that conflicting transcription guidelines in training datasets make it less likely that a model will learn to transcribe correctly. This is particularly relevant when it comes to abbreviations, and it's something to keep in mind when merging existing datasets. We didn't really address this when we trained the Manu McFrench model because it's difficult to retroactively align datasets to follow the same transcription rules. Unless you can afford to manually check every line, of course. In the case of Manu McFrench, however, we only merged datasets that didn't resolve abbreviations, which ensured a minimum of cohesion.

CATMuS was built on the foundation laid by CREMMALab and the annotation guidelines developed by Ariane Pinche at the end of a seminar organized in 2021. These guidelines are intended to be generic, meaning they should be compatible with most transcription situations and are not project-specific. Following these guidelines will help data producers create ground truth that is compatible with data from other projects. It will also help those projects save time by not having to create transcription rules from scratch. From my experience, it is indeed easy for the members of a project discovering HTR to get caught up in the specifics of one project and forget what is and is not relevant (or even complicating) in the transcription phase.

It's worth mentioning that a project can choose to follow some of the CATMuS guidelines, while maintaining more specific rules for certain cases. If that's the case, the CATMuS guidelines can (should?) be used as a reference point. Ideally, the specific rules defined by a project should be retro-compatible with CATMuS. For example, if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented.

As CREMMALab focused on the transcription of medieval manuscripts, so did the first CATMuS dataset and guidelines. As I said in my previous post, I focused on data covering the modern and contemporary periods, for which there was no equivalent to the CREMMALab guidelines. So, when extending CATMuS to these periods, I started by collecting existing guidelines and comparing them. I used the CREMMA Medieval guidelines, the CREMMA guidelines for modern and contemporary documents, SETAF's guidelines and CATMuS Print's guidelines as a basis to elaborate the transcription rules for McCATMuS.

For each rubric, I compared what each set of rules suggested, when they covered it. It was rare for all guidelines to align, but some cases were easy to solve. For example, all the guidelines recommended not to differentiate between regular s (⟨s⟩) and long s (⟨ſ⟩), except for the rules I had set for the modern and contemporary sources transcribed by CREMMA in 2021, before the CREMMALab seminar. It was thus decided that for McCATMuS there would be no distinction between all types of s's.

Some rubrics needed to be discussed to figure out why the rule had been chosen in the first place by some of the projects, to decide which one to keep for McCATMuS. In February, I met with Ariane Pinche and Simon Gabay to go over the rubrics that still needed to be set. One example of a rule we discussed is how hyphenations are handled. CATMuS Medieval and the two CREMMA guidelines say to always use the same symbol (⟨-⟩), whereas for the SETAF and CATMuS Print datasets, inline hyphenations (⟨-⟩) are differentiated from hyphenations at the end of a line (⟨¬⟩). Other symbols, like ⟨⸗⟩, were unanimously rejected.

Two factors were considered when making those decisions: the feasibility of a retro-conversion for the existing datasets and the compatibility of the rule with a maximum of projects. In the case of hyphenation, I eventually decided to follow the same rule as CATMuS Medieval and CREMMA. On top of simplifying the compatibility of McCATMuS with CATMuS Medieval, replacing all ⟨¬⟩ with ⟨-⟩ was much more straightforward than retroactively placing ⟨¬⟩ wherever there was indeed a hyphenation at the end of a line1.
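In code, the chosen direction really is trivial, while the opposite one is not a pure string operation. A small sketch to illustrate the asymmetry:

def to_catmus_hyphenation(line: str) -> str:
    # chosen retro-conversion: every end-of-line hyphenation mark becomes ⟨-⟩
    return line.replace("¬", "-")

# The reverse direction would require deciding, for every line ending in "-",
# whether the hyphen really splits a word across two lines (e.g. by looking at
# the next line or a lexicon) before turning it into "¬".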

Once the set of rules was fixed, I used it to sort through the different datasets I had identified (I'll discuss this in the next post) and to decide which ones would be retained for McCATMuS v1. I also defined the transformation scenarios necessary to turn each of these datasets into a CATMuS-compatible version. Then, once McCATMuS v1 was ready, I integrated the modern and contemporary guidelines into the CATMuS website, where the transcription guidelines for CATMuS Medieval were already published.

Now that I am done integrating the rules set for McCATMuS into the website, I am confident that we have successfully designed rules that are overall compatible across the medieval, modern and contemporary periods, despite some unavoidable exceptions. Two good examples of the impossibility of covering a whole millennium of document production with a single rule are abbreviations and punctuation signs.

I've now explained how the transcription guidelines were established for McCATMuS. Next, I'll cover how they were integrated into existing datasets to create the first version of the McCATMuS dataset.


  1. You can't assume that every instance of ⟨-⟩ at the end of a line must be replaced with a ⟨¬⟩. In many cases, this can be a simple typographic decoration marking the end of a paragraph or the end of a title. 

018 - McCATMuS #1 - Overview

Last week, I attended ADHO's annual conference in Washington DC. I presented a short paper, co-authored with Floriane Chiffoleau and Hugo Scheithauer, about the documentation we wrote for eScriptorium (I wrote a post about it last year and you can also find our presentation here). I was also a co-author on a long paper presented by Ariane Pinche on the CATMuS Medieval dataset.

CATMuS, which stands for "Consistent Approach to Transcribing ManuScripts", is a collective initiative and a framework to aggregate ground truth datasets using compatible transcription guidelines for documents from different periods written in Romance languages. It started with CATMuS Medieval, but since January this year, I have been working on a version of CATMuS for the modern and contemporary periods.

While I should (and will) try to publish a data paper on CATMuS Modern & Contemporary (I'll call it McCatmus from now on), I figured I could start with a series of blog posts here. I want to describe the various steps I followed in order to eventually release a dataset on HuggingFace and hopefully soon the corresponding transcription model.

I started working on McCatmus in January, but because of a major personal event (I moved to Canada!), it took seven months of stop-and-go before the release of the V1. This was particularly challenging due to the scale of the project and its technicality (it was hard to get back into McCatmus after several weeks of interruption, which I had to do several times).

To add to this complexity, McCatmus was also a multi-front operation. Indeed, to create McCatmus, it was necessary to:

  • define transcription guidelines in collaboration with other data producers,
  • identify datasets compatible with the guidelines and set priorities,
  • actually make all the datasets compatible with each other and clean some of the data,
  • model and collect metadata that made sense for this dataset,
  • release the dataset and fix the issues that came up.

To this date, two tasks remain on my to-do list for McCatmus: train a transcription model corresponding to this dataset and compare it with other existing ones, and make sure to have a publication describing this dataset and its usefulness.

My plan is to dedicate one post to the creation of the guidelines for the dataset, then a post about the identification and collection of the datasets used in McCatmus v1, and then I'll wrap up with a post about the process to create the dataset, the metadata and the release. Stay tuned!

017 - Deploying eScriptorium online: notes on CREMMA's server specifications

eScriptorium is a web application designed to perform automatic text recognition campaigns, by default powered by the OCR/HTR engine Kraken. It comes in a decentralized form, meaning that the application is not distributed by a single organization but can, on the contrary, be deployed by several actors on many different servers. In fact, you can also deploy eScriptorium on your personal machine, simulating a local server.1

As eScriptorium is gaining attention, more institutions are interested in building their own server to host the application and offer it to their associates. At Inria, we deployed eScriptorium for the first time in 2020, specifically for the project called LECTAUREP which we ran with the French national archives between 2018 and 2021. While the initial server was hosted on a virtual machine, without any GPU, and open to a relatively small number of users, our current eScriptorium application already counts nearly 500 users and will soon be hosted on a very different server infrastructure, funded by the CREMMA project. Between the original LECTAUREP-eScriptorium server and the CREMMA server, we moved to a dedicated server (Traces-6) for which we invested about 20K€.

Since I have been regularly in touch with people from different institutions who were looking into buying the hardware to create their own server for eScriptorium, I thought it was high time to put all the deets in writing!

To write today's post, I'm very happy to welcome a second pair of hands: Thibault Clérice's. His expertise and involvement in designing CREMMA server are crucial here!

Let's first discuss some technical requirements, then we'll describe how the CREMMA server was designed. We'll finish with some very important remarks on the necessity (or not) of building a server, and on useful alternatives for the community!

Should you buy GPUs?

GPUs (or Graphics Processing Units) are not mandatory at all when you use eScriptorium. This is the reason why it is perfectly acceptable to run eScriptorium locally, on your own computer. Actually, GPUs are not even mandatory to train Kraken models: training can be done on CPUs (your computer's processor); it will simply be much, much slower.

That, however, is true for personal or light use of the training features. If on the contrary you create a server open to dozens of users or more, then connecting eScriptorium to GPUs is very much a good idea: since training a model on a CPU alone can take 2-3 days (or much more), you don't really want 10 users to start a training task at the same time. In the absence of shared GPUs, their training will be queued for days or even weeks and the overload might degrade the experience of other users on the rest of the application. As long as we are building an infrastructure (and hopefully sharing costs), we may as well enhance the experience of everyone, no?

This being said, you shouldn't rush and go buy a GPU right away. Instead, you should first look at options to optimize its usage or at infrastructures that are already available to you. For example, the FONDuE infrastructure, at the University of Geneva, doesn't use the GPUs only for eScriptorium: they connect their application to a cluster which is used by researchers for intense computation tasks outside of eScriptorium (it's an HPC with a university-wide queue controlled by SLURM). This is a very good solution for optimization, because training Kraken models is not a constant activity: if the GPU is dedicated to eScriptorium only, then it will be used for a few hours here and there, not even at 100% of its capacity. Think of it: users of the application will usually need to train a model at the beginning of their transcription campaign, therefore once they have an accurate model, they will focus on using the model for prediction, which doesn't rely on the GPUs (and Kraken isn't really optimized for GPU usage at prediction time anyway).

Other possibilities include connecting the server to a completely physically separate cluster where training jobs are submitted. This is a possibility that several people told me they were exploring, but I don't know if anyone has set it up already. Why would you opt for a solution with an external cluster? To replace a huge investment cost (original funding) with smaller (but much more regular) operating costs: for example, for CREMMA, nearly half of our 40K€ budget was spent, in 2022, on buying two A100 graphic cards from Nvidia. When using someone else's GPUs, not only do you save the money you would spend on the hardware, but on top of that, you contribute to optimizing the use of GPUs already in place. Another reason is that you might not have the human resources to administer the system and the GPUs. There are multiple computation clusters created for academia (off the top of our heads: Jean Zay or Calcul Québec), and you could even consider using commercial solutions as well (like AWS, Google Cloud and the like). Then your money is spent on the actual computation and not on making the computation possible in the first place.

Fair enough, plugging eScriptorium's task manager into an external server might not be that simple. However, for smaller groups of users, it is also worth taking into account that it is perfectly possible to train Kraken models using Kraken directly (through an SSH connection to a (super-)cluster, for example) before uploading them into the application. In such a case, eScriptorium is only used for its ergonomics, not as a simplified interface to train models.

Let's summarize the point here: GPUs are not always a must-have for eScriptorium or Kraken, so you should definitely consider first and foremost your future usage. They currently represent the biggest share in the hardware expenses to build a calculation server. There are options out there where you don't spend 10K€ to buy a GPU but rather connect to an external, ready-to-use service. Or, if you do decide to spend the money, you should consider ways to maximize its usage for other training tasks, possibly outside of eScriptorium.

Some considerations on storage

Normally, eScriptorium is used as an (assisted) annotation environment to obtain the transcription of documents. You would use eScriptorium:

  1. In a preparatory phase:
    • (1a) to produce training data, and
    • (1b) to elaborate (aka train) performant segmentation or transcription models;
  2. In a production phase, but only for relatively small corpora, to apply segmentation and transcription models and manually correct the results (in which case the size of the corpora must be compatible with the scale of what an individual or your assembled team can process);
  3. In a post-production phase, including for samples of a very large corpus, to easily visualize and control the result of the (large-scale) automatic prediction and potentially correct it (cf. n°2).

On the other hand, large-scale transcription campaigns should probably be led with Kraken directly in the command line (so only n°1 and n°3 require eScriptorium). Thibault has even produced a small Python library to design such campaigns (RTK, for Release the Krakens), which was recently used in a paper2 where a 38.5M-token corpus was produced. In some cases, n°1b even benefits from being performed outside of eScriptorium, since the application offers very limited control over Kraken's training parameters.

This has several consequences on the way you should consider storage on a server dedicated to eScriptorium. Duplicates of images are created on the server while they are being processed in the application, but they should always be considered as such: temporary duplicates while phase 1, 2 or 3 are under progress. They shouldn't be considered as if eScriptorium was 1) an archiving solution for transcription projects, 2) a querying interface to explore a corpus or even 3) a publication environment for a minimalistic digital edition. eScriptorium is only one brick --an early one even-- in the corresponding pipelines. Instead, the original image files should be stored somewhere else, in an adapted data warehouse (like Zenodo, Nakala, etc.), or published in digital libraries under the responsibility of their owner (like Internet Archive, Gallica, etc.).

What this means when designing a server to host eScriptorium is that its storage capacity should of course be big enough to store the temporary image files3 of active projects, while users are working on their annotation. However, this storage doesn't need to be expanded all the time, and it should also be OK to flush terminated projects: at that point, the images and their annotations should have been archived on more appropriate data warehouses by their creators, and it is their responsibility to do so.

Don't forget the RAM!

Not overlooking the RAM is very important when designing your server! But what is it used for? It's used for cache by the web application: it means that frequently accessed data, like web pages and images but also the content of the database, are temporarily loaded in live memory. Cache thus ensures that the requests sent by the users are served quickly. For example, if you don't have enough RAM (or enough cache), pages will load slowly, and if you have used eScriptorium before reading this post, you know how important it is to be able to load images fast enough.

RAM is also essential for inference and training because images and annotations are loaded in memory before being passed to the CPU or the GPU. If there isn't enough RAM, it will be detrimental to computation and will create a bottleneck. Having invested in GPUs and/or CPUs but not in enough RAM would be like having a horse pull a Ferrari: even if prediction and training could go fast on the processing units, they will be held back by the available live memory.

Modularity for the CREMMA infrastructure

The CREMMA infrastructure was originally designed by Thibault with a simple but essential principle in mind: modularity. Instead of thinking of an eScriptorium server as a monolithic block of hardware designed for front-end service, storage and intense computation, he suggested to break each of these blocks into individual servers connected together. CREMMA4 is thus made of at least three servers, as shown in the schema below:

  • CREMMA_FRONTEND, for the front-end, where the application is deployed and where the database is stored.
  • CREMMA_STORAGE, for storage, where all the images and models, as well as the backup of the database are stored on the long term. Currently, CREMMA_STORAGE has a storage capacity of 38Tb5 but we could easily add more disks if we find that it is necessary.
  • CREMMA_COMPUTE, where the two A100 GPUs I mentioned earlier are plugged and where the application task manager "sends" all the jobs, whether they are to be run on CPU (these tasks include segmentation and transcription prediction for example), or on GPU (training for the most part).

A model of the CREMMA infrastructure where three blocks (front-end, storage and compute) are connected together through an intranet 10Gb/s connection. For each block, one or two server(s) is presented along with their specification. Credits: Thibault Clérice and Alix Chagué. The full text of the specifications is accessible in a commentary in the source code of this page, just after this image.

As you can see on the schema, there will actually be a fourth server involved in the infrastructure: Traces-6, the server we currently use to deploy eScriptorium at Inria. Like CREMMA_COMPUTE, Traces-6 can be called by CREMMA_FRONTEND for computation tasks. In fact, this is where the modularity of the system is interesting: with such a set-up, it is possible to add more computation servers to the pool of GPUs reachable by CREMMA_FRONTEND without having to redesign the whole infrastructure. On their side, CREMMA_FRONTEND and CREMMA_STORAGE can be upgraded (to add more RAM or more storage) very easily.

This modularity also means that the GPUs remain free for other uses: for example, if we had to run maintenance on CREMMA_COMPUTE, we could simply cut it off from the infrastructure and let CREMMA_FRONTEND interact with Traces-6 only while we work on CREMMA_COMPUTE.

CREMMA_COMPUTE is equipped with two A100 graphic cards, and Traces-6 with two RTX 6000s. This doesn't mean that only 4 trainings can happen at once, though. Each of these GPUs offers between 24 and 40 Gb of RAM for intense computation. It's a lot. It's so much, actually, that training a Kraken model at max speed would rarely use more than 40% of this processing power. Virtualization is a nice trick to "break" the GPU down into smaller virtual GPUs (or vGPUs). What is broken down is the RAM capacity. We opted for the following virtualization set-up:

  • Each of the A100 graphic cards and their 40Gb of RAM are turned into 1 10Gb vGPU + 5 5Gb vGPUs (since 10+5x5=35, note that we must leave 5Gb out of the equation for the virtualization).
  • No virtualization is applied to Traces-6's RTX6000s.

How did we decide on these numbers? Thibault ran a series of small tests executing either segtrain or train and playing with two different parameters: the batch size6 and the floating-point precision7. He found that for training a recognition model with a batch size of 8 and a precision of either 32 or 16, less than 5 Gb of RAM on the GPU is enough. With a batch size of 1 and a precision of 32, it's even less than 1 Gb. To train a segmentation model, less than 10Gb is enough, and this type of training is rarer. Since our goal for the infrastructure is not to maximize the speed of a single training but to maximize the number of parallel training jobs running at decent speed, we decided that 10 vGPUs with 5Gb of RAM and 2 vGPUs with 10Gb of RAM were a good compromise. If we find that more GPU RAM is occasionally needed, we still have two times 24Gb with the RTX 6000s!

Should you build your own server?

We have spent all this time writing about how to build, how to spec out your server or your infrastructure, but let's talk about the elephant in the room: should you do it?

Well, it's all a matter of perspectives. We'd say it probably makes sense if:

  1. You are a very big organization, you have a lot of money available to you, a super-cluster (and possibly a well staffed IT services department), and you have a high demand;
  2. You are working on very sensitive data that can't be shared with the outside (e.g. medical reports);
  3. You are geographically far away from any other existing server, and face latency issues when you connect to potential welcoming servers;
  4. The servers that exist around you are reluctant to onboard you and the teams behind your request for a server of your own.

These four points are definitely valid. But we'd say that, if you are in another situation, sharing infrastructural costs probably makes way more sense. In our experience, building a server is long, tedious, requires special (and rare) skills8, and is costly (in terms of human resources as well!). Setting up a working server can take a really long time. For CREMMA, we ended up outsourcing part of the installation of the new infrastructure because we realized that we had neither the time nor the skills to set everything up ourselves. The cost of this installation by a third party? Between 8 and 12K€, and again, a little time and bandwidth on our end.

Next, you have the maintenance fees. You can outsource them, for a small bill from a company which would make sure that everything is installed on time, that updates work well, etc. Or you can do the maintenance yourself. But again, this comes with a cost: human time. A worker on the server goes down? You are in for a few hours. Some people crashed a third-party server by uploading too many IIIF images to your instance of eScriptorium? Well, then you will not only receive emails from these third parties (and this is completely normal), but also have to deal with your user base doing things that eScriptorium allows and that you may not (yet) be able to control or limit.

In the end, we would definitely recommend that, when this is possible, you first consider joining existing servers, including by offering quid pro quo by:

  1. Participating in covering the salary of people maintaining the server (through some kind of yearly fees for example);
  2. Providing some money to expand the existing infrastructure (to increase storage or computation, etc);
  3. In general, helping eScriptorium grow, discussing with the owners of the server you are joining and/or the eScriptorium team about what kind of new functionality should be added, and if you can contribute to fund these updates.

This final point is super important: sure, owning your own server sounds appealing, even if it is costly to put in place. However, developing eScriptorium also comes with expenses. Thus, participating in eScriptorium directly is -- we think -- also very beneficial and welcomed by the development team. Open source is free to use, free of charge, but it does not appear out of thin air: development costs money. And the more people participate in infrastructural costs (servers or software), the better the experience will be.


  1. If you don't know anything about local servers and are curious to learn more, you can check this page: https://www.freecodecamp.org/news/what-is-localhost/. Or you can also take a look at the corresponding entry in Wikipedia! 

  2. The full reference is: Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice, et al.. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. Computational Humanities Research (CHR 2023), Dec 2023, Paris, France. ⟨hal-04250657⟩ 

  3. By temporary, we don't mean that the image files are stored for a few hours only; on the contrary, they can stay on the disk for many years. We mean that it should be OK to consider that they can be erased whenever a user is done working on a corpus and has moved away from the transcription phase. 

  4. From now on, "CREMMA" means the server created through the CREMMA project. 

  5. Safety first! We have 38 Tb available, but there is actually a little more physically because we have redundancy and spare. We have 2 series of disks working with redundancy (RaidZ). In each series two disks are entirely dedicated to redundancy only, and one more is completely unused until something fails (it is used as a safety spare disk). While CREMMA_STORAGE, as we said before, is not used as a permanent storage solution, it needs to be a little bit safe for the user base. 

  6. To understand what the batch size corresponds to and why it is important, you can check this entry in the Stack Exchange forum: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

  7. To quote Kraken's documentation: "When using an Nvidia GPU, set the --precision option to 16 to use automatic mixed precision (AMP). This can provide significant speedup without any loss in accuracy." Kraken's default value for precision is 32. 

  8. It can be difficult to justify hiring a full-time or even part-time system administrator for a team, because it is a very specialized profile in high demand. For example, public organizations can rarely offer competitive salaries compared to the private sector. In addition, the workload for administrating a web server can be irregular, and it can be difficult to match system administration skills with the other needs of a team, making it even harder to offer a meaningful full-time job. 

016 - Text Recognition, Large Models and Expectations

Since the boom around ChatGPT almost a year ago, I've heard several people wondering if "tools like ChatGPT" were more efficient than HTR models trained with Kraken and the like. The glimmer of hope in their eyes was most likely lit by their own struggle to set up successful and/or efficient HTR campaigns with more traditional tools. The capacity of Large Language Models (LLMs) to reformulate a text1 or, more specifically, of Large Multimodal Models (LMMs) to generate text based on a visual input may indeed lead people to believe that HTR technologies built on CNNs are on the verge of being flipped upside down.2

Annika Rockenberger recently conducted a series of small experiments on the matter and wrote an interesting blog post about it. Let's summarize it!

She signed up for a premium subscription (25$/mo) to be able to chat with GPT4, which allows users to upload images. Then she submitted printed or handwritten documents she would normally transcribe with Transkribus and assessed the results. She found that GPT4 was fairly good on ancient print (German Fraktur) and that it was even able to follow transcription guidelines if provided with an example. However, on a letter in handwritten cursive, the model completely hallucinated the content and attempted a transcription in the wrong language. This didn't change when she provided more context about the document. Rockenberger concludes that there is potential for using ChatGPT for HTR, but that its capacity to scale up is completely uncertain, and that learning how to provide good prompts to get the appropriate results is a challenge. I would also add that, in the end, Rockenberger paid 25$ to get 10 lines of raw text, whereas with software like Transkribus or eScriptorium, she would also get a standard structured output.

So, in other words, after reading Rockenberger's post, one can conclude that GPT4 (or, better, similar free and open-source models) does have a potential for "quick and dirty-ish" OCR. However, I would argue that users tempted by this strategy might still miss an important point: even LMM-based tools will require a little bit of organization and precision from the users. This, I find, is often what is lacking in unsuccessful HTR campaigns. LMMs could generate a good output, but you will likely have to pay for it one way or another: with lower text recognition quality, with hallucinated text content, with an impoverished, non-structured output, with premium fees, etc.

Earlier this year, an article by Liu et al. (2023), "On the Hidden Mystery of OCR in Large Multimodal Models", explored almost exactly the same topic in a more comprehensive way: it presents an extensive survey of how well several Large Multimodal Models (LMMs) perform on "zero-shot" tasks.

Zero-shot refers to the act of requesting an output from an LLM or an LMM without having trained it for this particular task. It is very similar to Rockenberger's first attempt with GPT4, when she uploaded the image of a printed document and asked for its transcription. In such a case, she relied on the capacity of the model to transfer its knowledge to the specific task of text recognition, on a specific type of document (historical printed text).

Other terms are often associated with "zero-shot": "one-shot" and "few-shot". One-shot is equivalent to Rockenberger's second attempt, when she showed GPT4 an example of the output she expected for the first 10 lines of the document and requested that the model copy her strategy to generate the transcription of the next 10 lines. Few-shot would mean showing several pages and their expected outputs to the model before asking for the transcription of a new document.3
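To make these notions a bit more concrete, here is a purely schematic sketch of how the three prompting strategies differ for a transcription task. These are plain data structures meant to illustrate the shape of the prompts; they are not tied to any actual API.

```python
# Schematic prompts only: no real model or API is called here.

zero_shot = [
    {"role": "user",
     "content": ["<image of the document>",
                 "Please transcribe this document."]},
]

one_shot = [
    {"role": "user",
     "content": ["<image of the document>",
                 "Here is my transcription of the first 10 lines: <example transcription>",
                 "Transcribe the next 10 lines following the same conventions."]},
]

few_shot = [
    {"role": "user",
     "content": ["<image of page A>", "<expected transcription of page A>",
                 "<image of page B>", "<expected transcription of page B>",
                 "<image of a new page>",
                 "Transcribe this new page following the same conventions."]},
]
```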

The paper focused on currently available LMMs representing five different approaches to training such models: BLIP-2, OpenFlamingo, LLaVa, MiniGPT4 and mPLUG-Owl (see the results table below).

They evaluated the models on 4 tasks: text recognition, text-based visual question answering, key information extraction and handwritten mathematical expression recognition. Here are a few examples of what these tasks entail, as illustrated in the original article (on the images, P stands for Prediction and GT for Ground Truth):

Task                                              Example (image from the original article)
Text Recognition                                  Examples of failed Text Recognition
Visual Question Answering                         Examples of failed Visual Question Answering
Key Information Extraction                        Examples of failed Key Information Extraction
Handwritten Mathematical Expression Recognition   Examples of failed Handwritten Mathematical Expression Recognition

For each task, they used several datasets presenting different challenges. For each of these datasets and tasks, they retrieved the scores of the state of the art (SOTA) for supervised methods and used them as a baseline. For example, for text recognition on the IAM dataset, the SOTA method, AttentionHTR4, reaches a word accuracy of 91.24%.5 In comparison, Liu et al. provide the following scores for the tested LMMs on this dataset:

Tested LMM          Word accuracy on IAM (%)
BLIP-2 OPT6.7b      38.00
BLIP-2 FlanT5XXL    40.50
OpenFlamingo        45.53
LLaVa               50.40
MiniGPT4            28.90
mPLUG-Owl           42.53
------------------  ------------------------
Supervised SOTA     91.24

The illustrations provided in the article all show failed attempts, but this matches the overall impression conveyed by the results of the experiments: when prompted with zero-shot tasks, LMMs are largely outperformed by state-of-the-art supervised methods, as the scores for text recognition on the IAM dataset show. The only exception is BLIP-2 on a text recognition task over a more challenging dataset of artistic text (WordArt). The authors see this as a sign that LMMs have a promising potential for visually complex texts.

A very important section of their paper is their remarks on the relationship between LMMs and semantics. When submitting images of non-words to the LMMs, they find that the models systematically over-correct the prediction and suggest real words as an answer. Traditional text recognition approaches, on the other hand, are much less sensitive to how plausible the words to be recognized are. Similarly, this reliance on semantics interferes with the LMMs' output: they recognize common words more easily and make up additional letters ("choco" is read as "chocolate"). Lastly, LMMs are insensitive to word length: they are unable to count how many letters appear in the image of a word. These results are similar to what Rockenberger experienced with the handwritten letter: the model hallucinated words to compose a semantically plausible letter, but with the wrong date, the wrong names and the wrong language.

Liu et al. conclude their paper by reminding us that they only tested the capacities of the models with zero-shot prompts, whereas there are already successful attempts at fine-tuning LLMs and LMMs on specialized tasks, such as medical prediction. In fact, I think such attempts already exist in the context of HTR as well: it seems to be the ambition of a model like Transkribus' Text Titan, released at the beginning of the summer, which is based on a Transformer coupled with an LLM. Unfortunately, I wasn't able to find more information on this model aside from the community-oriented communications released by Transkribus on their website (here and here).


  1. Instead of a multimodal approach, Salvatore Spina explored the possibility of using an LLM-based tool like ChatGPT to post-process the results of HTR and correct the text. See: Spina, S. (2023). Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT (arXiv:2308.02044). arXiv. arXiv.2308.02044

  2. Multimodality is presented by some researchers of the Digital Humanities community as a real epistemological turn for the field. See for example: Smits, T., & Wevers, M. (2023). A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections. Digital Scholarship in the Humanities, fqad008. doi: 10.1093/llc/fqad008 ; or Impett, L., & Offert, F. (2023). There Is a Digital Art History (arXiv:2308.07464). arXiv. arXiv.2308.07464

  3. There are a few videos offering more or less detailed explanations of these expressions in the context of prompting an LLM. However, this terminology is not specific to LLMs; it is also used, for example, in the context of classification or other NLP tasks. 

  4. Kass, D., & Vats, E. (2022). AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks (arXiv:2201.09390). arXiv. arXiv.2201.09390

  5. In this case, word accuracy is used as the baseline to compare the different approaches. However, in general, it is not a good idea to only take word accuracy into account to understand a model's performance in real life. This is something I discussed in this post. 

015 - Block post and comprehensive Exam

When I created this blog last year, I wanted to post regularly on it, something like once a month or once every other month. I didn't want to put pressure on myself to write, but I also wanted to make sure that this blog would stay alive. I often have ideas for topics for a post, but then, when the time comes to write, I blank out. It's not exactly that I don't know where to start; it's just that I sometimes can't figure out what the message is that I want to convey. Like, if I had to summarize my blog post in two lines, what would the take-away be? I get stuck when I cannot find an answer, but maybe I shouldn't worry that much about it. It's my blog after all, and maybe the message will come by the time I'm done writing.

So, without further ado, let's dive in: I was super excited this Summer after passing my comprehensive exam, and I really wanted to write a post about it. I had a really packed Spring and beginning of Summer, between going back to Montreal, teaching a class there, attending a Summer school, and going to the DH2023 conference in Austria, where I presented a short paper and a long paper and organized a workshop (big up to Thibault, who was by my side through all these Austrian adventures). And all of it culminated with the comprehensive exam in the middle of August. I really wanted to share how that went.

But then, vacations, working on new deadlines, more vacations, more deadlines... And now it's already November and I don't know anymore what it was that I wanted to share about that exam. Aside from the fact that I passed it and that it's a pretty big milestone.

The comprehensive examination, which is called "Examen de synthèse" in French, is not something common in France. In France, we now have a sort of yearly evaluation called the "Comité de Suivi Individuel" (or CSI), which is not an academic evaluation but more of a check-up with your supervisors and a committee1 in charge of making sure that everything is alright. The reason I bring it up alongside the Examen de Synthèse is that I also had my first CSI this Summer (at the very end of June). In France, you have to get a positive evaluation from the CSI in order to enroll in a new year of doctoral studies. Each year. But actually, the CSI and the Examen de Synthèse are not really that comparable.

The Examen de Synthèse is a "real" examination and it happens only once during your doctoral curriculum. In my program at the University of Montréal, in 2023, it consisted of several phases.

First of all, there is a phase dedicated to the composition of the jury. I had the pleasure of being examined not only by my three supervisors (Laurent Romary, Emmanuel Chateau-Dutier and Michael Sinatra), but also by Marcello Vitali Rosati, from the University of Montréal, who acted as president, and Maxime Gohier, from the University of Quebec in Rimouski. My only regret is that I was not able to achieve better gender balance in my jury. This is something I really hope to fix for my defense, but I will probably have other occasions to discuss this topic in the future.

So, once the jury is composed, and once a calendar has been agreed on (I think that was actually the most stressful part for me because of all the other things I had going on this Summer), a countdown begins. First, I had to turn in three documents:

  • a 12-15 page-long essay on my research project;
  • a 30-reference long bibliography on the field of the Digital Humanities; and
  • a short presentation of a proposed "practical" analysis.

Then, a week later, the jury sent me a question.2 I was given one week (exactly 168 hours) to think about this question and write a response in the form of a 10-to-15-page essay. The jury then had between one and two weeks to read the response before the oral examination took place (on Zoom).

The oral examination has some similarities with a PhD defense. It started with a 20-minute presentation in which I summarized my research project (10 minutes) and presented a technical analysis (10 minutes). I chose to focus my technical presentation on an experiment I have been conducting and on which I hope to communicate more in the near future. Then, after my presentation, there were two rounds of questions about my research project, my experiment or the answer I formulated in my essay.3

I am very happy that such an examination exists in the North American program. It may seem like a lot of stress (and it is), but I found it to be a very good milestone for making progress on the formalization of a research project. The oral examination is a great opportunity to present a project to people who don't necessarily know what you have been up to, and it's a really great occasion to get feedback.

For example, in the case of my program, the question sent by the jury is meant as a way to get you to think about a topic that your research proposal does not tackle enough, or as an invitation to consider new angles. You're not expected to turn in the perfect answer, of course, with barely a week to write it. But it forces you to form an opinion and explore possible hypotheses, and it may later turn into a whole chapter of your thesis.

The comprehensive exam is a pass/no pass type of examination. There is no grade and if you fail, you can take it a second time. Like I said at the beginning of this post, I passed. Therefore, starting from Fall 2023, I am able to enroll as an "en rédaction" student (writing status), which has several consequences. Some seem very symbolic: for example, in English, I can now call myself a PhD candidate instead of a PhD student. But others not so much: tuition fees for this new status are much lower than when enrolling as a full-time student, dropping from 1,440$CA/trimester to 512$CA/trimester, and I believe this status officially gives me the right to teach at the graduate level.

The comprehensive exam also marks the end of the phase during which I had to take courses. Now, with this new status, I am invited to focus solely on writing my thesis, which opens up a whole new chapter for my PhD curriculum.4


  1. I want to take this occasion to also thank Ariane Pinche and Joana Casenave, who were willing to be the members of my committee for the CSI, for their precious feedback! :) 

  2. The question was the following: "An important tension appears in your research project: the tension between the specificity of each project's particular needs and the desire, and the necessity, to produce generalizable approaches that can be used across several projects. Drawing on your bibliography, and focusing in particular on the case of HTR, could you analyze this tension, raising in particular the question of the literacy required (notably in data management) to be able to customize computational approaches as complex as HTR technologies?"

  3. I want to publish the documents I created for the comprehensive exam on my blog, but I need to find the best way to do it. I'll post an announcement when they are available. 

  4. Thank you Jennifer for this wonderful pun! ;) 

014 - RT(F)M for the Peraire Experiment

Turns out, there is more to say on last week's experiments on the Peraire dataset! And I found out while I was working on a completely different dataset. Let me explain!

This morning, I helped my colleague train a Kraken transcription model for Greek manuscripts. They gave me the ground truth and I set up and executed the training from the command line. It gave me an opportunity to try fine-tuning a model like CREMMA Medieval, instead of only training from scratch. CREMMA Medieval was trained on manuscripts written in Latin, whereas the Greek manuscripts were written only, well, in Ancient Greek. I didn't want the resulting model to add Latin letters to the transcription when applied to other Greek documents, so I used Kraken's option to let the model forget previously learned characters and force it to only remember the characters contained in the new training data. This option is called --resize (check the documentation here).

When I fine-tune a model, I usually follow Kraken's recommendations and keep both the previously learned characters and the new ones coming from the new set of ground truth. When, this morning, I checked which keyword to use to keep only the characters from the new dataset, I realized that I hadn't correctly set up the training on Peraire last week: I had set it to only keep the new characters!

Up until Kraken v. 4.3.10, --resize could take the keywords both or add. The ambiguity of these keywords has been discussed in the past, which is why, starting with Kraken v. 4.3.10, they respectively became new and union.

Let's quote the manual:

There are two modes dealing with mismatching alphabets, add and both. add resizes the output layer and codec of the loaded model to include all characters in the new training set without removing any characters. both will make the resulting model an exact match with the new training set by both removing unused characters from the model and adding new ones.

I fell for this trap of ambiguity and used both instead of add, thinking both meant I was keeping both character sets. (Again, this is the very reason why the keywords were recently changed.)
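To make the difference concrete, here is a toy illustration of what the two strategies do to the model's character inventory (its codec). This is only a schematic view of the behavior described in the manual, not Kraken's actual code:

```python
# Toy illustration of the two --resize strategies (not Kraken's implementation).
# The base model knows a Latin alphabet; the new ground truth only contains Greek.
base_codec = set("abcdefghijklmnopqrstuvwxyzéàç ")
new_ground_truth = set("αβγδεζηθικλμνξο ")

# 'add' (renamed 'union' since Kraken 4.3.10): keep everything the base model
# already knows and add the characters found in the new training set.
union_codec = base_codec | new_ground_truth

# 'both' (renamed 'new' since Kraken 4.3.10): make the model match the new
# training set exactly, dropping base characters absent from the new data.
new_codec = new_ground_truth

print(sorted(base_codec - new_codec))   # characters forgotten with 'both'/'new'
print(len(union_codec), len(new_codec)) # size of the codec in each mode
```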

Side note: you should really read last week's post to fully understand the rest of this post!

At the end of my post last week, I wrote:

peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

But how could I be sure that it was not actually due to the fact that the model had unlearned some precious characters?

The only way to know, I thought, was to re-train the models! I used this opportunity to also train the models from scratch because I was curious to see how much noise/improvement was brought by the base model.

I tried 4 types of models and, like last week, used CERberus 🐶🐶🐶 to measure the character error rates on the predictions made on the test sets:

  1. Models trained "from scratch"
  2. A model not trained on any data coming from the Peraire dataset (aka Manu McFrench)
  3. Models obtained from finetuning Manu McFrench using the add resize mode
  4. Models obtained from finetuning Manu McFrench using the both resize mode

For each model trained on the Peraire dataset, I used 3 compositions:

  1. the full dataset ("ALL")
  2. only data coming from the B series ("B")
  3. only data coming from the D series ("D")

I used the same composition system for the test sets.

Here are my results in the form of a table:

a table of the scores obtained with the different train set, test set and resize configurations

Fortunately, it seems that my previous interpretation is not fully contradicted by the results I obtained with this second series of trainings. Let's focus on two observations:

  1. Whenever a model is trained only on the D series and tested only on the B series, it appears to be completely incapable of predicting anything but gibberish, losing between 32 and 35 points of accuracy. It confirms that the appearance of the documents from the two series is too different. On the other hand, when the model is fine-tuned on the B series only, it maintains a fairly good accuracy when applied to the D series, whichever resize mode is used. I think it confirms that the B series is enough for the model to learn some sort of formal features of Peraire's handwriting, which it can then transfer to documents written with a different writing instrument.

  2. What is very interesting is the difference between the models trained on the whole dataset and tested on the B series: when we use the both resize mode (meaning we only keep the characters from the new dataset), the model is very good. On the contrary, the performance of the model trained with the add resize mode (meaning we keep the output layer and the codec from the base model and add the new characters) is as bad as that of a model trained only on the D series.

In my previous post, I wrote:

peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.

However, in light of my experiment with the resize option, I think this is not correct. Instead, it appears that resetting the output layer by accidentally using both (or new) allowed the model to better take into account the data from the B series (pencil). Contrary to what I observed last week, the model trained on the whole dataset, but this time with the add (or union) resize mode, doesn't benefit from seeing more data compared to the model trained only on the B series.

My understanding is that keeping the output layer from the base model with add (or union) probably drowns the specificity of the pencil-written documents in a base knowledge tailored to handle documents with high contrast (like the ones in the D series and in Manu McFrench's training set). Or, to put it differently, when we use both (or new), more attention is given to the pencil-written documents, meaning that the model actually gets better at handling this category of data.

I am extremely curious to see how I can investigate this further, or if any of you, readers, would understand these results differently!

013 - The Peraire experiment

WARNING: in my next post, I nuance the conclusions drawn in this post, because of a parameter I didn't correctly set during the training of the models described below. You should really read it after reading this post, to get the full picture!

As a small side project during my PhD, I have been sharing my expertise (and a bit of my labor) with the members of the DIM SPE-VLP project. The acronym stands for "Sauver le patrimoine espérantiste : le voyage de Lucien Péraire (1928-1932)" ("Saving the Esperantist heritage: Lucien Péraire's journey, 1928-1932"). The project revolves around the digitization, transcription and edition/valorization of Lucien Peraire's archives. He was a French citizen who, in the late 1920s, travelled across the European and Asian continents, mostly by bike, using Esperanto to communicate. He kept a diary during his journey (and later published a book about his adventures). His notes are written both in French and in Esperanto, and in some documents he also used stenography.

My contribution to the project has mostly consisted in helping develop transcription models for the French diaries (although I'm also interested in the shorthand and the Esperanto). This meant both helping with the production of ground truth and training Kraken models. This post will briefly explain how the ground truth was created and published, and present the models that were trained with it.

Peraire's notebooks are organized in different series, and each series is divided into ensembles grouping the pages of a notebook. Each ensemble is named after the countries visited while the notebook was in use. For example, notebook 11 in the B series forms one ensemble and covers a part of Peraire's travels in Japan. There are 31 notebooks in the B series. The notebooks of this series are written with a blue pencil on (low-quality) school paper. On some pages, the pencil is very faded, which makes it hard to read the text, let alone to run a successful segmentation task on the image. The D series, on the other hand, gathers notes and comments on the diaries, written at the end of the 1960s. This time the handwriting is much easier to read because Peraire mostly used a blue or black ball-point pen. There are 9 ensembles in this series.

two extracts of Peraire's notebooks side by side, on the left the image is taken from the B series, on the right the image is taken from the D series.

One aspect that I find particularly interesting with this dataset is that we have a case where the handwriting is similar but the writing tool is different. It means that it is possible to explore how the writing tools and/or writing supports affect the efficiency of a transcription model. On top of that, all the documents were digitized under the same (good) conditions and by the same people.

Segmenting, transcribing, aligning and publishing

The first version of the dataset was solely focused on the B series. I selected 1 random page from each ensemble (avoiding the first page each time) to compose a train set of 33 files1. On top of that, I selected 4 additional pages from B3, B5, B12 and B18 to compose a fixed test set which would never be used as training data.
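For the curious, the selection logic amounted to something like the sketch below; the directory layout and file names are made up, only the idea of picking one random page per ensemble while skipping the first page reflects the procedure described above.

```python
# Hypothetical sketch of the page sampling (paths and file names are invented).
import random
from pathlib import Path

random.seed(0)  # fix the seed so the selection can be reproduced
train_pages = []
for ensemble in sorted(Path("peraire/serie_B").iterdir()):  # e.g. B1, B2, ..., B31
    pages = sorted(ensemble.glob("*.jpg"))
    train_pages.append(random.choice(pages[1:]))  # never take the first page

print(len(train_pages), "pages selected for the train set")
```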

I pre-segmented the images with Kraken's default model before correcting the result manually. At this point, I also applied the segmOnto ontology for the lines and regions2. Because of the fading ink, some words could not be transcribed. In order to avoid complicating the transcription rules, I decided to simply segment out the passages that couldn't be read. On the one hand it simplifies the transcription, but on the other hand, it means that a small portion of my segmented documents cannot be re-used by others to train a segmentation model. Since we were not training a segmentation model, it was an easy decision.

screenshot showing the segmentation and the transcription panels from eScriptorium where we can see that some lines are broken down into several segments and that some segments were left blank

More recently, it was decided to augment the dataset with examples from the D series because the model trained on the B series was not good enough. This time, Gilles Pérez, a member of the project, took charge of the transcription. I recommended creating a new sample of 30 to 40 images, so he randomly selected a series of 4 continuous pages from each ensemble. The transcription of the corresponding 36 pages was sent to me as a Word document. Therefore, on top of taking care of the segmentation of the images, I also went through an alignment phase during which I verified the order of the lines and copy-pasted the transcription. It took longer than I expected, but it allowed me to make the transcription consistent with the rules I had followed when creating the first set. I also picked 4 of the 36 pages to add to the test set.

The dataset is versioned and published following the principles and tools we developed within the framework of HTR-United. I also added illustrated segmentation and transcription guidelines.

Testing different dataset configurations to train transcription models

As I mentioned before, the goal of these datasets was to create transcription models. Taking advantage of the recent update of the dataset, I tried different training scenarios.

I never trained a model from scratch because the dataset is too small to get any sort of usable model. Instead, I used Manu McFrench as a base model and fine-tuned it with the Peraire dataset. (We were actually able to use Peraire as an example during the DH2023 conference3 earlier this month to show the usefulness of having this kind of base model.) I tested fine-tuning on the B series only, on the D series only, and on both the B and the D series. Then I used a B-series-only test set, a D-series-only test set and the full test set to see how the models performed.

Since I wanted to try it after discovering it during DH2023, I used CERberus 🐶🐶🐶 (I talked about it in my last post) to measure the accuracy of the models on the test sets listed above.

Like KaMI, CERberus takes two categories of text input: the reference (aka the ground truth) and the prediction (i.e. the hypothesis made by the model). In order to get the predictions, I loaded my models into eScriptorium, along with the images and transcriptions of the test set, before applying each model to the documents. This way, all the transcriptions are predicted from the same segmentation, which comes from the ground truth.
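For readers who have never computed it by hand, here is a minimal sketch of what a character error rate boils down to: the edit distance between the reference and the prediction, divided by the length of the reference. This is only an illustration of the metric, not CERberus's actual code.

```python
# Minimal illustration of a CER computation (not CERberus's implementation).

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, prediction: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    return levenshtein(reference, prediction) / max(len(reference), 1)

# Two substitutions over 27 reference characters: CER ≈ 0.074 (7.4%).
print(cer("le voyage de Lucien Peraire", "le voiage de Lucien Péraire"))
```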

Here are the results:

  • Manu McFrench, before fine-tuning, gets a CER of 26.16% when tested on the whole test set, and a score of 27.19% on the documents from the B series, 25.29% on the D series.
  • peraire_both, trained on the B and the D series, gets a CER of 4.63% when tested on the whole test set, but a score of 6.41% on the documents from the B series and 3.54% on the D series.
  • peraire_B, trained only on the B series, gets a CER of 8.72% on the whole test set, but a score of 7.12% on test-B and 9.67% on test-D.
  • peraire_D, trained only on the D series, gets a CER of 16.38% on the whole test set, but this is because of the enormous discrepancy between its scores on each sub test set: it skyrockets to a CER of 38.53% on test-B while going as low as 3.65% on test-D.

All of this makes sense, though.

  1. Manu McFrench could not be used without fine-tuning; its error rate on both series is too high.
  2. peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.
  3. peraire_B, which was trained on the more difficult dataset, seems to use the knowledge inherited from Manu McFrench and to have learned some formal features of Peraire's handwriting, since it is able to maintain a fairly low CER on the D series (it gains 16 points of accuracy compared to Manu McFrench).
  4. peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

What happens with peraire_D is very interesting because it confirms that it is useful to compose a train set with examples of more difficult documents instead of only showing the ones that are easy to read! Now, the nice thing is that I will soon be working on a little experiment with my colleague Hugo Scheithauer where we will be able to measure the impact of the contrast between the ink and the paper. Stay tuned!

EDIT #1: I added the scores obtained by Manu McFrench alone.

EDIT #2: I added a disclaimer at the beginning of the post.


  1. I used 2 images from B2 because one of them was extremely faded and I wanted to include some of these extreme cases in the dataset, and 2 images from B30 because it consisted of shorter lines (a table of contents), which I found interesting to include. 

  2. As described in the documents, I only used the "InterlinearLine" and "DefaultLine" for the lines, and the "MainZone" and "NumberingZone" for the regions. 

  3. See the submission and the slides on HAL: https://inria.hal.science/hal-04094241