
022 - McCATMuS #5 - Training models

Last week, I visited Rimouski in the Bas-Saint-Laurent region of Québec, along the south-eastern bank of the St. Lawrence River. I was invited to contribute to discussions around the Nouvelle-France Numérique project, and I took this opportunity to present HTR-United and CATMuS, as well as preliminary results on training a McCATMuS model. In preparation for this presentation, I ran a series of tests on the first two models I trained. Today, this blog post gives me a space to discuss these tests and their results in more detail.

The Kraken McCATMuS models were not trained directly on the HuggingFace dataset I introduced in my previous post, but rather on ARROW files created from the same ALTO XML files used to build the HuggingFace dataset. At the beginning of September, I wrote a Python script which reproduces the split of the ALTO XML files into the train, validation and test sets, and which applies the same line filtering and modifications I previously presented. Instead of generating the PARQUET files for HuggingFace, it simply creates alternative .catmus_arrow.xml files and three listings of these files, ready to be served to a ketos compile command1.
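To give an idea of what this listing step looks like, here is a minimal sketch; the paths and the split function are hypothetical stand-ins, the actual script reproduces the exact HuggingFace split and line filtering.

from pathlib import Path
import hashlib

def assign_split(path: Path) -> str:
    # stand-in for the real logic: a deterministic, hash-based 90/5/5 partition;
    # the actual script reproduces the HuggingFace train/validation/test split
    bucket = int(hashlib.md5(path.name.encode()).hexdigest(), 16) % 100
    return "train" if bucket < 90 else "val" if bucket < 95 else "test"

splits = {"train": [], "val": [], "test": []}
for xml_file in sorted(Path("./datasets").rglob("*.catmus_arrow.xml")):
    splits[assign_split(xml_file)].append(str(xml_file))

# one listing per split, each ready to be piped to ketos compile (see note 1)
for split_name, paths in splits.items():
    Path(f"list_of_paths_{split_name}.txt").write_text("\n".join(paths) + "\n")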

I used Kraken 4.3.13 to train the models on Inria's computation server, because I've had dependency issues with Kraken 5 and haven't fixed them yet. The first model I trained strictly followed the train/validation split thanks to the --fixed-splits option. After 60 epochs, the model plateaued at a character accuracy of 79.9%. When applied to the test set, this accuracy remained at 78.06%, a drop of barely two points.

I trained a second model using the same parameters2 but without the --fixed-splits option, allowing Kraken to shuffle the train and validation sets into a 90/10 split (the test set was left untouched, however). This time, the training lasted 157 epochs before stopping, with the best model reaching an accuracy of 92.8% on the validation set. When applied to the test set, however, the model lost 7 points of accuracy (85.24%).

Learning curve (character and word accuracies) for the model trained on the fixed, "feature"-based split between train and validation.
Learning curve (character and word accuracies) for the model trained on the random split between train and validation.

Although disappointing, this was consistent with the observations made when training the CATMuS Medieval model:

As anticipated, the "General" split exhibits lower CER, given the absence of out-of-domain documents, whereas the "Feature"-based split surpasses 10%. This higher score presents an intriguing challenge for developing more domain-specific models that consider factors such as script type and language. (from Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al.. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece. ⟨hal-04453952⟩ p. 15)

So the drop in accuracy observed on the test set is, as suggested in Clérice et al. (2024), likely due to the fact that with a fixed split, the model is both validated and tested against out-of-domain hands and documents (although the documents differ between the two sets). The model trained with a random split, on the other hand, is validated against known hands and documents, but tested on out-of-domain examples.

The test set contains transcriptions of printed, typewritten and handwritten texts, covering all centuries. Limiting ourselves to a single accuracy score computed on the whole test set would tell us very little about the model's capacities and limitations. This is why I divided the test set into several smaller test sets based on the century of the documents and/or on the main type of writing they contain. For documents spanning several centuries, I used the most represented century.
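Concretely, the sub-test sets can be derived from a simple metadata table. Here is a sketch of the idea, assuming a hypothetical test_set_metadata.csv with one row per test document and century and writing_type columns:

import pandas as pd

test_meta = pd.read_csv("test_set_metadata.csv")  # columns: path, century, writing_type

subsets = {}
for writing_type, group in test_meta.groupby("writing_type"):
    subsets[writing_type] = group["path"].tolist()
for century, group in test_meta.groupby("century"):
    subsets[f"century {century}"] = group["path"].tolist()
handwritten = test_meta[test_meta["writing_type"] == "handwritten"]
for century, group in handwritten.groupby("century"):
    subsets[f"(HW) century {century}"] = group["path"].tolist()

# each listing can then be evaluated separately (with ketos test, for example)
for name, paths in subsets.items():
    print(name, len(paths))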

I only used the McCATMuS model trained on the random split for these tests, because the accuracy of the other one was too low for the results to be meaningful. Rather than testing McCATMuS alone, I also ran Manu McFrench V3 and McFondue on the McCATMuS test set: they are two generic models trained on similar data (although with no, or different, normalization approaches).

| Test set | McCATMuS | Manu McFrench V3 | McFondue |
| --- | --- | --- | --- |
| All | 85.24 | 91.17 | 76.12 |
| Handwritten | 78.72 | 89.40 | 75.17 |
| Print | 96.37 | 94.15 | 78.30 |
| Typewritten | 90.93 | 92.69 | 58.13 |
| 17th cent. | 87.27 | 86.39 | 72.81 |
| 18th cent. | 88.65 | 94.21 | 81.64 |
| 19th cent. | 79.81 | 93.70 | 75.46 |
| 20th cent. | 74.92 | 86.52 | 56.74 |
| 21st cent. | 73.86 | 90.20 | 68.04 |
| (HW) 17th cent. | 58.69 | 64.83 | 64.26 |
| (HW) 18th cent. | 85.38 | 93.35 | 80.47 |
| (HW) 19th cent. | 79.81 | 93.70 | 75.46 |
| (HW) 20th cent. | 63.02 | 82.23 | 55.89 |
| (HW) 21st cent. | 73.86 | 90.20 | 68.04 |

I was initially surprised by the consistent margin Manu McFrench had over McCATMuS, considering it was trained on less data (73.9K + 8.8K lines, against 106K + 5.8K lines) which had not been harmonized to follow the same transcription rules. However, these scores are actually biased in favor of Manu McFrench, because several of the documents included in the McCATMuS test set were also used in Manu McFrench's train set. Even though this is not true for all documents, it concerns almost half of the test set. The same might be true for McFondue, but this model scores higher than McCATMuS in only one instance (handwritten documents from the 17th century). Creating a new test set, with documents that are not present in any of the train sets but follow the CATMuS guidelines, would be a good way to confirm this bias.

Additionally, I detected an issue in one of the datasets used in the test set: FoNDUE_Wolfflin_Fotosammlung contains some lines with faulty transcriptions, resulting from uncorrected automatic text recognition, which most certainly skew the evaluation of all three models.

A couple of examples of the faulty transcriptions, along with the CER they generate when compared to what a correct transcription would be (the CER is computed with CERberus):

| Line image | Faulty transcription | Correct transcription | Resulting CER |
| --- | --- | --- | --- |
| text line image reading, in print, "COLLECTION HANFSTAENGL LONDON" | "CSTITHER, KIESERMAEAER AogS." | "COLLECTION HANFSTAENGL LONDON" | 89.29 |
| text line image reading, in print, "NATIONAL GALLERY" | "PEcLioL." | "NATIONAL GALLERY" | 175.0 |

I plan to manually check this dataset and update the McCATMuS dataset accordingly. I don't know yet how many lines are affected.

The better accuracy of the Manu McFrench model is not just a product of the biases in the test set. I had the opportunity to apply both models to two documents, one from the 17th century and one from the 20th century. In both cases, Manu McFrench's transcription seemed more likely to be correct than McCATMuS's. This led me to compare the training parameters used for both models and to start a third training experiment using Manu McFrench's parameters. In this case, the batch size is reduced to 16 (as opposed to 32) and the Unicode normalization follows NFKD instead of NFD.
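To make the normalization difference concrete, here is a quick illustration (standard library only) of what NFD and NFKD do to a few characters: NFD only applies canonical decomposition, while NFKD also applies compatibility decomposition, which changes the set of characters the model has to learn.

import unicodedata

for sample in ["é", "ﬁ", "½"]:
    nfd = unicodedata.normalize("NFD", sample)
    nfkd = unicodedata.normalize("NFKD", sample)
    print(repr(sample), "NFD:", repr(nfd), "NFKD:", repr(nfkd))

# "é" is decomposed by both ("e" + combining acute accent), but "ﬁ" -> "fi"
# and "½" -> "1⁄2" only under NFKD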

If the results of this third training are consistent with the previous experiments, it will be interesting to see whether adding more data to the training set improves the results. I also have yet to test the model in a fine-tuning scenario.

As said at the beginning of this post, these results are preliminary, so I hope to have more to share in the coming weeks.


  1. The command looks like this: cat "./list_of_paths.txt" | xargs -d "\n" ketos compile -o "./binary_dataset.arrow" --random-split .0 .0 1.0 -f alto. 

  2. The configuration of Kraken for training these two models relies on the default network architecture, NFD Unicode normalization, a learning rate of 0.0001 (1e-4), a batch size of 32, a padding of 16 (default value), and applies augmentation (--augment). The --fixed-splits option is used for the first model. Following Kraken's default behavior, the training stops when the validation loss does not decrease for 10 epochs (early stopping); this prevents the model from overfitting, which is confirmed by the accuracy scores of the intermediary models on the test set (orange line on the graphs). The training is done on a GPU. 

021 - McCATMuS #4 - Cleaning data, collection metadata

Preparing the data for CATMuS would certainly have taken much more time had I not been able to benefit from Thibault Clérice's experience with CATMuS Medieval. Not only was I able to build on the workflow he set up, but I also relied heavily on his scripts to parse and build the final dataset into PARQUET files that were pushed to HuggingFace. Most of these steps are described in Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece, which will be presented at the ICDAR conference in Athens in a few days.

For McCATMuS, I started by downloading all the datasets (keeping track of the official releases), then I manually reorganized them so that the transcriptions and images were always under {dataset_repo}/data/{sub_folder}, which made later manipulation easier. Based on the notes I took while filtering the datasets, and after generating a character table for each dataset with Chocomufin, I created several conversion tables to harmonize the transcriptions. The conversions are a mix of single-character or multi-character replacements ([ and [[?]]) and more or less sophisticated replacements based on regular expressions (#r#«).1

Here is a sample of the Chocomufin conversion table used for the LECTAUREP datasets. If a character is replaced by itself, it remains unchanged in the dataset; replacing it makes it possible either to remove a character from the dataset (the ¥) or to harmonize its transcription with the CATMuS guidelines (see œ and ° for example). A rough sketch of how such a table is applied follows the sample.

char,name,replacement,codepoint,mufidecode,order
#r# «,Repl extra space before LEFT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00AB,,0
#r# »,Repl extra space before RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK,"""",00BB,,0
[[?]],replace [[?]] with ⟦⟧,⟦⟧,,,0
[?],replace [?] with ⟦⟧,⟦⟧,,,0
),RIGHT PARENTHESIS,),0029,),
m,LATIN SMALL LETTER M,m,006D,m,
É,LATIN CAPITAL LETTER E WITH ACUTE,É,00C9,E,
a,LATIN SMALL LETTER A,a,0061,a,
",",COMMA,",",002C,",",
e,LATIN SMALL LETTER E,e,0065,e,
^,CIRCUMFLEX ACCENT,^,005E,^,
œ,LATIN SMALL LIGATURE OE,oe,0153,oe,
̂,COMBINING CIRCUMFLEX ACCENT,̂,0302,,
W,LATIN CAPITAL LETTER W,W,0057,W,
°,DEGREE SIGN,^o,00B0,*,
¥,YEN SIGN,,00A5,,
½,VULGAR FRACTION ONE HALF,1/2,00BD,0.5,
h,LATIN SMALL LETTER H,h,0068,h,
r,LATIN SMALL LETTER R,r,0072,r,
æ,LATIN SMALL LETTER AE,ae,00E6,ae,
ȼ,LATIN SMALL LETTER C WITH STROKE,c,023C,c,
∟,RIGHT ANGLE,,221F,[UNKNOWN],
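As announced above, here is a rough, simplified sketch of how such a table can be applied to a line of text. It is only meant to illustrate the principle (literal replacements, removals, and #r#-prefixed regular expressions), not to reproduce what chocomufin convert actually does:

import csv
import re

def apply_table(text: str, table_path: str) -> str:
    # illustrative only: apply each rule of the conversion table in file order
    with open(table_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            char, replacement = row["char"], row["replacement"]
            if char == replacement:
                continue  # the character is kept as is
            if char.startswith("#r#"):
                text = re.sub(char[3:], replacement, text)  # regex rule
            else:
                text = text.replace(char, replacement)      # literal rule (or removal)
    return text

# e.g., with the LECTAUREP table above saved as "table_lectaurep.csv":
# apply_table("N° 3 : une sœur", "table_lectaurep.csv") would return "N^o 3 : une soeur"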

It wasn't possible to use a single conversion table for all the datasets because some had different transcription approaches. While replacing ¬ with - could, in principle, be applied to every dataset, normalizing the way corrections and uncertainties were transcribed was another story. For example, in some of the CREMMA datasets, >< is used to signal a crossed-out word, while in other datasets <> is used. So replacing > with ⟦ and < with ⟧ in >hello< meant that in some cases we would successfully get ⟦hello⟧, while in other cases we would end up with ⟧hello⟦. There are a few documents where I had to manually intervene in the XML file to fix the transcription. In such cases, I either fork the dataset repository to keep track of the corrected version of the ground truth, or I push the correction back into the original dataset to create a new, more consistent version.

In general, the converted dataset is saved as .catmus.xml files, which allows us to keep track of the original ground truth and to easily adjust the conversion table later if necessary.

In the second post of this series, I mentioned that "the CATMuS guidelines can (should?) be used as a reference point" and that "if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented." Providing a Chocomufin conversion table along with a dataset that uses project-specific guidelines would be an excellent practice to ensure that the dataset is indeed compatible with CATMuS.

Once all the .catmus.xml files were ready, I created a new metadata table for McCATMuS listing all the subdirectories under each dataset's "data" folder. This table was used as a basis to start collecting additional metadata at the document level rather than at the dataset level, such as the language used in the source or the type of writing (printed, handwritten or typewritten). Working at the document level is important because some datasets contain different types of writing and/or are multilingual. In some cases, when a document mixed different languages and/or different types of writing and the distinction could be made at the image level, I manually sorted the images and created two different subfolders. This is what I did in the "Memorials for Jane Lathrop Stanford" dataset, for example: the subfolder "PageX-LettreX" mixed typewritten and handwritten letters, so I sorted them into "PageX-LettreX-handwritten" and "PageX-LettreX-typewritten" in order to have the most accurate metadata possible.

Other metadata included the assignment of a call number (or shelfmark) to each source represented in the datasets. In some cases a call number may apply to multiple subfolders, but in most cases each subfolder is de facto a different document. Retrieving the call number is useful for several reasons: it allows for an accurate assessment of the diversity of documents in McCATMuS; it allows a document to be associated with additional metadata found in its institution's catalog; and the list of call numbers can be used during benchmarking or production to check whether a document is known to the models trained on the dataset, thus explaining potentially higher accuracy scores.
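Here is what such a check could look like, as a small sketch; the two CSV files and their names are hypothetical, the only assumption being that each one exposes a shelfmark column:

import pandas as pd

test_shelfmarks = set(pd.read_csv("mccatmus_test_metadata.csv")["shelfmark"])
train_shelfmarks = set(pd.read_csv("manu_mcfrench_train_metadata.csv")["shelfmark"])

# documents of the test set that a model trained on the other set already "knows"
overlap = test_shelfmarks & train_shelfmarks
print(f"{len(overlap)} of {len(test_shelfmarks)} test shelfmarks also appear in the train set")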

In the few cases where the source used to build the ground truth did not have a corresponding call number, I simply made one up, keeping "nobs_" as a signal that it is a made-up call number. Thus, while "cph_paris_tissage_1858/" in "timeuscorpus" is now associated with its corresponding call number at the Paris archive center (Paris, AD75, D1U10 386), CREMMAWiki's "batch-04", which is composed of documents we created for the project, is associated with a made-up call number: "nobs_cremma-wikipedia_b04".

In the end, when the PARQUET files are created, the metadata from the table I just presented is collected, along with information extracted from parsing the contents of the XML files. Each piece of metadata is then represented at the text-line level. If you compare McCATMuS with CATMuS Medieval using HuggingFace's dataset viewer, you can see that they don't use exactly the same metadata.

"Language", "region type" and "line type" (which are based on the segmOnto classification), "project" and "gen_split" are common to both datasets, along with "shelfmark" I just described above. They both have a "genre" column with similar values (treatise, epistolary, document of practice, etc.). In the case of CATMuS Medieval, "genre" is complemented by "verse" (prose, verse).

Following Thibault's advice, I defined the creation date of a text line using two numbers ("not_before" and "not_after") instead of a single "century" value. This allows for a precise dating when it is possible or, on the contrary, for spreading the dating over several centuries when that cannot be avoided, which is more accurate in both cases.

McCATMuS mixes printed, handwritten and typewritten documents, so it was important to have a "writing type" column to help filter the dataset on this information, in cases where one does not want to mix them. This metadata also makes it possible to use McCATMuS to train a classifier capable of distinguishing between the different types of writing. CATMuS Medieval, on the other hand, contains only handwritten sources, so such metadata would be useless; instead, it can rely on paleographic classifications to characterize each text line with a "script type" metadata, which includes values such as "caroline", "textualis", "hybrida", etc.

McCATMuS also has a "color" column that helps sort text lines based on whether the source image is colored (true) or in grayscale (false).
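In practice, these columns make it easy to carve out sub-corpora directly from the published dataset. Here is a sketch using the HuggingFace datasets library; the dataset identifier and the exact column names are given as an assumption, to be checked against the dataset card:

from datasets import load_dataset

dataset = load_dataset("CATMuS/modern", split="train")  # hypothetical identifier

# keep only handwritten lines, on colour images, dated within the 19th century
subset = dataset.filter(
    lambda row: row["writing_type"] == "handwritten"
    and row["color"]
    and row["not_before"] >= 1800
    and row["not_after"] <= 1900
)
print(len(subset))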

Although I reused the scripts developed by Thibault to build this dataset, I had to make several modifications to include the new metadata in the PARQUET files and to apply additional filtering to the text lines. This included updating the mapping to the segmOnto vocabulary to match what existed in my datasets, and filtering out some types of lines, such as those identified as signatures.2 I also updated "writing_type" at the line level whenever the value of "line_type" made it possible to check it, as in the snippet below.

if ":handwritten" in line_type:
    writing_type = "handwritten"
    line_type = line_type.replace(":handwritten", "")
elif ":print" in line_type:
    writing_type = "printed"
    line_type = line_type.replace(":print", "")
elif ":typewritten" in line_type:
    writing_type = "typewritten"
    line_type = line_type.replace(":typewritten", "")
else:
    writing_type = metadata["writing_type"]

In the end, having built such a dataset (the first version of McCATMuS contains some 117K text lines!) with such a variety of metadata is very satisfying, although there is room for improvement. I have already mentioned that it would be interesting to have a greater variety of languages in McCATMuS. I also know that some of the values in "writing_type" are not completely accurate, so adding a control based on a classifier might be interesting. Finally, I've noticed that some transcriptions in the "FoNDUE_Wolfflin_Fotosammlung" dataset are not correct at all, probably due to an automatic transcription that wasn't corrected.

However, before we dive into improving McCATMuS, it's important to first examine the accuracy of the models that can be built on top of it! This will be the topic of the next and last post in this series!


  1. To learn more about how chocomufin convert works, just read the software's short documentation. 

  2. I don't think it makes sense to include signatures in a dataset to train a generic model, since the transcription of such lines can be very context specific. 

020 - McCATMuS #3 - Datasets selection

HTR-United made identifying candidate datasets for McCATMuS a piece of cake. Once the rest of the CATMuS community agreed on the period to be covered by a "modern and contemporary" dataset, I created a simple script to parse the content of the HTR-United catalog and list the existing datasets covering documents written in the Latin alphabet and matching our time criteria.

Actually, here is the script!

import requests
import yaml

import pandas as pd

# get latest htr-united.yml from main repository
url_latest_htrunited = "https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
response = requests.get(url_latest_htrunited)
catalog = yaml.safe_load(response.content)

def in_time_scope(dates):
    century_scope_min = 1600
    century_scope_max = 2100
    # this means that we allow datasets that intersect with the period
    if int(dates.get("notBefore")) < century_scope_min and int(dates.get("notAfter")) < century_scope_min:
        return False
    elif int(dates.get("notBefore")) > century_scope_max and int(dates.get("notAfter")) > century_scope_max:
        return False
    return True

filtered_by_date = []
for entry in catalog:
    if in_time_scope(entry.get("time", {})):
        filtered_by_date.append(entry)
print(f"Found {len(filtered_by_date)} entries matching the time scope.")

targeted_script = "Latn"
filtered_by_script = []
for entry in filtered_by_date:
    if targeted_script in [s.get("iso") for s in entry.get("script")]:
        filtered_by_script.append(entry)
print(f"Found {len(filtered_by_script)} entries matching the script criteria.")

cols = ["Script Type", "Time Span", "Languages", "Repository", "Project Name", "Dataset Name"]

metadata_df = pd.DataFrame(columns=cols)

selected_entries = filtered_by_script
for entry in selected_entries:
    row = {k:"" for k in cols}
    languages = [l for l in entry.get("language", [])]
    if len(languages) == 1:
        row["Languages"] = languages[0]
    elif len(languages) > 1:
        row["Languages"] = ", ".join(languages)
    else:
        print("Couldn't find a field for language in this repository")
        row["Languages"] = "no language"
    # build the time span string (notBefore-notAfter)
    row["Time Span"] = f'{entry.get("time").get("notBefore")}-{entry.get("time").get("notAfter")}'
    row["Project Name"] = entry.get("project-name", "no project name")
    repository = entry.get("url", "no url found")
    if repository.startswith("https://github.com/"):
        row["Repository"] = repository.split("https://github.com/")[-1]
    elif repository.startswith("https://zenodo.org/"):
        row["Repository"] = repository.replace("https://zenodo.org/", "zenodo:")
    else:
        row["Repository"] = repository
    row["Dataset Name"] = entry.get("title", "no title found")
    script_type = entry.get("script-type")
    if script_type == "only-typed":
        row["Script Type"] = "Print"
    elif script_type == "only-manuscript":
        row["Script Type"] = "Handwritten"
    else:
        row["Script Type"] = "Mixed"
    metadata_df.loc[len(metadata_df)] = row

metadata_df
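The last step, saving the table for the manual review described below, is a one-liner appended to the script above (the filename is mine):

metadata_df.to_csv("htr_united_candidates.csv", index=False)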

I saved the output as a CSV and proceeded to go through each of the selected datasets and its metadata. I checked several things:

  • I made sure the datasets were available and easy to download. For example, I excluded those requiring manual image retrieval.
  • I checked the format of the data because I decided to initially focus only on datasets available in ALTO XML and PAGE XML.
  • I controlled the overall compatibility between the transcription guidelines used for the dataset and those designed by CATMuS.
  • I also checked the conformity of the dataset by trying to import it into eScriptorium. This import allowed me to detect discrepancies between the names of the image files and the value given for the source image in the XML file, which prevented the import from running successfully.1
  • Loading a sample of the dataset in eScriptorium also allowed me to visually control other incompatibilities with CATMuS that may not have been documented by the producers of the data.2
  • Finally, I considered the structure of the repository and, when necessary, the facility to reorganize it into a single data/ folder containing the images and the XML files, often distributed among sub-folders.

I assigned each dataset a priority number from 1 to 6. The lowest number was for datasets compatible with CATMuS without any modification (no dataset was given a priority rank of 1...) and 6 was for massive datasets that would require a nerve-racking script to be built correctly. My grading system is shown below.

  • 1=ready as is
  • 2=need to be chocomufin-ed
  • 3=require manual corrections but the dataset is very small, or the dataset is chocomufin/catmus compatible but requires a script to build it
  • 4=require manual corrections but the dataset is relatively big, or require a script to be fixed
  • 5=require manual corrections but the dataset is really big
  • 6=require manual corrections but the dataset is really big and require a personalized script to be built

For example, "Notaires de Paris - Bronod" had to be modified to comply with CATMuS requirements. This included replacing [[ and ]] with and , or also to ignore lines containing ¥, a symbol used in LECTAUREP's datasets to transcribe signatures and paraphs. These were straightforward modifications, thanks to Chocomufin. On the complete opposite, "University of Denver Collections as Data - HTR Train and Validation Set JCRS_2020_5_27" is a massive dataset (2660 XML files), but there are segmentation errors in this dataset, creating erroneous transcriptions given the way the line is drawn, and the annotation of the superscripted text is not compatible with CATMuS. To make it compatible with CATMuS, it would be necessary to control and correct each page one by one.

I chose to focus on datasets with priority 2 for the first version of McCATMuS. Indeed, it will be possible to add more datasets to CATMuS in later versions, so there was no need to spend too much time on manually cleaning datasets. I had 23 priority-2 datasets to go through.

Identifying eligible datasets was not as time-consuming as cleaning them and collecting additional metadata turned out to be. However, it gave me a good idea of the challenges I would face when trying to aggregate the datasets. I would have liked to find a greater diversity of languages, but this wasn't possible at this stage, mainly because many non-French datasets require more elaborate corrections than applying Chocomufin and were thus given a priority score higher than 2.

The next post will be covering the tedious phase of data cleaning and aggregation, along with metadata collection!


  1. It was the case for "Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)", where the ALTO XML files are not explicitly linked to their corresponding source images. I believe it can be fixed, but it would require creating a script just for this purpose, and the dataset presented other incompatibilities with CATMuS' guidelines. 

  2. For example, "Argus des Brevets" contains some segmentation errors that will need to be corrected manually. 

019 - McCATMuS #2 - Defining guidelines

Previous experiments have shown that conflicting transcription guidelines in training datasets make it less likely that a model will learn to transcribe correctly. This is particularly relevant when it comes to abbreviations, and it's something to keep in mind when merging existing datasets. We didn't really address this when we trained the Manu McFrench model because it's difficult to retroactively align datasets to follow the same transcription rules. Unless you can afford to manually check every line, of course. In the case of Manu McFrench, however, we only merged datasets that didn't resolve abbreviations, which ensured a minimum of cohesion.

CATMuS was built on the foundation laid by CREMMALab and the annotation guidelines developed by Ariane Pinche at the end of a seminar organized in 2021. These guidelines are intended to be generic, meaning they should be compatible with most transcription situations and are not project-specific. Following these guidelines will help data producers create ground truth that is compatible with data from other projects. It will also help those projects save time by not having to create transcription rules from scratch. From my experience, it is indeed easy for the members of a project discovering HTR to get caught up in the specifics of one project and forget what is and is not relevant (or even complicating) in the transcription phase.

It's worth mentioning that a project can choose to follow some of the CATMuS guidelines, while maintaining more specific rules for certain cases. If that's the case, the CATMuS guidelines can (should?) be used as a reference point. Ideally, the specific rules defined by a project should be retro-compatible with CATMuS. For example, if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented.

As CREMMALab focused on the transcription of medieval manuscripts, so did the first CATMuS dataset and guidelines. As I said in my previous post, I focused on data covering the modern and contemporary periods, for which there was no equivalent to the CREMMALab guidelines. So, when extending CATMuS to these periods, I started by collecting existing guidelines and comparing them. I used the CREMMA Medieval guidelines, the CREMMA guidelines for modern and contemporary documents, SETAF's guidelines and CATMuS Print's guidelines as a basis to elaborate the transcription rules for McCATMuS.

For each rubric, I compared what each set of rules suggested, when they covered it. It was rare for all guidelines to align, but some cases were easy to solve. For example, all the guidelines recommended not to differentiate between regular s (⟨s⟩) and long s (⟨ſ⟩), except for the rules I had set for the modern and contemporary sources transcribed by CREMMA in 2021, before the CREMMALab seminar. It was thus decided that for McCATMuS there would be no distinction between all types of s's.

Some rubrics needed to be discussed to figure out why the rule had been chosen in the first place by some of the projects, to decide which one to keep for McCATMuS. In February, I met with Ariane Pinche and Simon Gabay to go over the rubrics that still needed to be set. One example of a rule we discussed is how hyphenations are handled. CATMuS Medieval and the two CREMMA guidelines say to always use the same symbol (⟨-⟩), whereas for the SETAF and CATMuS Print datasets, inline hyphenations (⟨-⟩) are differentiated from hyphenations at the end of a line (⟨¬⟩). Other symbols, like ⟨⸗⟩, were unanimously rejected.

Two factors were considered when making those decisions: the feasibility of a retro-conversion for the existing datasets and the compatibility of the rule with a maximum of projects. In the case of hyphenation, I eventually decided to follow the same rule as CATMuS Medieval and CREMMA. On top of simplifying the compatibility of McCATMuS with CATMuS Medieval, replacing all ⟨¬⟩ with ⟨-⟩ was much more straightforward than retroactively placing ⟨¬⟩ wherever there was indeed a hyphenation at the end of a line1.
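In code, the chosen direction really is trivial, while the opposite one is not a pure string operation. A small sketch to illustrate the asymmetry:

def to_catmus_hyphenation(line: str) -> str:
    # chosen retro-conversion: every end-of-line hyphenation mark becomes ⟨-⟩
    return line.replace("¬", "-")

# The reverse direction would require deciding, for every line ending in "-",
# whether the hyphen really splits a word across two lines (e.g. by looking at
# the next line or a lexicon) before turning it into "¬".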

Once the set of rules was fixed, I used it to sort through the different datasets I had identified (I'll discuss this in the next post) and to decide which ones would be retained for McCATMuS v1. I also defined the transformation scenarios necessary to turn each of these datasets into a CATMuS-compatible version. Then, once McCATMuS v1 was ready, I integrated the modern and contemporary guidelines into the CATMuS website, where the transcription guidelines for CATMuS Medieval were already published.

Now that I am done integrating the rules set for McCATMuS into the website, I am confident that we have successfully designed rules that are overall compatible across the medieval, modern and contemporary periods, despite some unavoidable exceptions. Two good examples of the impossibility of covering a whole millennium of document production with a single rule are abbreviations and punctuation signs.

I've now explained how the transcription guidelines were established for McCATMuS. Next, I'll cover how they were integrated into existing datasets to create the first version of the McCATMuS dataset.


  1. You can't assume that every instance of ⟨-⟩ at the end of a line must be replaced with a ⟨¬⟩. In many cases, this can be a simple typographic decoration marking the end of a paragraph or the end of a title. 

018 - McCATMuS #1 - Overview

Last week, I attended ADHO's annual conference in Washington DC. I presented a short paper, co-authored with Floriane Chiffoleau and Hugo Scheithauer, about the documentation we wrote for eScriptorium (I wrote a post about it last year and you can also find our presentation here). I was also a co-author on a long paper presented by Ariane Pinche on the CATMuS Medieval dataset.

CATMuS, which stands for "Consistent Approach to Transcribing ManuScripts", is a collective initiative and a framework to aggregate ground truth datasets using compatible transcription guidelines for documents from different periods written in Romance languages. It started with CATMuS Medieval, but since January this year, I have been working on a version of CATMuS for the modern and contemporary periods.

While I should (and will) try to publish a data paper on CATMuS Modern & Contemporary (I'll call it McCatmus from now on), I figured I could start with a series of blog posts here. I want to describe the various steps I followed in order to eventually release a dataset on HuggingFace and hopefully soon the corresponding transcription model.

I started working on McCatmus in January, but because of a major personal event (I moved to Canada!), it took seven months of stop-and-go before the release of the V1. This was particularly challenging due to the scale of the project and its technicality (it was hard to get back into McCatmus after several weeks of interruption, which I had to do several times).

To add to this complexity, McCatmus was also a multi-front operation. Indeed, to create McCatmus, it was necessary to:

  • define transcription guidelines in collaboration with other data producers,
  • identify datasets compatible with the guidelines and set priorities,
  • actually make all the datasets compatible with each other and clean some of the data,
  • model and collect metadata that made sense for this dataset,
  • release the dataset and fix the issues that came up.

To this date, two tasks remain on my to-do list for McCatmus: train a transcription model corresponding to this dataset and compare it with other existing ones, and make sure to have a publication describing this dataset and its usefulness.

My plan is to dedicate one post to the creation of the guidelines for the dataset, then a post about the identification and collection of the datasets used in McCatmus v1, and then I'll wrap up with a post about the process to create the dataset, the metadata and the release. Stay tuned!

017 - Deploying eScriptorium online: notes on CREMMA's server specifications

eScriptorium is a web application designed to perform automatic text recognition campaigns, by default powered by the OCR/HTR engine Kraken. It comes in a decentralized form, meaning that the application is not distributed by a single organization but can, on the contrary, be deployed by several actors on many different servers. In fact, you can also deploy eScriptorium on your personal machine, simulating a local server.1

As eScriptorium is gaining attention, more institutions are interested in building their own server to host the application and offer it to their associates. At Inria, we deployed eScriptorium for the first time in 2020, specifically for the project called LECTAUREP which we ran with the French national archives between 2018 and 2021. While the initial server was hosted on a virtual machine, without any GPU, and open to a relatively small number of users, our current eScriptorium application already counts nearly 500 users and will soon be hosted on a very different server infrastructure, funded by the CREMMA project. Between the original LECTAUREP-eScriptorium server and the CREMMA server, we moved to a dedicated server (Traces-6) for which we invested about 20K€.

Since I have been regularly in touch with people from different institutions who were looking into buying the hardware to create their own server for eScriptorium, I thought it was high time to put all the deets in writing!

To write today's post, I'm very happy to welcome a second pair of hands: Thibault Clérice's. His expertise and involvement in designing CREMMA server are crucial here!

Let's first discuss some technical requirements, then we'll describe how the CREMMA server was designed. We'll finish with some very important remarks on the necessity (or not) of building a server, and on useful alternatives for the community!

Should you buy GPUs?

GPUs (or Graphics Processing Units) are not mandatory at all when you use eScriptorium. This is the reason why it is perfectly acceptable to run eScriptorium locally, on your own computer. Actually, GPUs are not even mandatory to train Kraken models: training can be done on CPUs (your computer's processor); it will simply be much, much slower.

That, however, is true for personal or light use of the training features. If on the contrary you create a server open to dozens of users or more, then connecting eScriptorium to GPUs is very much a good idea: since training a model on a CPU alone can take 2-3 days (or much more), you don't really want 10 users to start a training task at the same time. In the absence of shared GPUs, their training will be queued for days or even weeks and the overload might degrade the experience of other users on the rest of the application. As long as we are building an infrastructure (and hopefully sharing costs), we may as well enhance the experience of everyone, no?

This being said, you shouldn't rush and go buy a GPU right away. Instead, you should first look at options to optimize its usage or at infrastructures that are already available to you. For example, the FONDuE infrastructure, at the University of Geneva, doesn't use the GPUs only for eScriptorium: they connect their application to a cluster which is used by researchers for intense computation tasks outside of eScriptorium (it's an HPC with a university-wide queue controlled by SLURM). This is a very good solution for optimization, because training Kraken models is not a constant activity: if the GPU is dedicated to eScriptorium only, then it will be used for a few hours here and there, not even at 100% of its capacity. Think of it: users of the application will usually need to train a model at the beginning of their transcription campaign, therefore once they have an accurate model, they will focus on using the model for prediction, which doesn't rely on the GPUs (and Kraken isn't really optimized for GPU usage at prediction time anyway).

Other possibilities include connecting the server to a completely physically separate cluster where training jobs are submitted. This is a possibility that several people told me they were exploring, but I don't know if anyone has set it up already. Why would you opt for a solution with an external cluster? To replace a huge investment cost (original funding) with smaller (but much more regular) operating costs: for example, for CREMMA, nearly half of our 40K€ budget was spent, in 2022, on buying two A100 graphic cards from Nvidia. When using someone else's GPUs, not only do you save the money you would spend on the hardware, but on top of that, you contribute to optimizing the use of GPUs already in place. Another reason is that you might not have the human resources to administer the system and the GPUs. There are multiple computation clusters created for academia (off the top of our heads: Jean Zay or Calcul Québec), and you could even consider using commercial solutions as well (like AWS, Google Cloud and the like). Then your money is spent on the actual computation and not on making the computation possible in the first place.

Fair enough, plugging eScriptorium's task manager into an external server might not be that simple. However, for smaller groups of users, it is also worth taking into account that it is perfectly possible to train Kraken models using Kraken directly (through an SSH connection to a (super-)cluster, for example) before uploading them into the application. In such a case, eScriptorium is only used for its ergonomics, not as a simplified interface to train models.

Let's summarize the point here: GPUs are not always a must-have for eScriptorium or Kraken, so you should definitely consider first and foremost your future usage. They currently represent the biggest share in the hardware expenses to build a calculation server. There are options out there where you don't spend 10K€ to buy a GPU but rather connect to an external, ready-to-use service. Or, if you do decide to spend the money, you should consider ways to maximize its usage for other training tasks, possibly outside of eScriptorium.

Some considerations on storage

Normally, eScriptorium is used as an (assisted) annotation environment to obtain the transcription of documents. You would use eScriptorium:

  1. In a preparatory phase:
    • (1a) to produce training data, and
    • (1b) to elaborate (aka train) performant segmentation or transcription models;
  2. In a production phase, but only for relatively small corpora, to apply segmentation and transcription models and manually correct the results (in which case the size of the corpora must be compatible with the scale of what an individual or your assembled team can process);
  3. In a post-production phase, including for samples of a very large corpus, to easily visualize and control the result of the (large-scale) automatic prediction and potentially correct it (cf. n°2).

On the other hand, large-scale transcription campaigns should probably be led with Kraken directly in the command line (so only n°1 and n°3 require eScriptorium). Thibault has even produced a small Python library to design such campaigns (RTK, for Release the Krakens), which was recently used in a paper2 where a 38.5M-token corpus was produced. In some cases, n°1b even benefits from being performed outside of eScriptorium, since the application offers very limited control over Kraken's training parameters.

This has several consequences on the way you should consider storage on a server dedicated to eScriptorium. Duplicates of images are created on the server while they are being processed in the application, but they should always be considered as such: temporary duplicates while phase 1, 2 or 3 are under progress. They shouldn't be considered as if eScriptorium was 1) an archiving solution for transcription projects, 2) a querying interface to explore a corpus or even 3) a publication environment for a minimalistic digital edition. eScriptorium is only one brick --an early one even-- in the corresponding pipelines. Instead, the original image files should be stored somewhere else, in an adapted data warehouse (like Zenodo, Nakala, etc.), or published in digital libraries under the responsibility of their owner (like Internet Archive, Gallica, etc.).

What this means when designing a server to host eScriptorium is that its storage capacity should of course be big enough to store the temporary image files3 of active projects, while users are working on their annotation. However, this storage doesn't need to be expanded all the time, and it should also be OK to flush terminated projects: at that point, the images and their annotations should have been archived on more appropriate data warehouses by their creators, and it is their responsibility to do so.

Don't forget the RAM!

Not overlooking the RAM is very important when designing your server! But what is it used for? It's used for cache by the web application: it means that frequently accessed data, like web pages and images but also the content of the database, are temporarily loaded in live memory. Cache thus ensures that the requests sent by the users are served quickly. For example, if you don't have enough RAM (or enough cache), pages will load slowly, and if you have used eScriptorium before reading this post, you know how important it is to be able to load images fast enough.

RAM is also essential for inference and training because images and annotations are loaded in memory before being passed to the CPU or the GPU. If there isn't enough RAM, it will be detrimental to computation and will create a bottleneck. Having invested in GPUs and/or CPUs but not in enough RAM would be like having a horse pull a Ferrari: even if prediction and training could go fast on the processing units, they will be held back by the available live memory.

Modularity for the CREMMA infrastructure

The CREMMA infrastructure was originally designed by Thibault with a simple but essential principle in mind: modularity. Instead of thinking of an eScriptorium server as a monolithic block of hardware designed for front-end service, storage and intense computation, he suggested to break each of these blocks into individual servers connected together. CREMMA4 is thus made of at least three servers, as shown in the schema below:

  • CREMMA_FRONTEND, for the front-end, where the application is deployed and where the database is stored.
  • CREMMA_STORAGE, for storage, where all the images and models, as well as the backup of the database are stored on the long term. Currently, CREMMA_STORAGE has a storage capacity of 38Tb5 but we could easily add more disks if we find that it is necessary.
  • CREMMA_COMPUTE, where the two A100 GPUs I mentioned earlier are plugged and where the application task manager "sends" all the jobs, whether they are to be run on CPU (these tasks include segmentation and transcription prediction for example), or on GPU (training for the most part).

A model of the CREMMA infrastructure where three blocks (front-end, storage and compute) are connected together through an intranet 10Gb/s connection. For each block, one or two server(s) is presented along with their specification. Credits: Thibault Clérice and Alix Chagué. The full text of the specifications is accessible in a commentary in the source code of this page, just after this image.

As you can see on the schema, there will actually be a fourth server involved in the infrastructure: Traces-6, the server we currently use to deploy eScriptorium at Inria. Like CREMMA_COMPUTE, Traces-6 can be called by CREMMA_FRONTEND for computation tasks. In fact, this is where the modularity of the system is interesting: with such a set-up, it is possible to add more computation servers to the pool of GPUs reachable by CREMMA_FRONTEND without having to redesign the whole infrastructure. On their side, CREMMA_FRONTEND and CREMMA_STORAGE can be upgraded (to add more RAM or more storage) very easily.

This modularity also means that the GPUs remain free for other uses: for example, if we had to run maintenance on CREMMA_COMPUTE, we could simply cut it off from the infrastructure and let CREMMA_FRONTEND interact with Traces-6 only while we work on CREMMA_COMPUTE.

CREMMA_COMPUTE is equipped with two A100 graphic cards, and Traces-6 with two RTX 6000s. This doesn't mean that only 4 trainings can happen at once, though. Each of these GPUs offers between 24 and 40 Gb of RAM for intense computation. It's a lot. It's so much, actually, that training a Kraken model at max speed would rarely use more than 40% of this processing power. Virtualization is a nice trick to "break" the GPU down into smaller virtual GPUs (or vGPUs). What is broken down is the RAM capacity. We opted for the following virtualization set-up:

  • Each of the A100 graphic cards and their 40Gb of RAM are turned into 1 10Gb vGPU + 5 5Gb vGPUs (since 10+5x5=35, note that we must leave 5Gb out of the equation for the virtualization).
  • No virtualization is applied to Traces-6's RTX6000s.

How did we decide on these numbers? Thibault ran a series of small tests executing either segtrain or train and playing with two different parameters: the batch size6 and the floating-point precision7. He found that for training a recognition model with a batch size of 8 and a precision of either 32 or 16, less than 5 Gb of RAM on the GPU is enough. With a batch size of 1 and a precision of 32, it's even less than 1 Gb. To train a segmentation model, less than 10Gb is enough, and this type of training is rarer. Since our goal for the infrastructure is not to maximize the speed of a single training but to maximize the number of parallel training jobs running at decent speed, we decided that 10 vGPUs with 5Gb of RAM and 2 vGPUs with 10Gb of RAM were a good compromise. If we find that more GPU RAM is occasionally needed, we still have two times 24Gb with the RTX 6000s!

Should you build your own server?

We have spent all this time writing about how to build, how to spec out your server or your infrastructure, but let's talk about the elephant in the room: should you do it?

Well, it's all a matter of perspectives. We'd say it probably makes sense if:

  1. You are a very big organization, you have a lot of money available to you, a super-cluster (and possibly a well staffed IT services department), and you have a high demand;
  2. You are working on very sensitive data that can't be shared with the outside (e.g. medical reports);
  3. You are geographically far away from any other existing server, and face latency issues when you connect to potential welcoming servers;
  4. The servers that exist around you are reluctant to onboard you and the teams behind your request for a server of your own.

These four points are definitely valid. But we'd say that, if you are in another situation, sharing infrastructural costs probably makes way more sense. In our experience, building a server is long, tedious, requires special (and rare) skills8, and is costly (in terms of human resources as well!). Setting up a working server can take a really long time. For CREMMA, we ended up outsourcing part of the installation of the new infrastructure because we realized that we had neither the time nor the skills to set everything up ourselves. The cost of this installation by a third party? Between 8 and 12K€, and again, a little time and bandwidth on our end.

Next, you have the maintenance fees. You can outsource them, for a small bill from a company which would make sure that everything is installed on time, that updates work well, etc. Or you can do the maintenance yourself. But again, this comes with a cost: human time. A worker on the server goes down? You are in for a few hours. Some people crashed a third-party server by uploading too many IIIF images to your instance of eScriptorium? Well, then you will not only receive emails from these third parties (and this is completely normal), but also have to deal with your user base doing things that eScriptorium allows and that you may not (yet) be able to control or limit.

In the end, we would definitely recommend that, when this is possible, you first consider joining existing servers, including by offering quid pro quo by:

  1. Participating in covering the salary of people maintaining the server (through some kind of yearly fees for example);
  2. Providing some money to expand the existing infrastructure (to increase storage or computation, etc);
  3. In general, helping eScriptorium grow, discussing with the owners of the server you are joining and/or the eScriptorium team about what kind of new functionality should be added, and if you can contribute to fund these updates.

This final point is super important: sure, owning your own server sounds appealing, even if it is costly to put in place. However, developing eScriptorium also comes with expenses. Thus, participating in eScriptorium directly is -- we think -- also very beneficial and welcomed by the development team. Open source is free to use, free of charge, but it does not appear out of thin air: development costs money. And the more people participate in infrastructural costs (servers or software), the better the experience will be.


  1. If you don't know anything about local servers and are curious to learn more, you can check this page: https://www.freecodecamp.org/news/what-is-localhost/. Or you can also take a look at the corresponding entry in Wikipedia! 

  2. The full reference is: Jean-Baptiste Camps, Nicolas Baumard, Pierre-Carl Langlais, Olivier Morin, Thibault Clérice, et al.. Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives. Computational Humanities Research (CHR 2023), Dec 2023, Paris, France. ⟨hal-04250657⟩ 

  3. By temporary, we don't mean that the image files are stored for a few hours only; on the contrary, they can stay on the disk for many years. We mean that it should be OK to consider that they can be erased whenever a user is done working on a corpus and has moved away from the transcription phase. 

  4. From now on, "CREMMA" means the server created through the CREMMA project. 

  5. Safety first! We have 38 Tb available, but there is actually a little more physically because we have redundancy and spare. We have 2 series of disks working with redundancy (RaidZ). In each series two disks are entirely dedicated to redundancy only, and one more is completely unused until something fails (it is used as a safety spare disk). While CREMMA_STORAGE, as we said before, is not used as a permanent storage solution, it needs to be a little bit safe for the user base. 

  6. To understand what the batch size corresponds to and why it is important, you can check this entry in the Stack Exchange forum: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

  7. To quote Kraken's documentation: "When using an Nvidia GPU, set the --precision option to 16 to use automatic mixed precision (AMP). This can provide significant speedup without any loss in accuracy." Kraken's default value for precision is 32. 

  8. It can be difficult to justify hiring a full-time or even part-time system administrator for a team, because it is a very specialized profile in high demand. For example, public organizations can rarely offer competitive salaries compared to the private sector. In addition, the workload for administrating a web server can be irregular, and it can be difficult to match system administration skills with the other needs of a team, making it even harder to offer a meaningful full-time job. 

016 - Text Recognition, Large Models and Expectations

Since the boom around ChatGPT almost a year ago, I've heard several people wondering if "tools like ChatGPT" were more efficient than HTR models trained with Kraken and the like. The glimmer of hope in their eyes was most likely lit by their own struggle to set up successful and/or efficient HTR campaigns with more traditional tools. The capacity of Large Language Models (LLMs) to reformulate a text1 or, more specifically, of Large Multimodal Models (LMMs) to generate text based on a visual input may indeed lead people to believe that HTR technologies built on CNNs are on the verge of being flipped upside down.2

Annika Rockenberger recently conducted a series of small experiments on the matter and wrote an interesting blog post about it. Let's summarize it!

She signed up for a premium subscription (25$/mo) to be able to chat with GPT4, which allows users to upload images. Then she submitted printed or handwritten documents she would normally transcribe with Transkribus and assessed the results. She found that GPT4 was fairly good on ancient print (German Fraktur) and that it was even able to follow transcription guidelines if provided with an example. However, on a letter in handwritten cursive, the model completely hallucinated the content and attempted a transcription in the wrong language. This didn't change when she provided more context about the document. Rockenberger concludes that there is potential for using ChatGPT for HTR, but that its capacity to scale up is completely uncertain, and that learning how to provide good prompts to get the appropriate results is a challenge. I would also add that, in the end, Rockenberger paid 25$ to get 10 lines of raw text, whereas with software like Transkribus or eScriptorium, she would also get a standard structured output.

So, in other words, after reading Rockenberger's post, one can conclude that GPT4 (or, better, similar free and open-source models) does have a potential for "quick and dirty-ish" OCR. However, I would argue that users tempted by this strategy might still miss an important point: even LMM-based tools will require a little bit of organization and precision from the users. This, I find, is often what is lacking in unsuccessful HTR campaigns. LMMs could generate a good output, but you will likely have to pay for it one way or another: with lower text recognition quality, with hallucinated text content, with an impoverished, non-structured output, with premium fees, etc.

Earlier this year, an article by Liu et al. (2023), "On the Hidden Mystery of OCR in Large Multimodal Models", explored almost exactly the same topic in a more comprehensive way: it presents an extensive survey of how well several Large Multimodal Models (LMMs) perform on "zero-shot" tasks.

Zero-shot refers to the act of requesting an output from an LLM or an LMM without having trained it for this particular task. It is very similar to Rockenberger's first attempt with GPT4, when she uploaded the image of a printed document and asked for its transcription. In such a case, she relied on the capacity of the model to transfer its knowledge to the specific task of text recognition, on a specific type of document (historical printed text).

Other terms are often associated with "zero-shot": "one-shot" and "few-shot". One-shot is equivalent to Rockenberger's second attempt, when she showed GPT4 an example of the output she expected for the first 10 lines of the document and requested that the model copy her strategy to generate the transcription of the next 10 lines. Few-shot would mean showing several pages and their expected outputs to the model before asking for the transcription of a new document.3
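To make these notions a bit more concrete, here is a purely schematic sketch of how the three prompting strategies differ for a transcription task. These are plain data structures meant to illustrate the shape of the prompts; they are not tied to any actual API.

```python
# Schematic prompts only: no real model or API is called here.

zero_shot = [
    {"role": "user",
     "content": ["<image of the document>",
                 "Please transcribe this document."]},
]

one_shot = [
    {"role": "user",
     "content": ["<image of the document>",
                 "Here is my transcription of the first 10 lines: <example transcription>",
                 "Transcribe the next 10 lines following the same conventions."]},
]

few_shot = [
    {"role": "user",
     "content": ["<image of page A>", "<expected transcription of page A>",
                 "<image of page B>", "<expected transcription of page B>",
                 "<image of a new page>",
                 "Transcribe this new page following the same conventions."]},
]
```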

The paper focused on currently available LMMs representing five different approaches to training such models: BLIP-2, OpenFlamingo, LLaVa, MiniGPT4 and mPLUG-Owl (see the results table below).

They evaluated the models on 4 tasks: text recognition, text-based visual question answering, key information extraction and handwritten mathematical expression recognition. Here are a few examples of what these tasks entail, as illustrated in the original article (on the images, P stands for Prediction and GT for Ground Truth):

Task                                              Example (image from the original article)
Text Recognition                                  Examples of failed Text Recognition
Visual Question Answering                         Examples of failed Visual Question Answering
Key Information Extraction                        Examples of failed Key Information Extraction
Handwritten Mathematical Expression Recognition   Examples of failed Handwritten Mathematical Expression Recognition

For each task, they used several datasets presenting different challenges. For each of these datasets and tasks, they retrieved the scores of the state of the art (SOTA) for supervised methods and used them as a baseline. For example, for text recognition on the IAM dataset, the SOTA method, AttentionHTR4, reaches a word accuracy of 91.24%.5 In comparison, Liu et al. provide the following scores for the tested LMMs on this dataset:

Tested LMM          Word accuracy on IAM (%)
BLIP-2 OPT6.7b      38.00
BLIP-2 FlanT5XXL    40.50
OpenFlamingo        45.53
LLaVa               50.40
MiniGPT4            28.90
mPLUG-Owl           42.53
------------------  ------------------------
Supervised SOTA     91.24

The illustrations provided in the article all show failed attempts, but this matches the overall impression conveyed by the results of the experiments: when prompted with zero-shot tasks, LMMs are largely outperformed by state-of-the-art supervised methods, as the scores for text recognition on the IAM dataset show. The only exception is BLIP-2 on a text recognition task over a more challenging dataset of artistic text (WordArt). The authors see this as a sign that LMMs have a promising potential for visually complex texts.

A very important section of their paper is their remarks on the relationship between LMMs and semantics. When submitting images of non-words to the LMMs, they find that the models systematically over-correct the prediction and suggest real words as an answer. Traditional text recognition approaches, on the other hand, are much less sensitive to how plausible the words to be recognized are. Similarly, this reliance on semantics interferes with the LMMs' output: they recognize common words more easily and make up additional letters ("choco" is read as "chocolate"). Lastly, LMMs are insensitive to word length: they are unable to count how many letters appear in the image of a word. These results are similar to what Rockenberger experienced with the handwritten letter: the model hallucinated words to compose a semantically plausible letter, but with the wrong date, the wrong names and the wrong language.

Liu et al. conclude their paper by reminding us that they only tested the capacities of the models with zero-shot prompts, whereas there are already successful attempts at fine-tuning LLMs and LMMs on specialized tasks, such as medical prediction. In fact, I think such attempts already exist in the context of HTR as well: it seems to be the ambition of a model like Transkribus' Text Titan, released at the beginning of the summer, which is based on a Transformer coupled with an LLM. Unfortunately, I wasn't able to find more information on this model aside from the community-oriented communications released by Transkribus on their website (here and here).


  1. Instead of a multimodal approach, Salvatore Spina explored the possibility of using an LLM-based tool like ChatGPT to post-process the results of HTR and correct the text. See: Spina, S. (2023). Artificial Intelligence in archival and historical scholarship workflow: HTS and ChatGPT (arXiv:2308.02044). arXiv. arXiv.2308.02044

  2. Multimodality is presented by some researchers of the Digital Humanities community as a real epistemological turn for the field. See for example: Smits, T., & Wevers, M. (2023). A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections. Digital Scholarship in the Humanities, fqad008. doi: 10.1093/llc/fqad008 ; or Impett, L., & Offert, F. (2023). There Is a Digital Art History (arXiv:2308.07464). arXiv. arXiv.2308.07464

  3. There are a few videos offering more or less detailed explanations of these expressions in the context of prompting an LLM. However, this terminology is not specific to LLMs; it is also used, for example, in the context of classification or other NLP tasks. 

  4. Kass, D., & Vats, E. (2022). AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks (arXiv:2201.09390). arXiv. arXiv.2201.09390

  5. In this case, word accuracy is used as the baseline to compare the different approaches. However, in general, it is not a good idea to only take word accuracy into account to understand a model's performance in real life. This is something I discussed in this post. 

015 - Block post and comprehensive Exam

When I created this blog last year, I wanted to post regularly on it, something like once a month or once every other month. I didn't want to put pressure on myself to write, but I also wanted to make sure that this blog would stay alive. I often have ideas for topics for a post, but then, when the time comes to write, I blank out. It's not exactly that I don't know where to start; it's just that I sometimes can't figure out what the message is that I want to convey. Like, if I had to summarize my blog post in two lines, what would the take-away be? I get stuck when I cannot find an answer, but maybe I shouldn't worry that much about it. It's my blog after all, and maybe the message will come by the time I'm done writing.

So, without further ado, let's dive in: I was super excited this Summer after passing my comprehensive exam, and I really wanted to write a post about it. I had a really packed Spring and beginning of Summer, between going back to Montreal, teaching a class there, attending a Summer school, and going to the DH2023 conference in Austria, where I presented a short paper and a long paper and organized a workshop (big up to Thibault, who was by my side through all these Austrian adventures). And all of it culminated with the comprehensive exam in the middle of August. I really wanted to share how that went.

But then, vacations, working on new deadlines, more vacations, more deadlines... And now it's already November and I don't know anymore what it was that I wanted to share about that exam. Aside from the fact that I passed it and that it's a pretty big milestone.

The comprehensive examination, which is called "Examen de synthèse" in French, is not something common in France. In France, we now have a sort of yearly evaluation called the "Comité de Suivi Individuel" (or CSI), which is not an academic evaluation but more of a check-up with your supervisors and a committee1 in charge of making sure that everything is alright. The reason I bring it up alongside the Examen de Synthèse is that I also had my first CSI this Summer (at the very end of June). In France, you have to get a positive evaluation from the CSI in order to enroll in a new year of doctoral studies. Each year. But actually, the CSI and the Examen de Synthèse are not really that comparable.

The Examen de Synthèse is a "real" examination and it happens only once during your doctoral curriculum. In my program at the University of Montréal, in 2023, it consisted of several phases.

First of all, there is a phase dedicated to the composition of the jury. I had the pleasure of being examined not only by my three supervisors (Laurent Romary, Emmanuel Chateau-Dutier and Michael Sinatra), but also by Marcello Vitali Rosati, from the University of Montréal, who acted as president, and Maxime Gohier, from the University of Quebec in Rimouski. My only regret is that I was not able to achieve better gender balance in my jury. This is something I really hope to fix for my defense, but I will probably have other occasions to discuss this topic in the future.

So, once the jury is composed, and once a calendar has been agreed on (I think that was actually the most stressful part for me because of all the other things I had going on this Summer), a countdown begins. First, I had to turn in three documents:

  • a 12-15 page-long essay on my research project;
  • a 30-reference long bibliography on the field of the Digital Humanities; and
  • a short presentation of a proposed "practical" analysis.

Then, a week later, the jury sent me a question.2 I was given one week (exactly 168 hours) to think about this question and write a response in the form of a 10-to-15-page essay. The jury then had between one and two weeks to read the response before the oral examination took place (on Zoom).

The oral examination has some similarities with a PhD defense. It started with a 20-minute presentation in which I summarized my research project (10 minutes) and presented a technical analysis (10 minutes). I chose to focus my technical presentation on an experiment I have been conducting and on which I hope to communicate more in the near future. Then, after my presentation, there were two rounds of questions about my research project, my experiment or the answer I formulated in my essay.3

I am very happy that such an examination exists in the North American program. It may seem like a lot of stress (and it is), but I found it to be a very good milestone for making progress on the formalization of a research project. The oral examination is a great opportunity to present a project to people who don't necessarily know what you have been up to, and it's a really great occasion to get feedback.

For example, in the case of my program, the question sent by the jury is meant as a way to get you to think about a topic that your research proposal does not tackle enough, or as an invitation to consider new angles. You're not expected to turn in the perfect answer, of course, with barely a week to write it. But it forces you to form an opinion and explore possible hypotheses, and it may later turn into a whole chapter of your thesis.

The comprehensive exam is a pass/no pass type of examination. There is no grade and if you fail, you can take it a second time. Like I said at the beginning of this post, I passed. Therefore, starting from Fall 2023, I am able to enroll as an "en rédaction" student (writing status), which has several consequences. Some seem very symbolic: for example, in English, I can now call myself a PhD candidate instead of a PhD student. But others not so much: tuition fees for this new status are much lower than when enrolling as a full-time student, dropping from 1,440$CA/trimester to 512$CA/trimester, and I believe this status officially gives me the right to teach at the graduate level.

The comprehensive exam also marks the end of the phase during which I had to take courses. Now, with this new status, I am invited to focus solely on writing my thesis, which opens up a whole new chapter for my PhD curriculum.4


  1. I want to take this occasion to also thank Ariane Pinche and Joana Casenave, who were willing to be the members of my committee for the CSI, for their precious feedback! :) 

  2. The question was the following: "An important tension appears in your research project: the tension between the specificity of each project's particular needs and the desire, and the necessity, to produce generalizable approaches that can be used across several projects. Drawing on your bibliography, and focusing in particular on the case of HTR, could you analyze this tension, raising in particular the question of the literacy required (notably in data management) to be able to customize computational approaches as complex as HTR technologies?"

  3. I want to publish the documents I created for the comprehensive exam on my blog, but I need to find the best way to do it. I'll post an announcement when they are available. 

  4. Thank you Jennifer for this wonderful pun! ;) 

014 - RT(F)M for the Peraire Experiment

Turns out, there is more to say on last week's experiments on the Peraire dataset! And I found out while I was working on a completely different dataset. Let me explain!

This morning, I helped my colleague train a Kraken transcription model for Greek manuscripts. They gave me the ground truth and I set up and executed the training from the command line. It gave me an opportunity to try fine-tuning a model like CREMMA Medieval, instead of only training from scratch. CREMMA Medieval was trained on manuscripts written in Latin, whereas the Greek manuscripts were written only, well, in Ancient Greek. I didn't want the resulting model to add Latin letters to the transcription when applied to other Greek documents, so I used Kraken's option to let the model forget previously learned characters and force it to only remember the characters contained in the new training data. This option is called --resize (check the documentation here).

When I fine-tune a model, I usually follow Kraken's recommendations and keep both the previously learned characters and the new ones coming from the new set of ground truth. When, this morning, I checked which keyword to use to keep only the characters from the new dataset, I realized that I hadn't correctly set up the training on Peraire last week: I had set it to only keep the new characters!

Up until Kraken v. 4.3.10, --resize could take the keywords both or add. The ambiguity of these keywords has been discussed in the past, which is why, starting with Kraken v. 4.3.10, they respectively became new and union.

Let's quote the manual:

There are two modes dealing with mismatching alphabets, add and both. add resizes the output layer and codec of the loaded model to include all characters in the new training set without removing any characters. both will make the resulting model an exact match with the new training set by both removing unused characters from the model and adding new ones.

I fell for this trap of ambiguity and used both instead of add, thinking both meant I was keeping both character sets. (Again, this is the very reason why the keywords were recently changed.)
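To make the difference concrete, here is a toy illustration of what the two strategies do to the model's character inventory (its codec). This is only a schematic view of the behavior described in the manual, not Kraken's actual code:

```python
# Toy illustration of the two --resize strategies (not Kraken's implementation).
# The base model knows a Latin alphabet; the new ground truth only contains Greek.
base_codec = set("abcdefghijklmnopqrstuvwxyzéàç ")
new_ground_truth = set("αβγδεζηθικλμνξο ")

# 'add' (renamed 'union' since Kraken 4.3.10): keep everything the base model
# already knows and add the characters found in the new training set.
union_codec = base_codec | new_ground_truth

# 'both' (renamed 'new' since Kraken 4.3.10): make the model match the new
# training set exactly, dropping base characters absent from the new data.
new_codec = new_ground_truth

print(sorted(base_codec - new_codec))   # characters forgotten with 'both'/'new'
print(len(union_codec), len(new_codec)) # size of the codec in each mode
```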

Side note: you should really read last week's post to fully understand the rest of this post!

At the end of my post last week, I wrote:

peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

But how could I be sure that it was not actually due to the fact that the model had unlearned some precious characters?

The only way to know, I thought, was to re-train the models! I used this opportunity to also train the models from scratch because I was curious to see how much noise/improvement was brought by the base model.

I tried 4 types of models and, like last week, used CERberus 🐶🐶🐶 to measure the character error rates on the predictions made on the test sets:

  1. Models trained "from scratch"
  2. A model not trained on any data coming from the Peraire dataset (aka Manu McFrench)
  3. Models obtained from finetuning Manu McFrench using the add resize mode
  4. Models obtained from finetuning Manu McFrench using the both resize mode

For each model trained on the Peraire dataset, I used 3 compositions:

  1. the full dataset ("ALL")
  2. only data coming from the B series ("B")
  3. only data coming from the D series ("D")

I used the same composition system for the test sets.

Here are my results in the form of a table:

a table of the scores obtained with the different train set, test set and resize configurations

Fortunately, it seems that my previous interpretation is not fully contradicted by the results I obtained with this second series of trainings. Let's focus on two observations:

  1. Whenever a model is trained only on the D series and tested only on the B series, it appears to be completely incapable of predicting anything but gibberish, losing between 32 and 35 points of accuracy. It confirms that the appearance of the documents from the two series is too different. On the other hand, when the model is fine-tuned on the B series only, it maintains a fairly good accuracy when applied to the D series, whichever resize mode is used. I think it confirms that the B series is enough for the model to learn some sort of formal features of Peraire's handwriting, which it can then transfer to documents written with a different writing instrument.

  2. What is very interesting is the difference between the models trained on the whole dataset and tested on the B series: when we use the both resize mode (meaning we only keep the characters from the new dataset), the model is very good. On the contrary, the performance of the model trained with the add resize mode (meaning we keep the output layer and the codec from the base model and add the new characters) is as bad as that of a model trained only on the D series.

In my previous post, I wrote:

peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.

However, in light of my experiment with the resize option, I think this is not correct. Instead, it appears that resetting the output layer by accidentally using both (or new) allowed the model to better take into account the data from the B series (pencil). Contrary to what I observed last week, the model trained on the whole dataset, but this time with the add (or union) resize mode, doesn't benefit from seeing more data compared to the model trained only on the B series.

My understanding is that keeping the output layer from the base model with add (or union) probably drowns the specificity of the pencil-written documents in a base knowledge tailored to handle documents with high contrast (like the ones in the D series and in Manu McFrench's training set). Or, to put it differently, when we use both (or new), more attention is given to the pencil-written documents, meaning that the model actually gets better at handling this category of data.

I am extremely curious to see how I can investigate this further, or if any of you, readers, would understand these results differently!

013 - The Peraire experiment

WARNING: in my next post, I nuance the conclusions drawn in this post, because of a parameter I didn't correctly set during the training of the models described below. You should really read it after reading this post, to get the full picture!

As a small side project during my PhD, I have been sharing my expertise (and a bit of my labor) with the members of the DIM SPE-VLP project. The acronym stands for "Sauver le patrimoine espérantiste : le voyage de Lucien Péraire (1928-1932)" ("Saving the Esperantist heritage: Lucien Péraire's journey, 1928-1932"). The project revolves around the digitization, transcription and edition/valorization of Lucien Peraire's archives. He was a French citizen who, in the late 1920s, travelled across the European and Asian continents, mostly by bike, using Esperanto to communicate. He kept a diary during his journey (and later published a book about his adventures). His notes are written both in French and in Esperanto, and in some documents he also used stenography.

My contribution to the project has mostly consisted in helping develop transcription models for the French diaries (although I'm also interested in the shorthand and the Esperanto). This meant both helping with the production of ground truth and training Kraken models. This post will briefly explain how the ground truth was created and published, and present the models that were trained with it.

Peraire's notebooks are organized in different series, and each series is divided into ensembles grouping the pages of a notebook. Each ensemble is named after the countries visited while the notebook was in use. For example, notebook 11 in the B series forms one ensemble and covers a part of Peraire's travels in Japan. There are 31 notebooks in the B series. The notebooks of this series are written with a blue pencil on (low-quality) school paper. On some pages, the pencil is very faded, which makes it hard to read the text, let alone to run a successful segmentation task on the image. The D series, on the other hand, gathers notes and comments on the diaries, written at the end of the 1960s. This time the handwriting is much easier to read because Peraire mostly used a blue or black ball-point pen. There are 9 ensembles in this series.

two extracts of Peraire's notebooks side by side, on the left the image is taken from the B series, on the right the image is taken from the D series.

One aspect that I find particularly interesting with this dataset is that we have a case where the handwriting is similar but the writing tool is different. It means that it is possible to explore how the writing tools and/or writing supports affect the efficiency of a transcription model. On top of that, all the documents were digitized under the same (good) conditions and by the same people.

Segmenting, transcribing, aligning and publishing

The first version of the dataset was solely focused on the B series. I selected 1 random page from each ensemble (avoiding the first page each time) to compose a train set of 33 files1. On top of that, I selected 4 additional pages from B3, B5, B12 and B18 to compose a fixed test set which would never be used as training data.
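For the curious, the selection logic amounted to something like the sketch below; the directory layout and file names are made up, only the idea of picking one random page per ensemble while skipping the first page reflects the procedure described above.

```python
# Hypothetical sketch of the page sampling (paths and file names are invented).
import random
from pathlib import Path

random.seed(0)  # fix the seed so the selection can be reproduced
train_pages = []
for ensemble in sorted(Path("peraire/serie_B").iterdir()):  # e.g. B1, B2, ..., B31
    pages = sorted(ensemble.glob("*.jpg"))
    train_pages.append(random.choice(pages[1:]))  # never take the first page

print(len(train_pages), "pages selected for the train set")
```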

I pre-segmented the images with Kraken's default model before correcting the result manually. At this point, I also applied the segmOnto ontology for the lines and regions2. Because of the fading ink, some words could not be transcribed. In order to avoid complicating the transcription rules, I decided to simply segment out the passages that couldn't be read. On the one hand it simplifies the transcription, but on the other hand, it means that a small portion of my segmented documents cannot be re-used by others to train a segmentation model. Since we were not training a segmentation model, it was an easy decision.

screenshot showing the segmentation and the transcription panels from eScriptorium where we can see that some lines are broken down into several segments and that some segments were left blank

More recently, it was decided to augment the dataset with examples from the D series because the model trained on the B series was not good enough. This time, Gilles Pérez, a member of the project, took charge of the transcription. I recommended creating a new sample of 30 to 40 images, so he randomly selected a series of 4 continuous pages from each ensemble. The transcription of the corresponding 36 pages was sent to me as a Word document. Therefore, on top of taking care of the segmentation of the images, I also went through an alignment phase during which I verified the order of the lines and copy-pasted the transcription. It took longer than I expected, but it allowed me to make the transcription consistent with the rules I had followed when creating the first set. I also picked 4 of the 36 pages to add to the test set.

The dataset is versioned and published following the principles and tools we developed within the framework of HTR-United. I also added illustrated segmentation and transcription guidelines.

Testing different dataset configurations to train transcription models

As I mentioned before, the goal of these datasets was to create transcription models. Taking advantage of the recent update of the dataset, I tried different training scenarios.

I never trained a model from scratch because the dataset is too small to get any sort of usable model. Instead, I used Manu McFrench as a base model and fine-tuned it with the Peraire dataset. (We were actually able to use Peraire as an example during the DH2023 conference3 earlier this month to show the usefulness of having this kind of base model.) I tested fine-tuning on the B series only, on the D series only, and on both the B and the D series. Then I used a B-series-only test set, a D-series-only test set and the full test set to see how the models performed.

Since I wanted to try it after discovering it during DH2023, I used CERberus 🐶🐶🐶 (I talked about it in my last post) to measure the accuracy of the models on the test sets listed above.

Like KaMI, CERberus takes two categories of text input: the reference (aka the ground truth) and the prediction (i.e. the hypothesis made by the model). In order to get the predictions, I loaded my models into eScriptorium, along with the images and transcriptions of the test set, before applying each model to the documents. This way, all the transcriptions are predicted from the same segmentation, which comes from the ground truth.
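For readers who have never computed it by hand, here is a minimal sketch of what a character error rate boils down to: the edit distance between the reference and the prediction, divided by the length of the reference. This is only an illustration of the metric, not CERberus's actual code.

```python
# Minimal illustration of a CER computation (not CERberus's implementation).

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, prediction: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    return levenshtein(reference, prediction) / max(len(reference), 1)

# Two substitutions over 27 reference characters: CER ≈ 0.074 (7.4%).
print(cer("le voyage de Lucien Peraire", "le voiage de Lucien Péraire"))
```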

Here are the results:

  • Manu McFrench, before fine-tuning, gets a CER of 26.16% when tested on the whole test set, and a score of 27.19% on the documents from the B series, 25.29% on the D series.
  • peraire_both, trained on the B and the D series, gets a CER of 4.63% when tested on the whole test set, but a score of 6.41% on the documents from the B series and 3.54% on the D series.
  • peraire_B, trained only on the B series, gets a CER of 8.72% on the whole test set, but a score of 7.12% on test-B and 9.67% on test-D.
  • peraire_D, trained only on the D series, gets a CER of 16.38% on the whole test set, but this is because of the enormous discrepancy between its scores on each sub test set: it skyrockets to a CER of 38.53% on test-B while going as low as 3.65% on test-D.

All of this makes sense, though.

  1. Manu McFrench could not be used without fine-tuning; its error rate on both series is too high.
  2. peraire_both is able to generalize from seeing both datasets and even benefits from seeing more data thanks to the D series, since it performs better on the B series compared to peraire_B.
  3. peraire_B, which was trained on the more difficult dataset, seems to use the knowledge inherited from Manu McFrench and to have learned some formal features of Peraire's handwriting, since it is able to maintain a fairly low CER on the D series (it gains 16 points of accuracy compared to Manu McFrench).
  4. peraire_D on the other hand seems to lose it completely on the B series. This is most likely due to the fact that the contrast between the page and the "ink" is too low in the pencil-written series compared to the data used to train Manu McFrench and in the D series. peraire_D even loses 11 points of accuracy to Manu McFrench!

What happens with peraire_D is very interesting because it confirms that it is useful to compose a train set with examples of more difficult documents instead of only showing the ones that are easy to read! Now, the nice thing is that I will soon be working on a little experiment with my colleague Hugo Scheithauer where we will be able to measure the impact of the contrast between the ink and the paper. Stay tuned!

EDIT #1: I added the scores obtained by Manu McFrench alone.

EDIT #2: I added a disclaimer at the beginning of the post.


  1. I used 2 images from B2 because one of them was extremely faded and I wanted to include some of these extreme cases in the dataset, and 2 images from B30 because it consisted of shorter lines (a table of contents), which I found interesting to include. 

  2. As described in the documents, I only used the "InterlinearLine" and "DefaultLine" for the lines, and the "MainZone" and "NumberingZone" for the regions. 

  3. See the submission and the slides on HAL: https://inria.hal.science/hal-04094241