Abstract
<jats:p>This study presents a preliminary outline of my doctoral research topic. My goal is to develop a research methodology that supports the study of calendar/regesta-based source editions by means of artificial intelligence and text-mining tools. The source base is the corpus of charters in the Anjou-kori Oklevéltár (Documents of the Angevin Period in Hungary, AÓkl) issued between 1301 and 1342, that is, during the reign of Charles I of Hungary. This corpus was completed in 2025 after 35 years of work. The study describes the main characteristics of the calendars/regesta published in the Anjou-kori Oklevéltár, as well as the challenges and limitations of text cleaning prior to digital processing. Particular attention is paid to so-called “dirty data”, which results from poorly executed text cleaning. Possible approaches to text preprocessing are also discussed. I briefly outline the fundamental characteristics of text-mining methods and describe the procedures (named entity recognition, n-gram analysis, topic modeling, and TF–IDF) that I intend to use in developing the methodology. Finally, a case study drawn from my own research concisely illustrates these problems and possibilities.</jats:p>