Abstract
<jats:p>The paper presents a pipeline for building a high-quality, multi-layer annotated corpus for Kazakh NLP. The pipeline covers the complete workflow: large-scale targeted web crawling of prominent Kazakh-language news outlets, aggressive text cleaning, automated pre-annotation with transformer-based language models, linguist-in-the-loop validation through a custom-designed annotation interface, and quality assurance via inter-annotator agreement (IAA). Particular emphasis is placed on the agglutinative character of Kazakh and its extensive morphological variation, which pose distinct challenges for annotation and model training. The final release contains 20,000 documents, 350,000 sentences, and 6.2 million tokens spanning five domains (news, politics, science, education, culture), annotated with part-of-speech (POS) tags, morphological features, named entities (NER), and Universal Dependencies (UD) syntactic relations, establishing a foundational dataset for downstream NLP applications. The paper reports IAA per annotation layer and baseline model performance on POS tagging, NER, and dependency parsing to demonstrate the corpus's quality and utility, and examines key obstacles in the annotation process, including maintaining consistency, assessing inter-annotator agreement, and evaluating the adaptability and functionality of the annotation tools. The corpus, code, and annotation guidelines are released to support reproducible research. This work provides a reproducible and flexible methodology for constructing corpora in low-resource, morphologically complex languages.
Its objective is to promote further research, tool development, and technological progress in Kazakh language processing, thereby advancing the broader goal of multilingual NLP inclusion.</jats:p>