Abstract
<jats:p>The paper presents a pipeline for building a high-quality, multi-layer annotated corpus for Kazakh NLP. The pipeline covers the complete workflow: large-scale targeted web crawling of prominent Kazakh-language news outlets, aggressive text cleaning, automated pre-annotation with transformer-based language models, linguist-in-the-loop validation through a custom-designed annotation interface, and quality assurance via inter-annotator agreement (IAA). Particular emphasis is placed on the agglutinative character of Kazakh and its extensive morphological variation, which pose distinct challenges for annotation and model training. The final release contains 20,000 documents, 350,000 sentences, and 6.2 million tokens spanning five domains (news, politics, science, education, culture), annotated with part-of-speech (POS) tags, morphological features, named entities (NER), and Universal Dependencies (UD) syntactic relations, establishing a foundational dataset for downstream NLP applications. The paper reports IAA per annotation layer and baseline model performance on POS tagging, NER, and dependency parsing to demonstrate the corpus's quality and utility, and examines key obstacles in the annotation process, including maintaining consistency, assessing inter-annotator agreement, and evaluating the adaptability and functionality of the annotation tools. The corpus, code, and annotation guidelines are released to support reproducible research. This work provides a reproducible and flexible methodology for constructing corpora in low-resource, morphologically complex languages.
Its objective is to promote further research, tool development, and technological progress in Kazakh language processing, thereby advancing the broader goal of multilingual NLP inclusion.</jats:p>