Abstract
<jats:p>This paper presents an algorithm for the automatic generation of thematic tests, illustrated with English language tests delivered through a mobile application, that uses counterfactual analysis to improve test quality. A detailed analysis of the language-learning domain led to clear requirements for the future service. Key forms of knowledge assessment were classified, along with descriptions of typical exercises and the difficulty levels at which they are used, producing a comprehensive picture of the skills that require step-by-step assessment. The shortcomings of existing tests are highlighted: ambiguous wording, multiple correct answers, and labor-intensive item selection. The paper develops and validates a comprehensive approach to assessing the effectiveness of prompts for generating grammar tests with Large Language Models. At its core is a counterfactual algorithm that identifies the latent features actually influencing the model's choice of grammatical structures, selectively modifies the prompt, and evaluates the changes with three complementary metrics. Applying the algorithm showed that adding explicit indications of the most significant hidden features increases the model's sensitivity to the key factors of the task. Subsequent re-evaluation with the developed metrics and independent expert review confirmed a statistically significant improvement (p &lt; 0.01) in both grammatical correctness and compliance with the task structure: the average score rose from 0.91 to 0.95. Counterfactual analysis is thus an effective tool for fine-tuning prompts; the improved prompt yields more reliable generation of test materials that meet educational standards and lays the groundwork for scaling the algorithm to other task types and language skills.</jats:p>
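The counterfactual refinement loop summarized above (identify candidate hidden features, modify the prompt counterfactually, keep changes that improve a quality metric) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature names, the `generate` stand-in for an LLM call, and the single scalar metric are all assumptions introduced for illustration (the paper itself uses three complementary metrics and expert review).

```python
# Hypothetical sketch of a counterfactual prompt-refinement loop.
# All names, features, and the scoring function are illustrative
# assumptions, not the paper's actual method or metrics.

def generate(prompt):
    # Stand-in for an LLM call plus quality metric: here we simply score
    # a prompt by the fraction of task-relevant hidden features it makes
    # explicit. A real system would generate a test and score its quality.
    features = ["tense", "difficulty_level", "single_correct_answer"]
    return sum(f in prompt for f in features) / len(features)

def counterfactual_refine(base_prompt, candidate_features):
    """For each candidate latent feature, build a counterfactual prompt
    that mentions it explicitly and keep the change only if the quality
    metric improves."""
    best_prompt, best_score = base_prompt, generate(base_prompt)
    for feature in candidate_features:
        variant = f"{best_prompt} Make the {feature} explicit."
        score = generate(variant)
        if score > best_score:  # keep only changes that raise quality
            best_prompt, best_score = variant, score
    return best_prompt, best_score

prompt, score = counterfactual_refine(
    "Generate a multiple-choice grammar question.",
    ["tense", "difficulty_level", "single_correct_answer"],
)
print(round(score, 2))  # prints 1.0 in this toy setup
```

The design choice mirrored here is greedy acceptance: each counterfactual edit is kept only when the metric improves, so the final prompt contains exactly the explicit feature indications that the model is sensitive to.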