Abstract
<jats:p>INTRODUCTION: In recent years, large language models (LLMs) have found widespread application in healthcare. However, their effectiveness on specialized tests in Russian, particularly in anesthesiology and critical care, remains poorly studied.
OBJECTIVE: To evaluate the performance of LLMs on single-answer multiple-choice questions in Russian on anesthesiology and critical care, compared with the results of resident physician teams from the "Professionals" competition at the Forum of Anaesthesiologists and Reanimatologists-2025.
MATERIALS AND METHODS: We conducted a comparative study of responses to 30 test items from the qualifying stage of the "Professionals" competition. Results from 38 resident teams were compared against answers from the following LLMs: Generative Pre-trained Transformer (GPT)-4o, GPT-5, Alisa AI, DeepSeek V-3.2, GigaChat, Gemini 2.5 Flash, and Qwen3-Max. Comparison methods included rank analysis, pairwise comparison (win rate), agreement assessment (Cohen's κ coefficient), and correlation analysis (φ coefficient).
RESULTS: The median score of the participating teams was 24.5 out of 30, with one team achieving the maximum score (30/30). Four models (GPT-4o, GPT-5, DeepSeek V-3.2, Gemini 2.5 Flash) demonstrated 100% accuracy (30 points), sharing the first rank with the leading team. These models reached the 97th percentile, outperforming 37 of 38 participating teams. Qwen3-Max and Alisa AI scored 29.9 and 29 points, respectively, ranking first and second in the overall rating (97th and 92nd percentiles). GigaChat provided no answers. The win rate of the LLMs against a randomly selected team ranged from 0.97 to 1.00. Near-perfect agreement was observed among the leading models (κ = 1.00), with very high correlation between their answers and the majority choice of the residents (φ ≈ 1.00). No statistically significant differences were found between the LLMs' results (p > 0.05).
CONCLUSIONS: Modern large language models demonstrate high accuracy on standardized tests in anesthesiology and critical care in Russian, substantially exceeding the median performance of resident physician teams.</jats:p>