Abstract
<jats:p>INTRODUCTION: In recent years, large language models (LLMs) have found widespread application in healthcare. However, their effectiveness on specialized tests in Russian, particularly in anesthesiology and critical care, remains poorly studied.
OBJECTIVE: To evaluate the performance of LLMs on single-answer multiple-choice questions in Russian on anesthesiology and critical care, compared with the results of resident physician teams from the "Professionals" competition at the Forum of Anaesthesiologists and Reanimatologists-2025.
MATERIALS AND METHODS: We conducted a comparative study of responses to 30 test items from the qualifying stage of the "Professionals" competition. Results from 38 resident teams were compared against answers from the following LLMs: Generative Pre-trained Transformer (GPT)-4o, GPT-5, Alisa AI, DeepSeek V-3.2, GigaChat, Gemini 2.5 Flash, and Qwen3-Max. Comparison methods included rank analysis, pairwise comparison (win rate), agreement assessment (Cohen's κ coefficient), and correlation analysis (φ coefficient).
RESULTS: The median score of the participating teams was 24.5 out of 30, with one team achieving the maximum score (30/30). Four models (GPT-4o, GPT-5, DeepSeek V-3.2, Gemini 2.5 Flash) demonstrated 100% accuracy (30 points), sharing the first rank with the leading team. These models reached the 97th percentile, outperforming 37 of 38 participating teams. Qwen3-Max and Alisa AI scored 29.9 and 29 points, respectively, ranking first and second in the overall rating (97th and 92nd percentiles). GigaChat provided no answers. The win rate of the LLMs against a randomly selected team ranged from 0.97 to 1.00. Near-perfect agreement was observed among the leading models (κ = 1.00), with very high correlation between their answers and the majority choice of the residents (φ ≈ 1.00). No statistically significant differences were found between the LLMs' results (p > 0.05).
CONCLUSIONS: Modern large language models demonstrate high accuracy on standardized tests in anesthesiology and critical care in Russian, substantially exceeding the median performance of resident physician teams.</jats:p>