Benchmarking Calibrated Gradient Boosting with Behavioural Features for Transaction-Level AML Detection

Authors: T.K. Azbergen, A.O. Tleubayeva, A.T. Aituarov et al.

Publication: Вестник КазУТБ

Published: Mar 30, 2026

Source: Crossref

Abstract

Reliable transaction-level anti-money-laundering (AML) systems require not only accurate ranking of suspicious activity but also well-calibrated probabilistic risk estimates to support operational decision-making and risk prioritisation. In this context, this work provides an empirical evaluation of calibrated gradient boosting models with temporal behavioural features, focusing on the combined impact of probabilistic calibration and behavioural feature modelling on probability reliability and operationally relevant performance metrics. The proposed framework integrates gradient boosting algorithms (CatBoost and XGBoost) with post-hoc probability calibration techniques (Platt scaling and isotonic regression) and behavioural features computed over rolling time windows of 1, 7, and 30 days to capture both short-term volatility and longer-term transaction patterns. Experimental evaluation is conducted on the publicly available IBM Anti-Money Laundering (IBM–AML) dataset, which contains approximately seven million synthetic transactions simulating realistic banking activity. Model performance is assessed using ranking-, calibration-, and operationally oriented metrics, including AUPRC, AUROC, Brier score, Precision@Top k, and Lift@Top k. Among the evaluated baselines, the calibrated CatBoost model achieved the strongest overall balance between ranking quality and probability reliability (AUPRC = 0.367, AUROC = 0.959, Brier = 0.0062, Lift@1% = 40.9). The results indicate that probability calibration improves the reliability of predicted risk scores without degrading ranking performance, while temporal behavioural features contribute to improved detection sensitivity in highly imbalanced settings. This study provides benchmark-based empirical evidence regarding the role of probability calibration and behavioural dynamics in AML systems and offers a reproducible experimental pipeline for future comparative research. The applicability of the findings to real-world AML systems is discussed in light of the synthetic nature of the dataset, and directions for further validation on real banking data are outlined.
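
The calibration- and operationally oriented metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: Brier score as the mean squared error between predicted probability and the binary label, Precision@Top k as the share of true positives among the highest-scored fraction of transactions, and Lift@Top k as that precision divided by the overall positive rate. The function names and the toy labels and scores below are hypothetical and are not taken from the paper or from the IBM–AML dataset.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_at_top(y_true: Sequence[int], y_prob: Sequence[float], frac: float) -> float:
    """Share of positives among the top `frac` fraction of items ranked by score."""
    n_top = max(1, int(round(frac * len(y_true))))
    # Rank transactions from highest to lowest predicted risk.
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n_top]) / n_top


def lift_at_top(y_true: Sequence[int], y_prob: Sequence[float], frac: float) -> float:
    """Precision in the top `frac` divided by the overall positive (base) rate."""
    base_rate = sum(y_true) / len(y_true)
    return precision_at_top(y_true, y_prob, frac) / base_rate


# Hypothetical toy data: 10 transactions, 2 of them labelled suspicious.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.1, 0.2, 0.6, 0.05, 0.3, 0.1, 0.05, 0.2, 0.1]

brier = brier_score(labels, scores)
prec_top20 = precision_at_top(labels, scores, 0.20)
lift_top20 = lift_at_top(labels, scores, 0.20)
```

Under these definitions, a lift of 40.9 at the top 1% would mean that the highest-scored 1% of transactions contains suspicious cases at roughly 40.9 times the dataset-wide rate. Because the Brier score depends on the probability values themselves while the top-k metrics depend only on their ordering, a monotone post-hoc calibration step can change the former while leaving the latter essentially intact.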

Keywords

behavioural, calibration, probability, ranking, systems
