Abstract
<jats:p>Reliable transaction-level anti-money-laundering (AML) systems require not only accurate ranking of suspicious activity but also well-calibrated probabilistic risk estimates to support operational decision-making and risk prioritisation. This work empirically evaluates calibrated gradient boosting models with temporal behavioural features, focusing on how probability calibration and behavioural feature modelling jointly affect probability reliability and operationally relevant performance metrics. The proposed framework integrates gradient boosting algorithms (CatBoost and XGBoost) with post-hoc probability calibration techniques (Platt scaling and isotonic regression) and behavioural features computed over rolling time windows of 1, 7, and 30 days to capture both short-term volatility and longer-term transaction patterns. Experimental evaluation is conducted on the publicly available IBM Anti-Money Laundering (IBM–AML) dataset, which contains approximately seven million synthetic transactions simulating realistic banking activity. Model performance is assessed using ranking-, calibration-, and operationally oriented metrics, including AUPRC, AUROC, Brier score, Precision@Top k, and Lift@Top k. Among the evaluated baselines, the calibrated CatBoost model achieved the strongest overall balance between ranking quality and probability reliability (AUPRC = 0.367, AUROC = 0.959, Brier = 0.0062, Lift@1% = 40.9). The results indicate that probability calibration improves the reliability of predicted risk scores without degrading ranking performance, while temporal behavioural features contribute to improved detection sensitivity in highly imbalanced settings. This study provides benchmark-based empirical evidence regarding the role of probability calibration and behavioural dynamics in AML systems and offers a reproducible experimental pipeline for future comparative research.
The applicability of the findings to real-world AML systems is discussed in light of the synthetic nature of the dataset, and directions for further validation on real banking data are outlined.</jats:p>
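<jats:p>As an illustration of the calibration and operational metrics named above, the following minimal sketch shows one common way to compute the Brier score, Precision@Top k, and Lift@Top k from binary labels and predicted risk scores. The function names, the use of a top fraction rather than an absolute k, and the rounding rule are assumptions made for this example; it is not the evaluation code used in the study.</jats:p>

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_and_lift_at_top(y_true, scores, fraction):
    """Precision and lift among the highest-scored `fraction` of transactions.

    Lift is precision in the top slice divided by the overall positive rate,
    so a lift of 40 means the slice is 40 times richer in positives than
    a random sample of the same size.
    """
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    n = len(y_true)
    k = max(1, round(n * fraction))  # assumed rounding rule; at least one item

    # Indices of the k highest scores, in descending score order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]

    precision = sum(y_true[i] for i in top) / k
    base_rate = sum(y_true) / n
    lift = precision / base_rate if base_rate > 0 else float("nan")
    return precision, lift


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the IBM-AML dataset.
    labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    risk = [0.9, 0.1, 0.2, 0.3, 0.8, 0.05, 0.4, 0.15, 0.25, 0.35]
    print(precision_and_lift_at_top(labels, risk, 0.2))
    print(brier_score(labels, risk))
```

<jats:p>Under this definition, the reported Lift@1% corresponds to calling the second function with a fraction of 0.01 on the test-set scores.</jats:p>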