By Daniel Rongo, MD; Scott Ryals, MD; Mattina A. Davenport, PhD; and Trung Le, PhD, on behalf of the AASM Artificial Intelligence in Sleep Medicine Committee

As artificial intelligence (AI) becomes integrated into the clinical workflow of sleep medicine, the clinician’s role is evolving. Clinicians should understand how AI models function and be able to judge whether they perform as intended. Advanced expertise in computer science is not required to advocate for our patients. A working knowledge of the AI lifecycle, performance metrics, and common statistical pitfalls is sufficient to avoid being misled by impressive-looking numbers. These skills help ensure that AI improves care as the field of sleep medicine enters a new era.

The AI model lifecycle

Every AI model begins with defining a problem or objective. Relevant data are collected and preprocessed so that the model can learn patterns and predict outcomes (i.e., supervised learning). The model is then evaluated on separate data not used during training (i.e., held-out validation or test data). Demonstrating generalizability across diverse patient populations, institutions, and clinical settings is essential before deployment. Deployment is not the endpoint. Continuous monitoring is required to detect performance drift, bias, and unintended clinical consequences. One loop of this iterative lifecycle is shown in Figure 1. Ongoing oversight by scientists, administrators, and clinicians is necessary to maintain AI model safety and usefulness.

Figure 1: The AI model lifecycle

Managing metrics of performance

AI models optimize predictions based on a chosen objective. The challenge is deciding which performance metrics matter most clinically. Consider predicting a rare disease (e.g., narcolepsy) with a prevalence of 1%. Accuracy alone is meaningless in rare conditions: a model that labels every patient as disease-free would be 99% accurate yet would detect no cases. In this setting, minimizing false negatives is critical. The preferred metric becomes recall (sensitivity): the ability to detect true cases. Model training involves optimizing toward specific performance targets. However, optimizing one metric introduces tradeoffs. A model tuned for recall may increase false positives. Selecting the correct metric requires aligning performance goals with clinical risk.
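The arithmetic behind the rare-disease example can be sketched in a few lines of Python. The cohort size of 10,000 and the "always negative" model are assumptions chosen for illustration; only the 1% prevalence comes from the text.

```python
# Hypothetical cohort: 10,000 patients, 1% with narcolepsy (prevalence from the text).
n_patients = 10_000
n_cases = 100

# A trivial "model" that labels every patient as negative.
true_positives = 0
false_negatives = n_cases
true_negatives = n_patients - n_cases
false_positives = 0

accuracy = (true_positives + true_negatives) / n_patients
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

The model looks excellent by accuracy and is clinically useless by recall, which is why the choice of headline metric matters.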

A major concern is overfitting, in which a model memorizes noise in the training data and then fails on new patients. True validation requires testing on unseen, real-world data. Strong performance during development does not guarantee safe clinical performance.
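A minimal sketch of overfitting, under stated assumptions: the data are synthetic, 20% of labels are deliberately flipped to act as noise, and the "model" is a one-nearest-neighbor lookup chosen only because it memorizes its training set. None of these choices come from the article; they simply make the gap between development and held-out performance visible.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_patients(n, noise=0.2):
    """Synthetic (feature, label) pairs; `noise` is the share of flipped labels."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # simulate a mislabeled record
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Share of records in `data` whose label matches the 1-NN prediction."""
    correct = sum(predict_1nn(train, x) == label for x, label in data)
    return correct / len(data)


train = make_patients(200)
test = make_patients(200)  # "new patients" never seen during training

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

Because each training record is its own nearest neighbor, training accuracy is perfect, noise included. The held-out figure is the one that says something about new patients.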

Different metrics for different models

No single metric applies to every task. The appropriate metric depends on the clinical question. Table 1 summarizes common sleep medicine applications and the metrics most relevant to each. These are not exhaustive but provide a framework for evaluating claims about AI model performance.

Table 1: Examples of tasks for sleep medicine and the metrics that matter 

| Type of Model | Example (Sleep Medicine) | Metrics |
| --- | --- | --- |
| Classification | Predicting epoch sleep stage or CPAP noncompliance | Accuracy, precision, recall, F1 score, area under the curve of a receiver operating characteristic (AUC ROC), Cohen’s kappa, Matthews correlation coefficient (MCC), calibration |
| Regression | Predicting apnea–hypopnea index (AHI) or total sleep time from physiologic or wearable data | Mean absolute error (MAE), root mean square error (RMSE), R², Pearson r, Bland–Altman bias |
| Time-series / signal processing | Detecting respiratory events or sleep stages (K-complex, sleep spindle, frequencies) from raw EEG/airflow signals using Fourier transform, wavelet transform, recurrent neural networks (RNNs), or transformers | Event-level sensitivity/specificity, AUC ROC, Cohen’s kappa vs. expert scoring, correlation coefficients, per-epoch accuracy |
| Causal inference models | Estimating the effect of PAP therapy on daytime sleepiness after controlling confounders | Average treatment effect (ATE), conditional ATE (CATE), overlap diagnostics, balance metrics |
| Generative AI / large language models | AI scribe systems for note writing | Hallucination rate, omission rate, attribution error rate, critical error rate, note acceptance rate, median time saved per note |
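As one illustration of the regression row in Table 1, the sketch below computes MAE, RMSE, and Bland–Altman bias with 95% limits of agreement for a model that estimates AHI. The five paired values are invented for the example and do not come from any study.

```python
import math
import statistics

# Hypothetical paired AHI values (events/hour): reference PSG vs. model estimate.
ahi_reference = [4.0, 12.0, 18.0, 33.0, 47.0]
ahi_predicted = [6.0, 10.0, 21.0, 30.0, 40.0]

# Signed error for each patient (predicted minus reference).
errors = [p - r for p, r in zip(ahi_predicted, ahi_reference)]

# Mean absolute error: average size of the miss, ignoring direction.
mae = statistics.mean([abs(e) for e in errors])

# Root mean square error: penalizes large misses more heavily than MAE.
rmse = math.sqrt(statistics.mean([e * e for e in errors]))

# Bland-Altman bias: average signed difference (systematic over/underestimation).
bias = statistics.mean(errors)

# 95% limits of agreement: bias +/- 1.96 sample standard deviations of the differences.
sd = statistics.stdev(errors)
limits = (bias - 1.96 * sd, bias + 1.96 * sd)

print(f"MAE  = {mae:.2f} events/h")   # MAE  = 3.40 events/h
print(f"RMSE = {rmse:.2f} events/h")  # RMSE = 3.87 events/h
print(f"bias = {bias:.2f} events/h")  # bias = -1.40 events/h
print(f"limits of agreement = ({limits[0]:.2f}, {limits[1]:.2f})")
```

In this toy data the largest miss occurs at the highest AHI, which pulls RMSE above MAE. The negative bias indicates that the model underestimates on average.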

Classification tasks, such as sleep staging or predicting CPAP adherence, are especially common in clinical decision-making. Table 2 describes metrics that appear frequently and deserve careful interpretation.

Table 2: Classification metrics, definitions, and considerations/pitfalls

| Metric | Definitions | Considerations/Pitfalls |
| --- | --- | --- |
| Accuracy | Proportion of all predictions that are correct | Can appear high when the model predicts the most common class (e.g., N2 sleep). Misleading in imbalanced datasets and should never be interpreted alone. |
| Sensitivity / Recall (R) | True positive rate; proportion of actual cases correctly identified | Critical for patient safety when missing disease has high consequences (e.g., OSA). A model can have high accuracy but dangerously low recall. |
| Precision (P) / PPV | Proportion of predicted positives that are truly positive | Optimizing precision alone can underdiagnose disease. Low precision increases false positives and strains patient and clinician resources. |
| F1-score | Harmonic mean of precision and recall | Summary metric balancing false positives and false negatives. Useful as a secondary score but can mask poor recall in rare conditions. |
| Matthews Correlation Coefficient (MCC) | Balanced measure using all four confusion matrix elements (TP, TN, FP, FN*) | Strong metric for imbalanced classes and event detection. Often more informative than accuracy but less familiar to clinicians. |
| Cohen’s kappa (κ) = (Po − Pe) / (1 − Pe) | Agreement between model and human scoring beyond chance** | Useful for comparing AI with expert PSG scoring. Reflects clinical agreement, not overall model optimization. Should be interpreted alongside other metrics. |
| Area under the curve of a receiver operating characteristic (AUC ROC) | Ability to discriminate across decision thresholds | Can appear high while still missing clinically important cases. Must be interpreted with recall and class balance. |
| Area under the precision-recall curve (AUC-PR) | Precision-recall performance emphasizing the positive class | More sensitive to rare events than AUC ROC. Drops sharply with poor recall or precision; better reflects performance in imbalanced sleep datasets. |

*TP = true positive, TN = true negative, FP = false positive, FN = false negative

**Po = observed agreement = accuracy (for the same unit), Pe = expected agreement by chance based on label marginals
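The threshold-based metrics in Table 2 all derive from the same four confusion matrix counts. The sketch below computes them with the standard formulas, including Cohen’s kappa as (Po − Pe) / (1 − Pe). The function name `classification_metrics` and the screening counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute common binary classification metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient uses all four cells.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: Po is observed agreement (accuracy); Pe is the agreement
    # expected by chance from the row and column totals (label marginals).
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical screening result: 1,000 patients, 10 true cases (1% prevalence).
metrics = classification_metrics(tp=6, tn=970, fp=20, fn=4)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

With these counts, accuracy is 0.976 while recall is 0.60 and precision is about 0.23. MCC (about 0.36) and kappa (about 0.32) are far less flattering than accuracy, consistent with the pitfalls listed in Table 2.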

Conclusion

AI models will increasingly shape how sleep clinicians diagnose, prognosticate, and document care. An AI model’s performance depends on its objective and on the data used to train it. Metrics must be interpreted in the context of clinical risk to avoid bias, overdiagnosis, and missed treatment. Healthy skepticism means checking that the highlighted metric matches the clinical stakes. Clinicians do not need to become data scientists, but they must ask practical questions:

  • What problem is the AI model solving?
  • Which errors are most harmful?
  • Are development and deployment metrics transparent?

In sleep medicine, clinicians act as informed gatekeepers rather than passive adopters.

###

Further thoughts

Future priorities in sleep medicine should include promoting the use of interpretable AI model systems, maintaining oversight throughout the AI model lifecycle, and adopting standardized reporting checklists. These practices support fairness, reduce bias, and improve trust in clinical AI models.

This article appeared in volume 11, issue 2 of Montage magazine.  

References

Kocak B, Klontzas ME, Stanzione A, Meddeb A, Demircioğlu A, Bluethgen C, Bressem KK, Ugga L, Mercaldo N, Díaz O, Cuocolo R. Evaluation metrics in medical imaging AI: fundamentals, pitfalls, misapplications, and recommendations. Eur J Radiol Artif Intell. 2025;3:100030. https://doi.org/10.1016/j.ejrai.2025.100030

Further Reading

  • Bandyopadhyay A, Bae C, Cheng H, et al. Smart sleep: what to consider when adopting AI-enabled solutions in clinical practice of sleep medicine. J Clin Sleep Med. 2023;19(10):1823-1833. https://doi.org/10.5664/jcsm.10702
  • Bandyopadhyay A, Oks M, Sun H, et al. Strengths, weaknesses, opportunities, and threats of using AI-enabled technology in sleep medicine: a commentary. J Clin Sleep Med. 2024;20(7):1183-1191. https://doi.org/10.5664/jcsm.11132
  • Goldstein CA, Berry RB, Kent DT, et al. Artificial intelligence in sleep medicine: background and implications for clinicians. J Clin Sleep Med. 2020;16(4):609-618. https://doi.org/10.5664/jcsm.8388