Keeping a Healthy Degree of AI Skepticism: Understanding AI Metrics in Sleep Medicine

By Daniel Rongo, MD; Scott Ryals, MD; Mattina A. Davenport, PhD; and Trung Le, PhD, on behalf of the AASM Artificial Intelligence in Sleep Medicine Committee

As artificial intelligence (AI) becomes integrated into the clinical workflow of sleep medicine, the clinician’s role is evolving. Clinicians should understand how AI models function and decide whether they perform as intended. Advanced expertise in computer science is not required to advocate for our patients. A working knowledge of the AI lifecycle, performance metrics, and common statistical pitfalls is sufficient to prevent being misled by impressive-looking numbers. These skills help to ensure AI improves care as the field of sleep medicine enters a new era.

The AI model lifecycle

Every AI model begins with defining a problem or objective. Relevant data are collected and preprocessed so that the model can learn patterns and predict outcomes; when the training data carry known labels, this is called supervised learning. The model is then evaluated on separate data that were not used during training (i.e., validation or test data). Demonstrating generalizability across diverse patient populations, institutions, and clinical settings is essential before deployment. Deployment is not the endpoint. Continuous monitoring is required to detect performance drift, bias, and unintended clinical consequences. One loop of this iterative lifecycle is shown in Figure 1. Ongoing oversight by scientists, administrators, and clinicians is necessary to maintain AI model safety and usefulness.
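As a minimal sketch of the "separate test data" step, the example below holds out whole patients rather than individual records, so that no patient contributes to both training and testing. Grouping by patient is an assumption made here for illustration, not something the lifecycle above prescribes, and the function name split_by_patient, the record fields, and the 20% test fraction are all hypothetical.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split records into train/test so each patient lands in only one set."""
    # Collect unique patient IDs and shuffle them reproducibly.
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    # Reserve a fraction of patients (at least one) for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Hypothetical data: 10 patients, 3 scored epochs each.
records = [
    {"patient_id": f"P{p:02d}", "epoch": e}
    for p in range(10)
    for e in range(3)
]

train_records, test_records = split_by_patient(records)
train_patients = {r["patient_id"] for r in train_records}
test_patients = {r["patient_id"] for r in test_records}

print(f"training patients: {len(train_patients)}, test patients: {len(test_patients)}")
print(f"patients in both sets: {len(train_patients & test_patients)}")
```

Splitting at the patient level is one common way to keep the test set genuinely unseen; a split across institutions or time periods would probe generalizability more directly.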

Figure 1: The AI model lifecycle

Managing metrics of performance

AI models optimize predictions based on a chosen objective. The challenge is deciding which performance metrics matter most clinically. Consider predicting a rare disease (e.g., narcolepsy) with a prevalence of 1%. Accuracy alone is misleading for rare conditions: a model that labels every patient as disease-free would be 99% accurate while detecting no cases. In this setting, minimizing false negatives is critical. The preferred metric becomes recall (sensitivity): the ability to detect true cases. Model training involves optimizing toward specific performance targets. However, optimizing one metric introduces tradeoffs. A model tuned for recall may increase false positives. Selecting the correct metric requires aligning performance goals with clinical risk.
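The arithmetic behind the 1% prevalence example can be sketched in a few lines. The cohort size and the confusion-matrix counts for "Model B" are invented for illustration; only the 1% prevalence comes from the text above.

```python
# Hypothetical cohort: 10,000 patients at 1% prevalence (100 true cases).
n_patients = 10_000
n_cases = 100
n_healthy = n_patients - n_cases

def accuracy_and_recall(tp, fn, fp, tn):
    """Return (accuracy, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Model A: labels every patient "no disease".
acc_a, rec_a = accuracy_and_recall(tp=0, fn=n_cases, fp=0, tn=n_healthy)

# Model B (assumed counts): tuned for recall. It finds 90 of the 100 cases
# but also raises 990 false alarms among the 9,900 healthy patients.
acc_b, rec_b = accuracy_and_recall(tp=90, fn=10, fp=990, tn=8_910)

print(f"Model A: accuracy={acc_a:.3f}, recall={rec_a:.2f}")
print(f"Model B: accuracy={acc_b:.3f}, recall={rec_b:.2f}")
```

Under these assumed counts, Model A scores higher on accuracy (0.99 versus 0.90) while missing every case, and Model B pays for its recall of 0.90 with 990 false positives. Which tradeoff is acceptable is a clinical judgment, not a statistical one.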

A major concern is overfitting, in which a model memorizes noise in the training data and fails on new patients. True validation requires testing on unseen, real-world data. Strong performance during development does not guarantee safe clinical performance.
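Overfitting can be illustrated with a deliberately extreme toy case. In the sketch below, the synthetic "patients" have random features and coin-flip labels, so there is nothing real to learn; the "model" is a nearest-neighbor lookup that memorizes the training set. All data, names, and sizes are hypothetical.

```python
import random

random.seed(0)

def make_patients(n):
    """Synthetic patients: 5 random features and a coin-flip label (pure noise)."""
    return [
        ([random.random() for _ in range(5)], random.randint(0, 1))
        for _ in range(n)
    ]

train = make_patients(200)
test = make_patients(200)

def predict_by_memorization(features):
    """1-nearest-neighbor lookup: return the label of the closest training patient."""
    def squared_distance(other):
        return sum((a - b) ** 2 for a, b in zip(features, other))
    _, label = min(train, key=lambda row: squared_distance(row[0]))
    return label

def accuracy(dataset):
    """Fraction of patients whose predicted label matches the true label."""
    hits = sum(predict_by_memorization(x) == y for x, y in dataset)
    return hits / len(dataset)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)

print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because each training patient is its own nearest neighbor, training accuracy is perfect, while held-out accuracy should hover near chance. A gap of this kind between development and unseen data is the signature of overfitting.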

Different metrics for different models

No single metric applies to every task. The appropriate metric depends on the clinical question. Table 1 summarizes common sleep medicine applications and the metrics most relevant to each. These are not exhaustive but provide a framework for evaluating claims about AI model performance.

Table 1: Examples of tasks for sleep medicine and the metrics that matter

Type of Model	Example (Sleep Medicine)	Metrics
Classification	Predicting epoch sleep stage or CPAP noncompliance	Accuracy, precision, recall, F1 score, area under the curve of a receiver operating characteristic (AUC ROC), Cohen’s kappa, Matthews correlation coefficient (MCC), calibration
Regression	Predicting apnea–hypopnea index (AHI) or total sleep time from physiologic or wearable data	Mean absolute error (MAE), root mean square error (RMSE), R², Pearson r, Bland–Altman bias
Time-series / signal processing	Detecting respiratory events, sleep stages, or EEG features (K-complexes, sleep spindles, frequency content) from raw EEG/airflow signals using Fourier transform, wavelet transform, recurrent neural networks (RNNs), or transformers	Event-level sensitivity/specificity, AUC ROC, Cohen’s kappa vs. expert scoring, correlation coefficients, per-epoch accuracy
Causal inference models	Estimating the effect of PAP therapy on daytime sleepiness after controlling confounders	Average treatment effect (ATE), conditional ATE (CATE), overlap diagnostics, balance metrics
Generative AI / large language models	AI scribe systems for note writing	Hallucination rate, omission rate, attribution error rate, critical error rate, note acceptance rate, median time saved per note
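For the regression row of Table 1, the error metrics can be computed directly from paired values. The sketch below uses invented AHI values (events/hour) for five hypothetical patients and reports MAE, RMSE, and the Bland–Altman bias with 95% limits of agreement (bias ± 1.96 standard deviations of the differences).

```python
import math

# Hypothetical paired values: AHI from reference PSG scoring and from a model.
reference_ahi = [4.0, 12.0, 18.0, 31.0, 45.0]
predicted_ahi = [6.0, 10.0, 21.0, 27.0, 50.0]

# Signed error per patient (model minus reference).
errors = [p - r for p, r in zip(predicted_ahi, reference_ahi)]
n = len(errors)

# Mean absolute error: average size of the miss, ignoring direction.
mae = sum(abs(e) for e in errors) / n

# Root mean square error: penalizes large misses more heavily than MAE.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# Bland-Altman bias: mean signed difference (systematic over/underestimation).
bias = sum(errors) / n

# Sample standard deviation of the differences and 95% limits of agreement.
sd = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

print(f"MAE  = {mae:.2f} events/hour")
print(f"RMSE = {rmse:.2f} events/hour")
print(f"Bias = {bias:.2f} events/hour (limits of agreement {loa_low:.2f} to {loa_high:.2f})")
```

With these invented numbers the bias is small (0.8 events/hour) while the limits of agreement are wide, which shows why a low average error alone does not establish agreement for individual patients.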

Classification tasks, such as sleep staging or predicting CPAP adherence, are especially common in clinical decision-making. Table 2 describes metrics that appear frequently and deserve careful interpretation.

Table 2: Classification metrics, definitions, and considerations/pitfalls

Metric	Definitions	Considerations/Pitfalls
Accuracy	Proportion of all predictions that are correct	Can appear high when the model predicts the most common class (e.g., N2 sleep). Misleading in imbalanced datasets and should never be interpreted alone.
Sensitivity / Recall (R)	True positive rate; proportion of actual cases correctly identified	Critical for patient safety when missing disease has high consequences (e.g., OSA). A model can have high accuracy but dangerously low recall.
Precision (P) / PPV	Proportion of predicted positives that are truly positive	Optimizing precision alone can underdiagnose disease. Low precision increases false positives and strains patient and clinician resources.
F1-score	Harmonic mean of precision and recall	Summary metric balancing false positives and false negatives. Useful as a secondary score but can mask poor recall in rare conditions.
Matthews Correlation Coefficient (MCC)	Balanced measure using all four confusion matrix elements (TP, TN, FP, FN*)	Strong metric for imbalanced classes and event detection. Often more informative than accuracy but less familiar to clinicians.
Cohen’s kappa (κ)	Agreement between model and human scoring beyond chance: κ = (P_o − P_e) / (1 − P_e)**	Useful for comparing AI with expert PSG scoring. Reflects clinical agreement, not overall model optimization. Should be interpreted alongside other metrics.
Area under the curve of a receiver operating characteristic (AUC ROC)	Ability to discriminate across decision thresholds	Can appear high while still missing clinically important cases. Must be interpreted with recall and class balance.
Area under the precision-recall curve (AUC-PR)	Precision-recall performance emphasizing the positive class	More sensitive to rare events than AUC ROC. Drops sharply with poor recall or precision; better reflects performance in imbalanced sleep datasets.

*TP = true positive, TN = true negative, FP = false positive, FN = false negative

**P_o = observed agreement = accuracy (for the same unit), P_e = expected agreement by chance based on label marginals
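The threshold-based metrics in Table 2 all derive from the four confusion-matrix counts. The sketch below computes them for a binary task using standard formulas; the function name classification_metrics and the example counts (a hypothetical screen of 1,000 patients, 100 of whom have the condition) are illustrative assumptions.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute common binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e., both classes are present
    and the model predicts each class at least once.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # sensitivity
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient uses all four cells.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the label marginals.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)

    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical counts: 100 true cases among 1,000 patients.
metrics = classification_metrics(tp=60, tn=870, fp=30, fn=40)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

With these assumed counts, accuracy is 0.93 while recall is only 0.60, meaning 40 of 100 cases are missed. MCC and kappa, both near 0.59, give a more sober picture than accuracy alone.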

Conclusion

AI models will increasingly shape how sleep clinicians diagnose, prognosticate, and document care. AI model performance depends on the objectives and the data used to train it. Metrics must be interpreted in the context of clinical risk to avoid bias, overdiagnosis, and missed treatment. Healthy skepticism entails aligning the highlighted metric with the clinical stakes. Clinicians do not need to become data scientists, but they must ask practical questions:

What problem is the AI model solving?
Which errors are most harmful?
Are development and deployment metrics transparent?

In sleep medicine, clinicians act as informed gatekeepers rather than passive adopters.

Further thoughts

Future priorities in sleep medicine should include promoting the use of interpretable AI model systems, maintaining oversight throughout the AI model lifecycle, and adopting standardized reporting checklists. These practices support fairness, reduce bias, and improve trust in clinical AI models.

This article appeared in volume 11, issue 2 of Montage magazine. 

References

Kocak B, Klontzas ME, Stanzione A, Meddeb A, Demircioğlu A, Bluethgen C, Bressem KK, Ugga L, Mercaldo N, Díaz O, Cuocolo R. Evaluation metrics in medical imaging AI: fundamentals, pitfalls, misapplications, and recommendations. Eur J Radiol Artif Intell. 2025;3:100030. https://doi.org/10.1016/j.ejrai.2025.100030
