Questions: Although the US Preventive Services Task Force does not recommend screening older adults in primary care for dementia, it may be possible to conduct targeted screening in the highest-risk patients (e.g., those with a dementia prevalence ≥90th percentile in a health plan's population). How effectively would KP Washington's prediction model, eRADAR, improve targeted screening? What's the model's risk of bias and applicability to KP Northwest? Are the model's predictions unbiased across races/ethnicities?
Objectives: Assess eRADAR’s “restricted” model, which was designed for use with Epic’s electronic health record, according to the PROBAST tool.1 Assess bias in predictions across races/ethnicities as reported in an external validation publication.
Methods: Two epidemiologists assessed the restricted version of eRADAR’s development and validation publications2,3 according to the PROBAST tool to assess its risk of bias and applicability.1 An epidemiologist also assessed bias in predictions across races/ethnicities in the validation publication.
Results:
The results are organized according to the four PROBAST domains of possible bias: patients, predictors, outcomes, and analysis. Of the 20 PROBAST bias questions, only those with a possible bias are noted below.
PROBAST questions about patients:
Low risk of bias; no concerns about applicability.
PROBAST questions about predictors:
Question 2.3 “Are all predictors available at the time the model is intended for use?”
No. A history of emergency department visits would be measured incompletely at KPNW when patients are risk-scored in routine practice: only half of emergency department visits occur at Sunnyside or Westside Hospitals and are documented in HealthConnect as they occur, while insurance claims from OHSU and other area hospitals' emergency departments may lag by ≥90 days. Consequently, the eRADAR prediction model would slightly under-estimate those patients' risks (see the sketch below). It is also unclear whether the coefficient for emergency visits is an importance weight from the LASSO logistic model or an odds ratio (OR=1.40) from a conventional maximum-likelihood logistic model. Although 28% of patients visited the emergency department at least once in the preceding two years, the occurrence of recent emergency visits is not reported.
No concerns about applicability.
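For a rough sense of how much a missed emergency department visit could lower a predicted risk, the following Python sketch assumes the coefficient behaves like a conventional log-odds ratio (ln(1.40)); the baseline log-odds is hypothetical and chosen only to illustrate the direction and size of the under-estimate.

import math

# Hedged sketch: under-estimation of risk when an emergency department (ED)
# visit is missed because the claim has not yet arrived. Assumes the eRADAR
# coefficient for ">=1 ED visit in the preceding two years" is a conventional
# log-odds ratio, ln(1.40); the baseline log-odds below is hypothetical.
beta_ed_visit = math.log(1.40)              # ~0.34 on the log-odds scale
baseline_log_odds = math.log(0.12 / 0.88)   # illustrative ~12% one-year risk

def risk(log_odds):
    """Convert log-odds to a predicted probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

print(f"Risk with ED visit captured: {risk(baseline_log_odds + beta_ed_visit):.3f}")
print(f"Risk with ED visit missed:   {risk(baseline_log_odds):.3f}")

Under these illustrative assumptions, missing the ED visit lowers the predicted one-year risk from roughly 0.16 to 0.12, enough to shift some patients below a screening percentile cut-off.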
PROBAST questions about outcomes:
Question 3.1 “Was the outcome determined appropriately?”
Probably not, because the outcome was a diagnosis code assigned at one visit. The authors' earlier publications suggest that the positive predictive value of a diagnosis code without a prescription for a dementia medication would be <0.40 (i.e., only 40% of patients with both a diagnosis code and a medication screen positive by the Mini-Mental Exam).4 The authors did not report the PPV for the outcome that they predicted: one visit's diagnosis code without a prescription. The PPV may be far worse than 0.40.
No concerns about applicability.
PROBAST questions about analysis:
Question 4.1 “Were there a reasonable number of participants with the outcome?”
No, the authors should have used their entire cohort for model development or training. The number of unrecognized dementia diagnoses in the training cohort (n=349) appears far too small for model development: PROBAST's criterion of ≥20 outcomes per parameter, with 66 candidate parameters, translates to a minimum of 1,320 outcomes. I also calculated the required sample size for logistic regression using Stata's prediction-modeling sample-size command: the authors needed only 7.18 outcomes per parameter given their c-statistic (0.81) and one-year incidence (0.12), or 474 outcomes for the 66 candidate predictor parameters in the restricted model. Yet they analyzed only 349 outcomes (see the sketch following this question), which may have resulted in uncertainty in the logistic regression's tuning parameters and subsequent over-fitting of the model. However, the authors used the entire development cohort to estimate the predictor coefficients, which may have mitigated over-fitting for the external validation.
Yes, the authors had a sufficient number of outcomes in their hold-out, internal test cohort (n=149) and external validation cohort.
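The outcomes-per-parameter arithmetic behind this assessment is summarized in the Python sketch below; the 7.18 figure is taken from the Stata calculation described above, not re-derived here.

# Sketch of the outcomes-per-parameter arithmetic for Question 4.1.
candidate_parameters = 66
probast_minimum_epp = 20     # PROBAST rule of thumb: >=20 outcomes per parameter
calculated_epp = 7.18        # from the Stata calculation (c-statistic 0.81, incidence 0.12)
observed_outcomes = 349      # unrecognized dementia cases in the training cohort

probast_minimum = candidate_parameters * probast_minimum_epp   # 1,320 outcomes
calculated_minimum = candidate_parameters * calculated_epp     # ~474 outcomes

print(f"PROBAST minimum outcomes:    {probast_minimum}")
print(f"Calculated minimum outcomes: {calculated_minimum:.0f}")
print(f"Observed outcomes:           {observed_outcomes} "
      f"(shortfall: {calculated_minimum - observed_outcomes:.0f})")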
Question 4.2 “Were continuous and categorical predictors handled appropriately?”
No, five continuous (count) predictors of healthcare utilization were dichotomized (e.g., ≥1 emergency department visit in the preceding two years).
Question 4.3 “Were all enrolled participants included in the analysis?” [overlaps with Q 4.6]
Question 4.6 “Were complexities in the data (e.g., censoring, competing risks) accounted for appropriately?”
Probably not for both questions, because patients in the model development cohort whose dementia was recognized during routine clinical visits between the 24-month research visits were excluded from the analysis, leaving only two outcome classifications: no dementia or unrecognized dementia. Excluding those patients may have biased the coefficients for unrecognized dementia and the absolute occurrence of unrecognized dementia. For example, a patient's natural history may have included many months of unrecognized dementia that was ultimately recognized clinically by the 24-month research visit; those patients (50% of all dementia outcomes) are missing from the analysis and did not contribute to the absolute occurrence or percentile rankings. The exclusion also reduces applicability because, when eRADAR is used in routine clinical practice, it is unknowable which patients' dementia will be recognized rapidly. eRADAR will assign those patients a risk score and percentile, but it was not trained on them because they were excluded post hoc. A composite outcome (recognized and unrecognized dementia) would have retained all patients in the analysis, as would other solutions such as a multinomial logistic model (sketched below). The authors' analysis of the composite demonstrated no improvement in overall performance (c-statistic), but that does not address bias in the coefficients, bias in the absolute occurrence, or the concerns about applicability.
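As a minimal sketch of the multinomial alternative, the Python code below fits a three-class model (no dementia, unrecognized dementia, clinically recognized dementia) with scikit-learn on simulated stand-in data; it is not the authors' code, and the predictors and class frequencies are illustrative only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hedged sketch: retain patients with clinically recognized dementia as a
# third outcome class instead of excluding them. Simulated stand-in data.
rng = np.random.default_rng(0)
n_patients, n_predictors = 5000, 10
X = rng.normal(size=(n_patients, n_predictors))
# 0 = no dementia, 1 = unrecognized dementia, 2 = clinically recognized dementia
y = rng.choice([0, 1, 2], size=n_patients, p=[0.88, 0.06, 0.06])

# With the lbfgs solver and more than two classes, scikit-learn fits a
# multinomial logistic model.
model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(X, y)

# The probability of unrecognized dementia (class 1) can still drive targeted
# screening, but the coefficients are estimated without post-hoc exclusions.
p_unrecognized = model.predict_proba(X)[:, 1]
print(f"Mean predicted probability of unrecognized dementia: {p_unrecognized.mean():.3f}")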
Question 4.8 “Were model over-fitting and optimism in model performance accounted for?”
Probably yes, because the LASSO tuning parameters were selected by 10-fold cross-validation in the development (training) data (see the sketch below). However, we have no estimate of optimism for the model's final coefficients. The external validation used a different measure of the outcome in a more heterogeneous population, so its c-statistic and other performance measures are not informative about optimism in the original coefficients.
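For context, the Python sketch below illustrates the kind of 10-fold cross-validated selection of the LASSO penalty that the authors describe, again on simulated stand-in data; note that choosing the penalty by cross-validation does not by itself quantify optimism in the final coefficients.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Hedged sketch: 10-fold cross-validation to choose the LASSO (L1) penalty.
# Simulated data, not the eRADAR development cohort.
rng = np.random.default_rng(1)
n_patients, n_predictors = 2000, 66
X = rng.normal(size=(n_patients, n_predictors))
y = rng.binomial(1, 0.12, size=n_patients)   # ~12% one-year incidence

lasso_cv = LogisticRegressionCV(
    Cs=10,                # grid of candidate penalty strengths
    cv=10,                # 10-fold cross-validation, as in the development study
    penalty="l1",
    solver="saga",
    scoring="roc_auc",
    max_iter=5000,
)
lasso_cv.fit(X, y)

# The selected penalty determines how many coefficients shrink to zero; the
# cross-validated AUC is not an estimate of optimism in the final coefficients.
n_retained = int(np.sum(lasso_cv.coef_ != 0))
print(f"Predictors retained by LASSO: {n_retained} of {n_predictors}")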
Commentary
The primary risk-of-bias concern was eRADAR's choice of outcome measure in the validation study, which may have a low positive predictive value. eRADAR's authors have never validated an outcome defined by a diagnosis code at one visit during the next year of follow-up. The only way to address that risk of bias would be to validate eRADAR in a KP population and determine how many of the apparent dementia diagnoses subsequently screen positive by the Mini-Mental Exam (or a similar tool) and how many of those diagnoses are confirmed clinically by a physician. That pragmatic PPV may be unknowable in routine practice because the US Preventive Services Task Force notes that approximately half of patients who screen positive refuse diagnostic testing for dementia.5
A secondary risk of bias concern was the development study’s exclusion from the analysis of patients whose dementia was recognized clinically. We cannot determine whether their exclusion biased eRADAR’s coefficients for predictors. Because half of all dementia cases were recognized clinically before the 24-month research visit, the absolute occurrence and percentile rankings are probably biased. eRADAR may assign an erroneous risk score or percentile to those patients whose dementia will be recognized rapidly because eRADAR’s logistic model was not trained or developed with those patients.
It is unclear whether the inadequate number of outcomes during model development resulted in over-fitting and optimism in the predictor coefficients that define the risk percentiles. Even if the model's discrimination has been over-estimated at a c-statistic of 0.81, a modest decline in discrimination would still leave eRADAR's restricted model effective at separating higher-risk patients (e.g., a c-statistic ≥0.75).
The authors over-state the calibration of their restricted eRADAR model: "The calibration plots (Figure 1B) suggest reasonable correspondence between observed and predicted risk across the full range of scores." Figure 1B shows that the model under-predicts risk in patients ≥80th percentile, for whom screening with the Mini-Mental Exam or other tools may be recommended. Because there were only 149 dementia outcomes in the test data used to assess calibration, the estimate is imprecise: the 95% confidence interval is compatible with both excellent calibration and poor calibration (i.e., an observed risk >0.10). Given that the intended use is population screening and predicted probabilities of dementia would not be communicated to patients (as they would be in shared decision-making), the degree of under-estimation of risk may be clinically acceptable.
The authors assessed model performance by race and ethnicity in the KP Washington population using the area under the receiver operating characteristic curve (AUC), sensitivity at the 85th percentile used to define high risk, and positive predictive value (also at the 85th percentile). eRADAR's performance was consistent across races and ethnicities.
eRADAR was designed to work with patients who had at least two years of continuous membership with KP to ascertain predictor characteristics. The model has never been validated in patients with a briefer history and may suffer lower sensitivity and more false-negative predictions (i.e., the patient does not appear to be at high risk for dementia because HealthConnect has documented an insufficient clinical history). If the model were put into routine practice at KP Northwest, where some elderly patients do not have two years of history, eRADAR’s performance would be worse than reported in the validation at KP Washington.
In their validation study at KP Washington, the authors report that patients ≥90th percentile of predicted risk according to the restricted version of eRADAR exhibit a positive predictive value of 4.8%. The authors' previous validation study, which they do not cite, suggests that fewer than 40% of those apparent dementia diagnoses would screen positive by the Mini-Mental Exam and that only half of those would be confirmed as dementia by a physician.4 That means the pragmatic PPV would be closer to 1.0% than 4.8% (i.e., the number needed to screen with the Mini-Mental Exam to identify one confirmed case of dementia is approximately 100); the sketch below shows the arithmetic.
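The pragmatic-PPV arithmetic is sketched below in Python; the 40% and 50% attenuation factors come from the earlier validation work cited above, and the true values in a KP Northwest population are unknown.

# Sketch of the pragmatic PPV and number-needed-to-screen arithmetic.
reported_ppv = 0.048              # PPV at the >=90th percentile, KP Washington validation
screen_positive_fraction = 0.40   # <=40% of coded diagnoses screen positive on the MMSE
confirmed_fraction = 0.50         # ~half of screen-positives confirmed by a physician

pragmatic_ppv = reported_ppv * screen_positive_fraction * confirmed_fraction
number_needed_to_screen = 1 / pragmatic_ppv

print(f"Pragmatic PPV: {pragmatic_ppv:.3f} (~1%)")
print(f"Number needed to screen: {number_needed_to_screen:.0f} (~100)")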
References
1. Moons KGM, Wolff RF, Riley RD, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies: explanation and elaboration. Ann Intern Med 2019;170:W1-W33.
2. Barnes DE, Zhou J, Walker RL, et al. Development and validation of eRADAR: a tool using EHR data to detect unrecognized dementia. J Am Geriatr Soc 2020;68:103-111.
3. Coley RY, Smith JJ, Karliner L, et al. External validation of the eRADAR risk score for detecting undiagnosed dementia in two real-world healthcare systems. J Gen Intern Med 2022;38:351-360.
4. Harding BN, Floyd JS, Scherrer JF, et al. Methods to identify dementia in the electronic health record: comparing cognitive test scores with dementia algorithms. Healthcare 2020;8:100430.
5. Patnode CD, Perdue L, Rossom RC, et al. Screening for cognitive impairment in older adults: Updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 2020;323:764-785.