AI Algorithms for Disease Detection: Methodological Decisions for Development of Models Validated Through a Clinical, Analytical, and Commercial Lens

Brian Malpede, Manager; Goksu Dogan, Principal; Scott Moreland, Data Scientist; Rabe’e Cheheltani, Consultant; Brittany Fischer, Associate Consultant; Suyin Lee, Manager; Nadea Leavitt, Principal and US Lead; Orla Doyle, Lead Data Scientist; John Rigg, Senior Principal and Global Lead, IQVIA Predictive Analytics
Disease detection driven by artificial intelligence (AI) has proven to be an effective tool for identifying undiagnosed patients with complex common as well as rare diseases. The use of these algorithms is driven by awareness that underdiagnosis places a heavy burden on patients and healthcare professionals, and also poses a challenge for pharmaceutical companies seeking to expand the patient pool for their medications, whether to power clinical trials or to efficiently target healthcare providers (HCPs). However, despite widespread awareness and usage of this application, the methodologies employed are highly variable and learnings are rarely shared. In addition, the commercial application of models built for pharmaceutical companies is not always considered during model development, despite the importance of methodological decisions to the efficient and successful real-time implementation of AI-driven diagnostics. In this paper, a cross-functional methodological approach to AI algorithm design for undiagnosed patient detection will be detailed, an approach honed through the development of numerous algorithms applied to a wide range of diseases, from common to ultra-rare, across diverse therapeutic areas. Methodological and technical considerations will be described that account for relevant aspects of the clinical, analytical, and commercial environments to develop an AI solution that is statistically robust, clinically relevant, interpretable, and operationally tenable.

Keywords: Artificial Intelligence, Machine Learning, Predictive Analytics, Disease Detection, Rare Disease, Algorithms

1.0 Introduction
Disease detection algorithms driven by artificial intelligence (AI) have proven to be effective tools for identifying patients with underdiagnosed, un-coded, and rare diseases. The application of these algorithms is greatly influenced by the challenges that patients and healthcare professionals face, as well as those encountered by pharmaceutical companies trying to expand the pool of candidate patients for their medications, whether to power clinical trials or to efficiently target healthcare providers (HCPs).

Despite the popularity and widespread use of disease detection algorithms, methodological design varies widely across studies, and insights into best practices are rarely shared. Developing an effective disease detection algorithm is a multifaceted effort involving technical, clinical, and operational expertise. These capabilities are essential in informing each step of study design, model development, and deployment. Clinical validation and interpretation of the model are equally important to the evaluation and optimization of advanced AI techniques. Further, the implications for business operations, which are key to the development and real-time implementation of AI-driven diagnostics, are often overlooked during the model development and deployment phases.

In this paper, we detail a cross-functional methodological approach to AI algorithm design for undiagnosed patient detection, established over several years and applied to various diseases, ranging from common to ultra-rare. We describe methodological and technical considerations that reflect relevant aspects of clinical, analytical, and commercial environments to develop an AI solution that is statistically sound, clinically relevant, interpretable, and operationally tenable. We focus on three main areas:
  1. Application of analytical techniques that drive robust clinical and statistical validation as well as interpretability and insight of AI models
  2. Inputs and techniques that foster development of a model that is appropriate and actionable for the desired commercial implementation
  3. The outlook for building and utilizing diagnostic algorithms developed with AI
2.0 The Process of Building a Model
The primary and essential elements of building a model to predict diagnosis are consistent across disease states and applications. Details within each step may vary, but the overall process can be summarized in five main steps, shown in Figure 1.

Figure 1: The Five Main Steps in Developing an AI Model Designed for Prediction of Undiagnosed Disease

Source: IQVIA illustration
2.1 Selecting a Dataset and Building Cohorts
Selection of the dataset is a key aspect of AI modeling. The dataset is important for several reasons, including:
  • The identification of patients from which the model will learn
  • The type and volume of predictors that can be leveraged
  • The ongoing business application
Important considerations for selection include a balance of cost, patient coverage, and application (e.g. clinical, commercial). Adjudicated or non-adjudicated medical claims and electronic health records (EHR) are commonly used or considered datasets. Additional datasets that might be used to supplement modeling include patient registries, lab claims, and consumer data.

The first and most important question to consider is model application and key goals for the modeling effort. For clinical trial recruitment and pharmaceutical marketing, the goals include broad and timely identification of new potential patients and their HCPs. As such, an open claims dataset with robust coverage of patients and HCPs and near real-time updates would be most applicable. Using closed claims for this application is a significant disadvantage given there is an extended lag time in the data. In contrast, for development of a clinical decision tool, the goals might include clinical insight or diagnostic indicators and limited disruption of clinician workflow. For this specific application, an EHR dataset would be more appropriate as it would mimic the environment in which it would be deployed.

Additional datasets, such as lab data, patient registries, and specific consumer data, can supplement modeling. These datasets serve two main purposes: the identification of known diseased patients (e.g. through disease-specific lab/genetic testing), and the profiling of patient subgroups to gain insights into the studied patient population (e.g. through consumer attributes). While all datasets have individual value, researchers should use caution before combining them all in the same model. The cost and complexity of both initial development and ongoing deployment should be weighed against the gain in model performance.
2.2 Patient Cohort Design
After selecting the appropriate dataset, the next critical element of disease detection modeling is the development of clean, validated patient cohorts: the sets of patients from which the model will learn to differentiate disease from non-disease. These groups are often referred to as the positive cohort (with disease) and the negative cohort (without disease).
2.3 Selection of a Positive Cohort
In the simplest form, positive patients can be selected based on defined criteria that are indicative of the disease of interest. For example, if the goal of the model is to predict patients diagnosed with shingles, the selection criterion could be evidence of a claim with an ICD-10 diagnosis code specific to the disease (Herpes Zoster, B02 family of codes). However, selection of positive cohorts is not always this straightforward, and more complex steps to identify a validated positive cohort may be necessary. Several more complex scenarios are discussed below, including:
  A. The use of multiple claims for the disease of interest to increase confidence in the diagnosis
  B. The use of “proxies” such as: 1) treatments indicated exclusively for a target disease state or 2) the combination of multiple diagnostic codes that together define a specific disease state
  C. A selection period after October 2015 for diseases in which the single ICD-9 code is shared among multiple diseases, where only patients with a definitive and non-shared ICD-10 code are included as positive
  D. The use of a supplemental data source such as EHR, lab, or patient registry
Scenario A. A single occurrence of an ICD-10 code in a patient’s medical history may not indicate that the patient truly has the disease, but rather that the patient was tested for it. In these instances, the criteria for a positive patient can be refined to require at least two instances of the ICD-10 code specific to the disease. This selection drives higher confidence of a confirmed diagnosis for positive patients, eliminating patients who may have been tested for the disease but never ultimately diagnosed.
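The two-claim rule in Scenario A is straightforward to implement over a claims table. The sketch below is purely illustrative, with hypothetical column names and the shingles code B02.9 as an example; it admits a patient to the positive cohort only when at least two qualifying claims are present:

```python
import pandas as pd

# Toy claims table; in practice this would be millions of rows drawn
# from the selected claims dataset. Column names are hypothetical.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "icd10":      ["B02.9", "B02.9", "B02.9", "B02.9", "B02.9", "J45.909"],
})

target_codes = {"B02.9"}  # Herpes Zoster, one code of the B02 family

# Count disease-specific claims per patient and keep those with >= 2.
counts = (claims[claims["icd10"].isin(target_codes)]
          .groupby("patient_id")
          .size())
positive_ids = set(counts[counts >= 2].index)  # the two-claim rule
```

Patients with a single qualifying claim (here, patient 2) are excluded as possible test-only encounters.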

Scenario B. In some cases, the disease of interest may have a non-specific ICD code, where the code itself is shared among diseases of the same family or is used for patient populations outside of the disease of interest. For example, if the model is being developed to predict patients with hereditary angioedema (HAE), a challenge that arises is that the ICD-10 code for HAE (D84.1) is shared with other forms of angioedema. In this case, other proxies of a confirmed disease state in a patient’s medical history should be considered to select those diagnosed with only the disease of interest, and thus the cleanest sample of positive patients with which to train a model. Once again using HAE as an example, a specific diagnosis may be defined as patients with evidence of treatments indicated exclusively for HAE, or through evidence of a combination of diagnosis codes for HAE (i.e. when a patient has evidence of both the ICD-10 D84.1 and broad ICD-9 277.6, which codes for numerous disorders under “deficiencies of circulating enzymes”) along with evidence of non-specific treatments used for management of the disease.

Scenario C. Many rare diseases have a specific code under ICD-10 but shared a diagnostic code with a broader group of conditions under ICD-9. In these cases, the positive patient selection period can be limited to after October 2015, when ICD-10 was introduced in the U.S., to help ensure a clean positive cohort. The first instance of diagnosis for such patients, however, should be defined as the first observed diagnosis code in their history, whether the shared ICD-9 or the exclusive ICD-10 version, such that the timing of the patient’s initial diagnosis is identified with the highest confidence.
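Scenario C’s split between the selection window and the index date can be sketched as follows. The code uses hypothetical column names and, purely for illustration, a disease with a shared ICD-9 code (272.7, lipidoses) and a specific ICD-10 code (E75.22, Gaucher disease): positives are selected on post-October-2015 ICD-10 evidence, while the first-diagnosis date is taken from the full history.

```python
import pandas as pd

# Toy diagnosis history (hypothetical columns): patient 1 has an early
# shared ICD-9 claim and a later specific ICD-10 claim; patient 2 has
# only the shared ICD-9 code and cannot be confirmed as positive.
dx = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "code":       ["272.7", "E75.22", "272.7"],
    "date": pd.to_datetime(["2014-06-01", "2016-02-01", "2015-01-15"]),
})

ICD10_SPECIFIC = {"E75.22"}
cutover = pd.Timestamp("2015-10-01")  # U.S. ICD-10 transition

# Cohort entry: at least one specific ICD-10 claim after the cutover.
eligible = dx[dx["code"].isin(ICD10_SPECIFIC) & (dx["date"] >= cutover)]
positive_ids = set(eligible["patient_id"])

# Index date: earliest claim of EITHER code version in the full history,
# so diagnosis timing is captured with the highest confidence.
index_dates = (dx[dx["patient_id"].isin(positive_ids)]
               .groupby("patient_id")["date"].min())
```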

Scenario D. Finally, a single data source may not always be enough to identify patients of interest. Situations exist in which a group of positive patients cannot be identified solely through diagnosis and treatment coding in claims data, such as when there is simply no ICD code or available proxy. In these cases, addition of other data sources such as EHR, patient registries, or lab results (e.g. genetic testing) may be beneficial. As an example, to identify a disease severity not captured in ICD coding, EHR data can be used to reveal patients with evidence from provider notes of the disease state of interest. These patients can then be linked back to claims data (or other datasets) for model training.
2.4 Selection of a Negative Cohort
At a basic level, patients in the negative cohort can be selected based on the absence of evidence for the disease of interest. In some cases, further filtering of the cohort can eliminate patients who are ineligible to have the disease and thus would not serve as suitable comparators. For example, male patients should be excluded from a negative cohort for a model identifying patients with endometriosis, a disease of the uterus, and elderly patients would be inappropriate to include for a model focused on a pediatric disease.

In some instances, a lack of evidence for disease does not necessarily indicate that a patient is unaffected; it may instead reflect limitations in data coverage, coding practices, or under-diagnosis. An understanding of the estimated prevalence of unknown or unlabeled positives, which could be wrongly labeled as negatives, helps in approximating their potential impact on model performance. Negative patients are often drawn as a random sample of the general population with no evidence of the disease of interest; for modeling in rare and ultra-rare diseases, the existence of false negatives (positive patients not identified as such due to the limitations mentioned above) is therefore typically negligible because of the low prevalence of disease. In other instances, however, such as models for common (typically under-coded or under-diagnosed) diseases, these ‘unknown positives’ wrongly labeled as negatives can have substantial implications for model development, the biggest being incorrect measurement of model performance. Techniques that seek to understand or mitigate the impact of unlabeled positives are discussed in a later section.

Cohort selection is crucial to the modeling process, but researchers should avoid the urge to over-clean the cohorts, which risks introducing bias, reducing sample size, and hindering model efficacy. The introduction of bias is an especially critical consideration in study design. Main sources of bias include changing data coverage, seasonality, market events (diagnostic or procedure coding changes, patient pathway updates, newly introduced treatments), and inappropriate selection criteria that lead to an improper positive cohort. Mitigating these sources of bias resides in the proper selection of cohorts as outlined above, as well as in the suitable selection of a study time period for model training.
3.0 Appending Features (Predictor Design)
Following development of positive and negative cohorts, the next step is to append potential predictors, or in other words, identify the medical history that will be used to train the model. The importance of developing a strong set of predictors (also called features, model inputs, or variables) cannot be overstated. These inputs form the basis for how a model makes its decisions about patient predictions and are thus critical both in driving the predictions themselves and in gaining clinical insight from the model. There are two main approaches to predictor generation: an automated data-driven process, and a hypothesis- or knowledge-driven process. Finding a balance between classical hypothesis-driven and automated data-driven feature generation is essential for an interpretable and operational model.
3.1 Hypothesis (Knowledge) Driven Features
Hypothesis, or knowledge-driven feature generation, is valuable in that it allows for testing of predictors considered to be clinically important. As such, these predictors are easy to interpret and simple to understand in clinical terms. However, despite their interpretability, these predictors may not capture all relevant aspects of medical history for a specific disease. An example of this type of feature is the roll up of a set of diagnostic ICD-10 coding for abdominal pain. This physical manifestation is captured in several individual ICD-10 codes, each of which defines a specific region of the abdomen as well as the type of pain. A classical knowledge-driven feature would usually contain each of these codes “rolled-up” into one clinical bucket, such that the only information the model sees is abdominal pain, and none of the underlying specificity from individual clinical claims codes. Several mapping systems, such as SNOMED,1 exist to align coding with clinically relevant roll-ups, allowing for straightforward assembly of interpretable knowledge-driven clinical features.
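A minimal, hand-built stand-in for such a roll-up might look like the following (the three ICD-10 codes shown are real abdominal-pain codes; a production system would source the full mapping from SNOMED or a similar terminology rather than this illustrative dictionary):

```python
# Illustrative knowledge-driven roll-up: collapse granular ICD-10
# abdominal-pain codes into one clinical bucket the model will see.
ROLLUP = {
    "R10.11": "abdominal_pain",  # right upper quadrant pain
    "R10.31": "abdominal_pain",  # right lower quadrant pain
    "R10.84": "abdominal_pain",  # generalized abdominal pain
}

def knowledge_features(patient_codes):
    """Map a patient's raw ICD-10 codes to knowledge-driven buckets.

    Codes without a mapping are dropped here; a data-driven pipeline
    would instead keep them as granular candidate features.
    """
    return {ROLLUP[c] for c in patient_codes if c in ROLLUP}
```

Under this scheme the model sees only the bucket `abdominal_pain`, regardless of which underlying code appeared in the claims history.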
3.2 Data-Driven Features
To leverage additional medical history, an automated data-driven process can be utilized in conjunction with hypothesis-driven features. This process involves selecting features based on the data alone to define inputs for modeling. Leveraging the data to define potential predictors often leads to an initial assessment of thousands of different features. This lengthy list of features can then be narrowed down using a variety of selection techniques to a set of those that are most relevant. Using the data in such a way can reveal previously unknown predictors and is especially useful in disease states where there is limited understanding of the patient journey. While this process is valuable, one caveat with data-driven features is that they are often presented with substantial granularity (such as a single CPT or a single ICD-10 code). This granularity can be helpful but may also result in challenges with clinical interpretation of model decision making.
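One common way to narrow thousands of data-driven candidates is a univariate filter. The sketch below runs on synthetic count data rather than real claims: it plants one feature that genuinely tracks the label and keeps the ten highest-scoring candidates using scikit-learn’s chi-squared filter (one of many possible selection techniques).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for automatically generated count features
# (e.g. one column per raw claims code).
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(500, 200)).astype(float)  # 200 candidates
y = rng.integers(0, 2, size=500)                     # disease label
X[:, 7] += y * 3  # plant one feature that truly tracks the label

# Univariate chi-squared filter: keep the 10 most label-associated features.
selector = SelectKBest(chi2, k=10).fit(X, y)
kept = set(selector.get_support(indices=True))
```

The planted feature survives the filter while most noise features are discarded; in practice the surviving granular features would then be reviewed for clinical interpretability.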
4.0 Building a Model and Assessing Performance
With positive and negative patient cohorts selected, and a set of features built for input into the modeling process, binary classification is the intuitive solution to develop diagnostic algorithms. Several modeling techniques are typically employed for model training, including logistic regression, random forest, gradient boosting, and neural networks, all of which can work quite well depending on the circumstances. With that said, decision tree algorithms based on gradient boosting (such as XGBoost)2 are found to work particularly well for the domain of disease detection (See Figure 2).

Figure 2: Comparison of Model Performance for Several Techniques

Comparison of logistic regression, random forest, and XGBoost for a model trained to predict undiagnosed patients with an ultra-rare neuromuscular disorder. XGBoost outperforms the other techniques for detection of this disease. Precision: proportion of patients identified by the model that are truly positive; Recall: proportion of all known positives identified by the model. Source: IQVIA case study.
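A bake-off like the one in Figure 2 can be sketched on synthetic data as below. This is an illustrative harness, not the study’s actual pipeline: scikit-learn’s GradientBoostingClassifier stands in for XGBoost to keep the example dependency-light, and average precision summarizes each model’s precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (95% negatives) as a stand-in
# for a positive/negative patient cohort.
X, y = make_classification(n_samples=2000, n_features=30,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Train each candidate technique and summarize its PR curve with
# average precision on the held-out set.
scores = {}
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("boosting", GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = average_precision_score(y_te, proba)
```

Whichever technique wins on a given dataset, the comparison itself should always be run on a held-out set with a realistic class balance, as discussed in the next section.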

4.1 Model Validation
Validation of the model’s performance is critical for assessing commercial, clinical, or other applicability. This process helps define how effectively a model can identify undiagnosed patients in a real-world commercial setting, as well as to understand the model’s complex decision-making process. There are several ways to approach performance measurement, including the precision-recall (PR) curve alluded to above, the F1 score (the harmonic mean of precision and recall), and the area under the receiver operating characteristic curve (AUC), defined as the integral of the model’s true positive rate as a function of its false positive rate.

The F1 and AUC scores summarize model performance as a single number, making them popular choices for performance ranking and hyperparameter optimization. Their definitions, however, average and integrate over competing terms in the confusion matrix (e.g. the true positive rate and false positive rate), rendering them unsuitable for nuanced applications where a certain region of the precision-recall or receiver operating characteristic curve is of particular interest, as is most often the case for disease detection. In these situations, a more flexible variant of the F1 score, known as the generalized Fβ score, can be used to quantify model performance, biased toward higher or lower recall levels by tuning the value of the parameter β. This metric is well suited to settings with significant imbalance between the positive and negative patient populations.
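The generalized score is defined as Fβ = (1 + β²)·precision·recall / (β²·precision + recall), so β < 1 favors precision and β > 1 favors recall (β = 1 recovers F1). A small worked example with scikit-learn, on toy labels rather than study data:

```python
from sklearn.metrics import fbeta_score

# Toy predictions: 2 true positives, 1 false positive, 2 false negatives,
# so precision = 2/3 and recall = 1/2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision_heavy = fbeta_score(y_true, y_pred, beta=0.5)  # favors precision
recall_heavy    = fbeta_score(y_true, y_pred, beta=2.0)  # favors recall
```

Because precision (2/3) exceeds recall (1/2) here, the precision-weighted score (5/8 = 0.625) is higher than the recall-weighted one (10/19 ≈ 0.526); tuning β lets the evaluation reflect which error type matters more for the deployment at hand.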
4.2 Evaluation of a Precision Recall Curve
While the Fβ and AUC measures can be useful for ranking and optimization through measurement of overall performance, they contain only a fraction of the information encoded in the full PR curve. The PR curve is typically the most valuable metric as it provides an intuitive assessment of model performance as well as highly actionable outputs. In addition, the curve allows for an adjustable threshold to suit multiple commercial and clinical deployment initiatives. Identifying a recall threshold at which a patient is identified as high likelihood (with highest levels of precision) versus a patient that is lower likelihood (lower levels of precision) allows for targeted differentiation of predicted patient candidates. Examples of applications include choosing a personal vs. non-personal promotion in a commercial setting or advocating for an expensive diagnostic test versus less costly patient monitoring in a clinical setting.

Reviewing a PR curve is highly useful in understanding a model’s real-world application through quantification of potential performance when implemented. However, the PR curve assessment must be used appropriately. Specifically, the curve must be calculated using a representative ratio of positive to negative patients, or in other words, the model should be assessed for performance on a set of patients that reflects the expected real-world population. If the model is provided a 1:1 ratio of positive to negative patients for testing, the number of false positives (patients predicted to have disease that do not actually have the disease) will be grossly underestimated.

In contrast, if the ratio is set according to expected prevalence of disease in the population, the actual expected number of false positives, and therefore a true understanding of real-world application, will be defined. The example PR curve in Figure 3, generated for a model designed to detect an ultra-rare neuromuscular disorder, demonstrates the importance of utilizing a representative ratio for model evaluation. Take, for instance, the point on the curve at 10% recall. When the precision of the model is evaluated on a 1:1 ratio it is nearly 100%. Even at a 1:1,000 ratio the precision is approximately 95%. This precision is deceptively high or artificially inflated as the true precision is 28% when evaluated on a ratio that best approximates real-world disease prevalence.

Figure 3: PR Curves Adjusted Across Changing Ratio of Positive to Negative Patients

A PR curve should be calculated for a representative ratio of positive to negative patients for diagnostic modeling. If not, the curve, and thus evaluation of model performance, can be misleading and lead to inappropriate assessment of model performance. Each of the PR curves in this image are calculated for the same model at different ratios of positive (diseased) to negative (control/non-diseased) patients. Source: IQVIA case study.

The definition of good performance is thus dependent on the specific modeling exercise and disease of interest. While the absolute value of precision may be low (see Figure 3), the increase in performance relative to selecting patients at random, a baseline that can be derived from disease prevalence, is a more apt way to assess the function of a model.
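The prevalence effect shown in Figure 3 follows directly from the definition of precision: at a fixed operating point the true and false positive rates are properties of the model, but precision = (TPR * pi) / (TPR * pi + FPR * (1 - pi)) depends on the prevalence pi of the evaluation set. The sketch below uses illustrative operating-point numbers, not the case study’s actual rates:

```python
# Precision at a fixed operating point, as a function of prevalence.
def precision_at_prevalence(tpr, fpr, pi):
    """pi is the fraction of positives in the evaluation population."""
    return (tpr * pi) / (tpr * pi + fpr * (1 - pi))

# Hypothetical operating point near 10% recall with a tiny false
# positive rate (illustrative values only).
tpr, fpr = 0.10, 1e-5

balanced  = precision_at_prevalence(tpr, fpr, 0.5)         # 1:1 test set
realistic = precision_at_prevalence(tpr, fpr, 1 / 30_000)  # rare disease
```

On a 1:1 test set the same operating point looks nearly perfect, while at a rare-disease prevalence its precision collapses to roughly a quarter, mirroring the deceptively high curves in Figure 3.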
4.3 Considerations for Model Training with the Expectation of False Negatives
In situations where an unclean negative cohort is expected, or in other words where the proportion of false negatives (also called unlabeled positives) will be high, an approach known as positive and unlabeled (PU) learning can help.4 An example of a situation in which this type of learning could be appropriate is a disease in which a significant portion of the diagnosed population is un-coded in claims data. These situations may arise due to stigma associated with the disease, or in the absence of specific treatments and thus limited incentive or awareness for HCPs to submit a claim for the disease itself. This issue may impede the model’s ability to learn to differentiate positive from negative, and thus adversely affect the patterns and profiles that the model leverages when applied to a real-world dataset for undiagnosed patient identification.

Here, the presence of unlabeled/unknown positives in the claims data can be inferred by comparing the observed incidence of disease to published clinical estimates of the disease’s incidence or prevalence. In practice, these quantities can disagree for a number of reasons, including incomplete data capture and fundamental differences between clinical reality and how healthcare providers code and document care. If the goal of a study is to detect all positive patients, even those who may go undiagnosed in the absence of external intervention, then a conventionally trained model will grossly underreport positive patients: it learns to distinguish labeled positives from a negative class that itself contains a significant number of unlabeled positives, so the features that would identify a positive are severely diluted.

One intriguing PU learning method proposed in the literature is to use “spies” to identify clean negative examples in the pool of unlabeled examples; these can then be used with the known positives to train a traditional binary classifier.5 The spies are randomly sampled positive examples that are artificially injected into the pool of unlabeled examples during training. The unlabeled examples (together with the injected spies) are then modeled as if they were purely negative, and a traditional classifier is trained on the resulting positive and negative examples. Specifically, the largest possible decision threshold t is found such that only a small fraction f of spies, e.g. f = 10%, have a classifier score smaller than t. It is assumed that the examples with scores less than t are mostly clean negatives. The clean negatives are finally combined with the known positives and used to train a second-stage traditional classifier with a high-purity negative cohort. Once the second-stage classifier is trained, it can be used to score previously unseen examples drawn from the same overall distribution of positive and unlabeled examples, with the goal of more effective identification of true positive versus true negative patients.
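The spy technique can be sketched end to end on synthetic data. The sketch below uses illustrative distributions and cohort sizes (not the cited authors’ exact procedure): it injects 20% of the known positives as spies, sets the threshold at the 10th percentile of spy scores, and retrains a second-stage classifier on the recovered clean negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic cohorts: known positives, plus an unlabeled pool that is
# mostly true negatives but hides some undiagnosed positives.
P = rng.normal(2.0, 1.0, size=(200, 5))               # known positives
U = np.vstack([rng.normal(0.0, 1.0, size=(800, 5)),   # true negatives
               rng.normal(2.0, 1.0, size=(100, 5))])  # hidden positives

# Stage 1: inject 20% of positives as "spies" into the unlabeled pool
# and train treating that pool as purely negative.
spy_idx = rng.choice(len(P), size=40, replace=False)
spies, P_rest = P[spy_idx], np.delete(P, spy_idx, axis=0)

X1 = np.vstack([P_rest, U, spies])
y1 = np.r_[np.ones(len(P_rest)), np.zeros(len(U) + len(spies))]
stage1 = LogisticRegression(max_iter=1000).fit(X1, y1)

# Threshold t: only f = 10% of spies score below it; unlabeled examples
# below t are assumed to be mostly clean negatives.
t = np.quantile(stage1.predict_proba(spies)[:, 1], 0.10)
reliable_neg = U[stage1.predict_proba(U)[:, 1] < t]

# Stage 2: traditional classifier on positives vs the purified negatives.
X2 = np.vstack([P, reliable_neg])
y2 = np.r_[np.ones(len(P)), np.zeros(len(reliable_neg))]
stage2 = LogisticRegression(max_iter=1000).fit(X2, y2)
```

Because the hidden positives in the unlabeled pool score like the spies, they are largely excluded from the recovered negative set, so the second-stage classifier trains on a much cleaner contrast than a naive model would.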
5.0 Interpreting Model Results
One of the main advantages of AI algorithms is the ability to detect patterns in big data, invisible to the human eye, across thousands of features generated without a priori hypotheses. However, the increasing complexity of modeling approaches comes with reduced interpretability, fostering the perception that many models are “black boxes.” Interpretability of model decisions is critical in validating the model’s efficacy, or in other words, building confidence that the model is thinking correctly about potentially undiagnosed patients. Several techniques can be employed to help achieve an understanding of model behavior, including predictor importance, relative risks, and SHAP.

A note here is that machine learning algorithms are—at their most basic level—geometric structures that live in multi-dimensional “feature space”. Sometimes these structures admit low-dimensional representations that can be easily visualized, allowing one to elucidate the model’s decision-making process with relative ease. Quite often, however, low-dimensional representations do not exist or are not readily available.

This latter situation is often the case for rare disease modelling, where patient outcomes depend on non-linear interactions between numerous features. In such cases, one may require a model that is irreducibly complicated and thus difficult to explain/visualize in one or two dimensions. That is to say that while the methods discussed here can provide clarity and confidence in a model’s decisions and structure, they cannot necessarily eliminate certain facets of models that may remain too complicated to visualize and understand. Below we discuss the application of several techniques that intend to clarify and validate a model’s predictions, with the ultimate goal of building confidence that a specific model is truly suitable for use in a real-world setting.
5.1 Predictor Importance
Predictor importance is typically the first step in understanding a model’s decision making. The output ranks predictors by the amount each one contributes to the model’s ability to identify potentially undiagnosed patients, allowing for a straightforward initial glance into the model’s processes (see Figure 4). For decision-tree based models, predictor importance can be effectively measured with a metric called gain, which quantifies the relevance of a given predictor to each tree, and thus to each decision/split point, in the model. A higher gain value implies that the predictor is more important to the decision-making process.

Figure 4: Assessment of Predictor Importance

Feature importance as measured by model gain and broken out by predictor type (e.g. frequency, timing, and other – gender/age). The gain measurement sums to 100% across all predictors included in a diagnostic model. The chart shown here is not exhaustive of all model features, but rather shows the illustrative top ten predictors for a gastrointestinal disorder. Source: IQVIA case study.
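A gain-style ranking like Figure 4 can be reproduced in a dependency-light sketch with scikit-learn, whose impurity-based importances play the role of XGBoost’s gain here and are likewise normalized to sum to one across all predictors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a diagnostic model's training set.
X, y = make_classification(n_samples=1000, n_features=15,
                           n_informative=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importance per predictor, normalized so the values sum to 1 (100%).
importance = model.feature_importances_
top10 = np.argsort(importance)[::-1][:10]  # ten most influential predictors
```

Plotting `importance[top10]` against the predictor names yields the kind of top-ten chart shown in Figure 4; with XGBoost itself, `importance_type="gain"` would be requested explicitly.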
5.2 Evaluating the Magnitude and Direction of Predictor Importance
While predictor importance is valuable, it does not convey the magnitude or direction (positive or negative correlation) of a given predictor’s effect. For these purposes, relative risk measurements enable a more detailed quantification of the model’s assessment of disease risk associated with specific aspects of each predictor. This metric allows for a calculation of the magnitude, or strength, of the risk associated with a given predictor, and also with a specific facet of the predictor (e.g. frequency, occurrence, timing).

Figure 5: Example of Relative Risk Measurements for Model Validation

Relative risk is defined as the increased risk of diagnosis associated with a specific predictor, such as the frequency and timing of emergency room visits. Source: IQVIA methodology; illustrative examples shown here.

For example, in examining a predictor of a specific disease, such as the occurrence of emergency room visits, relative risk can clarify (see Figure 5):
  1. The risk associated with the occurrence of a visit
  2. Risk associated with how frequently the event occurred prior to diagnosis
  3. Risk associated with specific timing (typically focused on the first event and most recent event prior to diagnosis)
In addition to associated risk, the directionality, or positive versus negative impact, of a predictor can be understood. Some predictors may show a simple trend in the positive versus negative direction, whereas others may fluctuate depending on the specific value associated with the predictor itself (e.g. frequency versus timing).
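For a binary predictor facet such as the occurrence of an emergency room visit, relative risk reduces to a ratio of incidence proportions. A sketch with hypothetical counts (not drawn from any study):

```python
# Relative risk of diagnosis for patients with vs without a given
# predictor facet (e.g. at least one emergency room visit).
def relative_risk(exposed_pos, exposed_n, unexposed_pos, unexposed_n):
    """Ratio of diagnosis proportions: exposed vs unexposed patients."""
    return (exposed_pos / exposed_n) / (unexposed_pos / unexposed_n)

# Hypothetical counts: 30 of 400 patients with an ER visit are diagnosed,
# versus 20 of 1,600 without one, giving a relative risk of about 6.
rr = relative_risk(30, 400, 20, 1600)
```

Values above 1 indicate a positive direction (increased risk of diagnosis) for the facet, and values below 1 a negative one; computing this per facet (occurrence, frequency band, timing window) produces displays like Figure 5.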
5.3 Additional Patient Level Analysis
To further evaluate patient-level predictions, a technique known as SHAP can determine, for a given patient or patient subtype (e.g. gender, age group, disease etiology or pathway), the specific set of predictors and the quantified contribution of each to the model’s risk score.6 This method allows for model-driven profiling of individual patients or subgroups and helps clarify the complex and intricate ways in which an AI model derives its risk measurements.
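SHAP approximates Shapley values from cooperative game theory; for a very small model they can be computed exactly, which makes the idea concrete. The sketch below uses a hypothetical three-predictor risk model (not a fitted model from any study) and recovers each predictor’s contribution, including its share of an interaction term:

```python
from itertools import combinations
from math import factorial

# Hypothetical additive risk model with one interaction: baseline 0.1,
# ER visits add 0.3, age adds 0.2, and ER + specialist together add 0.2.
def model_score(present):
    score = 0.1
    if "er_visits" in present:
        score += 0.3
    if "age_over_60" in present:
        score += 0.2
    if "er_visits" in present and "specialist_visit" in present:
        score += 0.2
    return score

FEATURES = ["er_visits", "age_over_60", "specialist_visit"]

def shapley(feature):
    """Exact Shapley value: the feature's average marginal contribution
    to the model score over all subsets of the other features."""
    n = len(FEATURES)
    others = [f for f in FEATURES if f != feature]
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (model_score(set(subset) | {feature})
                               - model_score(set(subset)))
    return total

contributions = {f: shapley(f) for f in FEATURES}
```

The contributions sum exactly to the gap between the full-feature score and the baseline (a defining property SHAP preserves), and the interaction effect is split between the two interacting predictors. In practice, the `shap` package computes fast approximations of these values for tree models over thousands of features.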
6.0 Rare Disease Case Study – Putting the Methodology into Action
To illustrate the above methodology, a recent study of a rare hereditary disorder is summarized below. We describe the real-world process of applying the methodological flow above to build a model that seeks to identify potentially undiagnosed patients and provide clinical insight into the pathway to disease.
6.1 Study Background
Patients with the rare condition of interest often present with symptoms that resemble more common chronic illnesses. Because the disease is rare, many physicians are unfamiliar with it, and the diagnosis is not top of mind in most cases. These factors make patients difficult to identify and diagnose, often resulting in delayed diagnosis, incorrect treatment, and unnecessary surgical intervention. The goal of the study was to use a model both to identify HCPs who would benefit from increased awareness of the disorder and to understand the pathway to diagnosis, ultimately to accelerate time to diagnosis and appropriate management of the debilitating symptoms associated with the disease. Because the study combined clinical and commercial outreach goals, the team selected an open claims data set for the analysis.
6.2 Patient Selection (Positive and Negative Cohorts)
A challenge in this study was that both the ICD-9 and ICD-10 codes for the condition are shared across multiple conditions, requiring specific refinement of the positive patients identified in the claims database. Treatments included a medication with a label specifically indicated for the disease (disease-specific treatment) and medications used across several conditions (disease non-specific treatments). Patients were selected into the positive cohort if they had at least one claim for a disease-specific treatment. Additionally, patients were selected into the positive cohort if they met the composite criteria of at least one claim for a shared ICD-9 or ICD-10 code together with at least one claim for a disease non-specific treatment.

For the negative cohort, patients were selected if they had no evidence of the disease-specific treatment or of the combination of diagnosis code and disease non-specific treatment described above. Given the size of the negative cohort and the rarity of the disease (an estimated prevalence of ~1 in 30,000), an evaluation of the number of potentially unrecorded patients (false negatives) in the negative cohort concluded that the machine learning techniques used could absorb the minuscule level of noise expected, eliminating the concern over unknown positives.
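The cohort rules above reduce to simple boolean logic over per-patient claim flags. The sketch below uses hypothetical flag names and toy data to show the selection; the real study would derive these flags from claim-line records.

```python
import pandas as pd

# Hypothetical per-patient claim flags; names are illustrative.
claims = pd.DataFrame({
    "patient_id":               [1, 2, 3, 4, 5],
    "has_shared_icd_code":      [True, True, False, False, True],
    "has_specific_treatment":   [False, True, False, False, False],
    "has_nonspecific_treatment": [True, False, False, True, False],
})

# Positive cohort: a disease-specific treatment claim, OR the composite
# of a shared ICD-9/ICD-10 code plus a disease non-specific treatment.
is_positive = claims["has_specific_treatment"] | (
    claims["has_shared_icd_code"] & claims["has_nonspecific_treatment"]
)
positive_cohort = claims[is_positive]
# Negative cohort: no evidence of either criterion.
negative_cohort = claims[~is_positive]
```

Patient 1 qualifies via the composite criterion and patient 2 via the disease-specific treatment; patients 4 and 5 each satisfy only one half of the composite rule and fall into the negative cohort.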
6.3 Feature Generation and Model Training
A combination of data-driven and hypothesis-driven approaches was used to generate a comprehensive list of over 300 medical events considered as potential predictors in the model. To fully capture the richness and complexity of the data, metrics including the frequency, sequence, and timing of events were generated for each predictor, resulting in over 1,200 total variables used by the model.
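Deriving frequency and timing variables from an event log can be sketched as a single grouped aggregation. This is an illustrative example: the event names, the `days_before_index` column, and the choice of first/latest timing features are assumptions standing in for the study's actual feature pipeline.

```python
import pandas as pd

# Illustrative claims-event log: one row per medical event per patient,
# with time expressed as days before the index (diagnosis) date.
events = pd.DataFrame({
    "patient_id":        [1, 1, 1, 2, 2],
    "event":             ["er_visit", "er_visit", "mri", "er_visit", "mri"],
    "days_before_index": [400, 30, 90, 10, 500],
})

# For each (patient, event) pair, derive the frequency of occurrence and
# the timing of the first and most recent occurrence before diagnosis.
features = events.groupby(["patient_id", "event"])["days_before_index"].agg(
    frequency="count", first_event="max", latest_event="min"
).unstack(fill_value=0)
```

Each medical event thus expands into several model variables (here three per event), which is how roughly 300 events yield over 1,200 inputs.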

A gradient boosting tree model (XGBoost) was trained on the dataset described above. Model performance was evaluated using a PR curve projected to the prevalence of the disease. In testing, the model identified patients at a precision of 23% at lower recall levels. Compared with examining patients at random, based on the estimated prevalence above, the model is almost 7,000 times more effective at finding potentially undiagnosed patients. Predictor importance and relative risk analyses confirmed the key medical factors identifying potentially undiagnosed patients with the rare disease. Insights into the timing of these medical events, and their impact on the likelihood that a patient is potentially undiagnosed, were evaluated in relative risk curves and guided the design of outreach messaging focused on accelerating diagnosis.
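The prevalence projection and the lift figure above follow from the arithmetic below. The projection function is a sketch under the standard assumption that recall and false positive rate are preserved when moving from a sampled test set to the full population, so only the class prior (prevalence) changes.

```python
# Values from the case study: estimated prevalence ~1 in 30,000 and
# projected precision of 23% at lower recall levels.
prevalence = 1 / 30_000

def project_precision(precision_s, prevalence_s, prevalence_p):
    """Rescale a precision observed at sample prevalence prevalence_s
    to the population prevalence prevalence_p, holding the model's
    recall and false positive rate fixed. The precision odds
    TP/FP scale with the ratio of class-prior odds."""
    odds_s = precision_s / (1 - precision_s)
    prior_ratio = (prevalence_p / (1 - prevalence_p)) / (
        prevalence_s / (1 - prevalence_s))
    odds_p = odds_s * prior_ratio
    return odds_p / (1 + odds_p)

# Lift over random screening, where precision equals prevalence:
projected_precision = 0.23
lift = projected_precision / prevalence
print(round(lift))  # roughly 6,900x more effective than random
```

Random examination of patients would yield one true case per ~30,000 reviewed, whereas flagged patients are true cases 23% of the time, which is the source of the near-7,000x factor.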
7.0 Conclusion
AI modeling for disease detection has ample opportunity to drive earlier diagnosis for patients in need and to equip pharmaceutical companies with highly advanced, targeted diagnostics that help these patients get properly diagnosed and treated earlier in their disease journey. As these algorithms grow in use, applications will widen to include, for example, timed prediction of diagnosis (i.e. predicting a diagnosis a certain amount of time in advance), ongoing autonomous learning from newly diagnosed patients (and to account for market changes), and incorporation into EHR systems to predict risk across not just one but numerous disease states at once. The use of these models, and the advancement they bring to the healthcare space, is undoubtedly valuable, but must be approached with the proper methodological inputs, business considerations, and statistical validation.
About the Authors
The authors of this publication represent the Predictive Analytics practice in IQVIA’s Real-World Solutions global group. The team develops innovative solutions to solve challenging healthcare problems based on patient-level data using a variety of advanced statistical and machine learning methods. This development encompasses applications such as physician targeting and risk stratification algorithms aimed at, for example, finding undiagnosed patients or identifying patients suitable for treatment escalation. Our efforts help improve retrospective clinical studies, under-diagnosis of rare diseases, personalized treatment response profiles, disease progression predictions, and clinical decision-support tools. For questions or more information regarding this article, please contact the authors.
1 SNOMED CT & Other Terminologies, Classifications & Code Systems. URL https://www.snomed.org/snomed-ct/sct-worldwide

2 Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785

3 Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), 2015. doi: 10.1371/journal.pone.0118432

4 Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: a survey. Machine Learning, 109:719–760, 2020

5 Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 387–394, Sydney, Australia, 2002

6 Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), 2017