### Improve HCP Email Message Response Through Personalized and Optimal Messaging Strategies Derived From Machine Learning Models

*Anna Decker, Ph.D., Senior Data Scientist, Aktana Inc. and Marc-David Cohen, Ph.D., Chief Science Officer, Aktana Inc.*

*Machine learning (ML) approaches are commonly used in retail settings and are starting to emerge in pharmaceutical brand management and sales operations. These approaches to predictive analytics can identify optimal messaging and interaction strategies that improve communication with healthcare providers (HCPs) by finding complex non-linearities in response functions that have many potential predictors. Non-parametric ML models are particularly helpful when insight into the functional form of the relationship between predictors and the response is not the main objective, but quick, automated implementation in an environment where the number and type of predictors can change rapidly is important. Traditional methods based on parametric models are more valuable when a deep understanding of the shape of the relationship between predictors and the predicted response is desired and used to illuminate insights. For example, capturing complex non-linearities with logistic regression requires explicit transformations of the predictors, in contrast to random-forest-based models, where hidden non-linearities are “found” by the modeling technique itself. Furthermore, ML models require fewer distributional assumptions than many more “traditional” approaches, while predictive accuracy is often improved. We introduce these concepts and demonstrate some of these observations with an example. The main focus of this paper is to describe the problem of personalizing frequently changing messages to individual HCPs using an ML approach chosen for its ability to be automated and for predictive accuracy comparable to other techniques. We show a progression of uses of random forest models that (1) provides an automated approach to predicting the next best message to send to each individual HCP and (2) identifies the segment-specific messages that HCPs are most likely to read. This model accounts for HCP covariates and historical HCP reactions to previously sent messages. Using this machine-learned model of the joint distribution of the data, we simulate expected outcomes under hypothetical interventions to generate longitudinal predictions of message reaction. Our experience is that clients develop confidence when they understand the model through insight into these longitudinal predictions and then quickly proceed to full integration of ML model scoring to realize its full potential. We demonstrate these methodologies with anonymized data from two clients and show estimates of expected improvements as well as an actual doubling of email open rates using these new techniques.*

*machine learning, optimal HCP message strategies, simulation, big data analytics, improving email response rates, automating strategy execution*

Message sequencing is one aspect of a larger optimal messaging strategy. Understanding and optimizing the timing, content, and types of messages within and across communication channels is important for managing the relationship between HCPs and the brand and how valuable the brand is to the HCP. This article provides one optimization approach for message sequence with the specific example of email messages.

The utility of ML models in this approach is paramount. They are commonly used to estimate customer reactions in retail settings, for example Bayesian networks in the direct-marketing response models of Cui, Wong, and Lui, or rule induction and neural networks in the data mining work of Bose and Mahapatra.^{1,2} In the area of pharmaceutical brand management and sales operations, there is an opportunity to use ML models to personalize and automate marketing execution, ensuring the most efficient and effective distribution of marketing resources.

Building a predictive model that performs well on both collected and new data is vital to this approach. Machine learning allows for more flexible modeling of the underlying joint distribution of covariates and features capturing historical HCP message behavior, yielding a model that more closely resembles the truth than a typical linear regression.^{3}

A variety of methods such as logistic regression, Bayesian generalized linear models, or recursive partitioning could be used for this modeling; however, many available ML methods are more effective at capturing complex non-linearities in the data.^{4,5} Additionally, it is important to take steps to protect against overfitting, i.e. the model working well on training data but not performing well on new test data.^{3} Cross-validation is an effective method that we employ to avoid overfitting.^{6} In particular, random forests are an ML technique that builds decision trees from random resamples of the data. Random forests are inherently well suited to this problem, since the decision to open or click a message is dichotomous and has an associated probability of occurring;^{4} however, the technique can also be used with other measures of HCP interest.

Given this type of predictive model, the probability that a given message will be opened or clicked can be estimated from the HCP-level covariates and the history of messages received. Comparing these probabilities across the potential next messages yields the choice that maximizes HCP interest in the next message. This prospective prediction allows for more intelligent message suggestions, adapts over time as new messages are added to the message set, and applies to new accounts as well. Thus, rather than an arbitrary sequence of messages that is the same for all HCPs, the algorithm provides a personalized sequence, based on covariate data and historical message-reaction data, that maximizes the chance of an open or click.

*Potential Outcomes*

Let W represent the matrix of baseline covariates in a data set of interest, A represent the send of a message of interest, and L represent the past sends and opens of other messages in the message set, and Y be an indicator of whether the message of interest has been opened or not. We are interested in modeling the conditional expectation of Y given the other variables, i.e. E[Y |A, W, L], which could be modeled using many approaches. Estimating the probability of a message open/click given the historical covariates and past open/click behavior of the HCP for each potential message allows for a ranking of the messages based on the predicted probability of open/click. The message with the highest probability can be promoted as the next best message to send.
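As an illustrative sketch of this ranking step (toy data, with scikit-learn's `RandomForestClassifier` standing in for the production model; all column layouts here are hypothetical), each candidate message can be scored by its predicted open probability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy training data: W = baseline covariates, L = past sends/opens of other
# messages, A = which message was sent (one-hot), y = open indicator.
n, n_messages = 500, 4
W = rng.normal(size=(n, 3))
L = rng.integers(0, 2, size=(n, n_messages))
A = np.eye(n_messages)[rng.integers(0, n_messages, size=n)]
X = np.hstack([W, L, A])
y = (rng.random(n) < 0.3).astype(int)

# Model E[Y | A, W, L] with a random forest.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# For one HCP, score every candidate message and promote the best one.
w, l = W[0], L[0]
probs = [
    model.predict_proba(np.hstack([w, l, np.eye(n_messages)[a]]).reshape(1, -1))[0, 1]
    for a in range(n_messages)
]
next_best = int(np.argmax(probs))  # message with the highest predicted open probability
```

The message indexed by `next_best` would be promoted as the next best message to send to this HCP.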

In order to determine an optimal sequence of messages for a particular segment of HCPs or for individual HCPs, an additional step is required. Using the modeled distribution of the probability of a message open, we deterministically intervene, setting the message send to “yes” for each model, and then predict using this intervened-upon data to obtain the probability of a message open under a potentially counterfactual message send.^{7} This technique is commonly used to estimate parameters motivated by the causal inference literature and provides an estimable statistical parameter that can make use of ML models. More formally, the so-called counterfactual outcome is calculated as

Y_{1} = E[Y | A = 1, W, L]

which is the average observed outcome for HCPs that did receive the message and the counterfactual outcome for HCPs that did not receive the message. This parameter corresponds to the treatment-specific mean under a deterministic intervention and is well studied in the causal inference literature.^{7-10} The predicted probability under this intervention measures how likely the message is to be opened if all HCPs received it.
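A minimal sketch of this deterministic intervention, using simulated data and a random forest as the fitted model of E[Y | A, W, L] (the data-generating numbers are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Toy data: A = observed send indicator, W = covariates, L = past message
# history, Y = open indicator (open rate is higher when the message was sent).
n = 400
A = rng.integers(0, 2, size=n)
W = rng.normal(size=(n, 2))
L = rng.integers(0, 2, size=(n, 3))
X = np.column_stack([A, W, L])
Y = (rng.random(n) < 0.2 + 0.3 * A).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, Y)

# Deterministic intervention: set A = 1 for every HCP, then predict.
X_do1 = X.copy()
X_do1[:, 0] = 1
Y1 = model.predict_proba(X_do1)[:, 1].mean()  # estimated open rate if all HCPs were sent the message
```

The averaged prediction `Y1` estimates how likely the message is to be opened if every HCP received it.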

*Sequence Prediction*

Computationally, the conditional probability of an open or click given the history of message sends and covariates can be fit using a variety of methods.^{5,11} In our implementation, we utilized random forests to fit the conditional probability, avoiding unnecessary assumptions about the functional form of the relationship between the probability of a message being opened or clicked and the covariates and history of message opens. We fit a separate random forest model for each message given past observed data and use an ordered progression of models to generate the optimal sequence of messages to send for a finite message set. Note that these models may be built at the individual level (if there are enough data points), at a pre-specified segment level, or for all HCPs.

From the initial message model and its associated predictions, two copies of the intervened-upon data were created. In one copy, the counterfactual assumption was that each HCP opened the first, highest-likelihood message; in the other, that each HCP did not open that message. These intervened-upon data sets were fed into the random forest models to estimate the message each HCP would next be most likely to open if sent. These predictions were then used to create the next pair of counterfactual data sets, and repeating the process yields, for each HCP, a sequence of the messages they are next most likely to open. Repeating this for each possible next message allowed us to rank the messages in descending order of open probability given possible past send and open patterns. Applying a similar procedure across the entire message set, we obtained sequences of messages with their associated probabilities. At each step, the next best message to send is determined by the history of sends and opens. This simulation allows for prospective sequence prediction and gives a potential optimal sequence. However, send and open behavior may change for an individual HCP, so the observed sequence may deviate from what is predicted here. This process is summarized in Figure 1.
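The loop just described can be sketched as follows, with toy data and one random forest per message; for brevity this sketch follows only the optimistic ("opened") branch of each counterfactual pair, and all feature layouts are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Toy setup: one random forest per message; features are covariates plus a
# send/open indicator pair for every message in the set.
n, n_messages = 300, 3
W = rng.normal(size=(n, 2))
hist = rng.integers(0, 2, size=(n, 2 * n_messages))

models = []
for m in range(n_messages):
    X = np.hstack([W, hist])
    y = (rng.random(n) < 0.3).astype(int)  # toy open labels for message m
    models.append(RandomForestClassifier(n_estimators=50, random_state=m).fit(X, y))

# Greedy simulation for one HCP: pick the unsent message with the highest
# predicted open probability, intervene on the history, and repeat.
w, h = W[0].copy(), hist[0].astype(float).copy()
sequence, remaining = [], set(range(n_messages))
while remaining:
    x = np.hstack([w, h]).reshape(1, -1)
    step_probs = {m: models[m].predict_proba(x)[0, 1] for m in remaining}
    best = max(step_probs, key=step_probs.get)
    sequence.append(best)
    remaining.discard(best)
    h[2 * best] = 1      # counterfactual: the message was sent...
    h[2 * best + 1] = 1  # ...and (optimistic branch) was opened
```

The resulting `sequence` orders the full message set by simulated next-best sends for this HCP.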

The determination of an optimal sequence requires a finite and small message set and assumes that messages are not sent repeatedly, although repeat sends could be accounted for. It also assumes a correctly-specified model.

*The Utility of Random Forest*

We elected to use random forest to model the conditional distribution because of its ability to detect complex interactions and non-linearities in the data. Random forests are ensemble learners that combine so-called “weak” learners (decision trees). The algorithm resamples a percentage of the data and a subset of the predictor variables to build each decision tree, in which predictors are added based on how well they partition the data according to some objective function. Two parameters typically control the overall fit of a random forest: the depth of the trees and the number of trees in the ensemble. Care should be taken to match these to the dimensionality of the data to avoid overfitting.
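As a sketch of exercising these two tuning parameters (scikit-learn's `RandomForestClassifier` on synthetic data, with cross-validation, as discussed earlier, guarding against judging fit on training data alone):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dichotomous outcome, as in the open/click setting.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The two knobs discussed above: tree depth and ensemble size.
shallow = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
deep = RandomForestClassifier(n_estimators=500, max_depth=None, random_state=0)

# Cross-validated accuracy for each configuration.
acc_shallow = cross_val_score(shallow, X, y, cv=5).mean()
acc_deep = cross_val_score(deep, X, y, cv=5).mean()
```

Comparing the cross-validated scores, rather than in-sample fit, is what lets the depth and ensemble size be matched to the dimensionality of the data.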

To demonstrate random forest’s capabilities, we performed a simulation study in which we generated predictor variables and related the outcome to them through a known functional form. The simulation proceeded as follows:

X1 ~ Normal(*μ* = 5, *σ* = 8)

X2 ~ Normal(*μ* = 0, *σ* = 2)

X3 ~ Binomial(p = mean(X1 > 6))

X4 ~ Binomial(p = mean(X1 + X2 < 0.4))

X5 ~ Normal(*μ* = mean(X3), *σ* = 5)

Y = X1 + X1*X2 − X3 + X5^{2} + X3^{3} + X5*X3*X1
The relationship between Y and the predictors was made deliberately complex to demonstrate the utility of random forest for detecting such relationships. We compared the performance of a simple linear regression, a random forest, and a correctly specified model using mean-squared error (MSE). The expectation was that the correctly specified model would fit best, i.e. have the lowest MSE; that the random forest would capture some of the interactions and have the second-best MSE; and that simple linear regression would have the highest MSE. These expectations were largely met; in this run, the random forest’s MSE was in fact slightly lower than that of the correctly specified model. The results are summarized in Table 1.

*Table 1. Mean-squared error by model in the simulation study.*

| Model | Mean-squared error (MSE) |
| --- | --- |
| Simple regression | 2398.2 |
| Random forest | 1235.2 |
| Correctly specified model | 1250.8 |

For an outcome with complex interactions and non-linearities, random forest performed comparably to the correctly specified model while simple regression did not. Since we do not know the underlying functional form of the relationship between the outcome and predictors, it makes sense to choose a method that does not make unnecessary assumptions about that functional form. In this example, a careful study of the relationship between the predictors and targets might yield a correctly specified parametric model, but the larger the number of predictors and the more complex the relationship, the harder this structure is to uncover. Therefore, we advocate for the use of an algorithm that can learn the joint distribution empirically. Using this as the basis for the message sequence optimization procedure ensures that the conditional distribution is modeled as accurately as possible. When fitting these models, we output predictive performance metrics such as area under the receiver operating characteristic curve, accuracy, and sensitivity in order to assess model performance. In general, we have seen that using covariate data together with send, open, and click history results in high predictive performance.
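The MSE comparison can be sketched as follows (synthetic data drawn per the simulation above; the sample size and train/test split are chosen for illustration, so the exact MSE values will differ from Table 1):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Data generated as in the simulation study above.
X1 = rng.normal(5, 8, n)
X2 = rng.normal(0, 2, n)
X3 = rng.binomial(1, (X1 > 6).mean(), n)
X5 = rng.normal(X3.mean(), 5, n)
Y = X1 + X1 * X2 - X3 + X5**2 + X3**3 + X5 * X3 * X1

X = np.column_stack([X1, X2, X3, X5])
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, random_state=0)

# Simple regression sees only main effects; random forest can learn the
# interactions and the quadratic term from the data.
mse_lin = mean_squared_error(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
mse_rf = mean_squared_error(
    y_te, RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr).predict(X_te)
)
```

On held-out data, the random forest's MSE is substantially below the linear model's, mirroring the pattern in Table 1.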

*Limitations of the Current Approach*

The method proposed above is based on historical data that have been collected in CRM (Customer Relationship Management) systems. Thus, brand new messages that have not been previously sent are required to undergo a period of being sent at random to gather send, open, and click data before a model can be built. We specified three possible strategies for these random sends: aggressive, moderate, and conservative. All three are based on the range of predicted probabilities of opens across all messages for all HCPs (or segments of HCPs). The aggressive strategy sends the new message with high probability relative to this range, the conservative sends it with low probability relative to this range, and the moderate strategy sends the new message with a medium relative probability. These strategies allow the customer to prioritize new messages and control the amount of time required to collect the necessary data to build a new model. In our implementations of this approach, we found that a reliable model can be built once a message has been sent at least 20 times. Additionally, a text analysis of historical messages and their similarity to new message content could be used to guide initial send probabilities.
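A minimal sketch of mapping these three strategies to a send probability; the specific positions within the range (0.9, 0.5, 0.1) are assumptions for illustration, not values prescribed by our implementation:

```python
import numpy as np

def new_message_send_probability(predicted_probs, strategy):
    """Place a brand-new message's send probability within the range of
    predicted open probabilities across the existing message set."""
    lo, hi = float(np.min(predicted_probs)), float(np.max(predicted_probs))
    # Assumed positions within the range; the exact quantiles are illustrative.
    position = {"aggressive": 0.9, "moderate": 0.5, "conservative": 0.1}[strategy]
    return lo + position * (hi - lo)

# Predicted open probabilities for the existing messages (toy values).
existing = np.array([0.05, 0.12, 0.30, 0.22])
p_aggressive = new_message_send_probability(existing, "aggressive")
p_conservative = new_message_send_probability(existing, "conservative")
```

The aggressive strategy yields a send probability near the top of the observed range, the conservative one near the bottom, which controls how quickly the 20-send threshold for a new model is reached.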

In addition to handling new messages, the method as proposed does not allow messages to be resent. However, a reasonable extension that we are implementing is to allow messages to be sent again under certain conditions, such as if a message was not previously opened or clicked or if a certain amount of time has passed. While the current method avoids message fatigue, where the same message is sent over and over, messages may need to be sent more than once before they are opened or clicked.

This approach requires a set of HCPs, a finite message set to be sent to that set of HCPs, and sufficient past data to support the construction of models for each message, i.e. a sufficient number of past sends, past opens, and past clicks. It is adaptive, in that models can be refit to account for new send, open, and click behavior, additional models can be built for new messages, and new HCPs can be added. It does not take into account other communication channels such as visit details, webinars, or phone calls. A multi-channel messaging strategy built on the same methodology described above could provide better overall insights into messaging strategies in the future.

The optimal sequence can then be determined by choosing the sequence with the maximum joint probability of open. This sequence may be used in the planning stages before rolling out a new messaging strategy. The models may also be used to determine the next best message to send based on historical open and click behavior as well as demographic variables.
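Choosing the sequence with the maximum joint open probability can be sketched as an enumeration over permutations; the per-position probabilities below are hypothetical stand-ins for the model predictions, and the product form assumes independence across steps purely for illustration:

```python
from itertools import permutations

import numpy as np

# Hypothetical open probabilities for each (message, sequence position) pair,
# standing in for per-message model predictions under simulated interventions.
probs = {
    ("A", 0): 0.30, ("A", 1): 0.20, ("A", 2): 0.10,
    ("B", 0): 0.25, ("B", 1): 0.28, ("B", 2): 0.15,
    ("C", 0): 0.10, ("C", 1): 0.12, ("C", 2): 0.20,
}

def joint_open_probability(seq):
    # Product of per-step open probabilities (illustrative independence assumption).
    return float(np.prod([probs[(m, i)] for i, m in enumerate(seq)]))

# Enumerate all orderings and keep the one with the maximum joint probability.
best_seq = max(permutations("ABC"), key=joint_open_probability)
```

Full enumeration is feasible only because the method assumes a finite and small message set, as noted above.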

The message sequence optimization tool was recently deployed for almost 400 representatives who contacted approximately 1260 HCPs regarding a single product for Customer B. To preserve client confidentiality, we are unable to share detailed performance impact. However, the tool has been in use for three months and so far, we have seen open rates double.

The utility of ML to this approach is vital. Random forest performed about as well as a correctly specified parametric model in a simulation study, and the message sequence optimization procedure relies on the conditional distribution of the probability of opening a message being modeled correctly. Other ML approaches could also be used to model the conditional distribution of the data, such as support vector machines, neural networks, nearest-neighbor methods such as k-nearest neighbors, or ensemble stacking methods.^{3,12–18}

The example detailed in this article focused on optimizing email message sequence. Since email messages have a clear response (open or click), they provide a clear outcome variable to model. A natural extension of this analysis would be to expand to other channels of communication such as visit details, webinars, or congress meetings. This expansion would complicate the optimization approach in two ways: (1) it would require a channel-agnostic measurement of HCP engagement, and (2) it would require modeling the interaction between channels. Additionally, this method determines the next best message to deliver but not the optimal timing of the next message. The timing of a send can have a large impact on the probability of an open, since HCPs may be more or less receptive to a message depending on the time between events. For example, an email sent to follow up after an in-person visit might have a higher propensity to be opened than an unsolicited message. To address this additional source of variability in HCP open behavior, we currently use a separate model, also based on a machine-learned model of the joint distribution of the data, to determine when to send messages.

**Anna Decker**

*, Ph.D., Senior Data Scientist, Aktana Inc., is a statistician specializing in causal inference and semiparametric estimation using machine learning. At Aktana she builds machine learning models with an emphasis on thoughtful application of modeling and statistical parameter estimation to answer marketing questions. Previously, she was a bioinformatics scientist at Genentech in research and development. She received her PhD in Biostatistics from the University of California, Berkeley.*

**Marc-David Cohen**

*, Ph.D., Chief Science Officer, Aktana Inc., is an experienced business leader with a background in operations research and statistics. At Aktana he leads the development of learning and insight generation capabilities. Previously he served as CSO at Archimedes Inc.—a Kaiser Permanente Company—and helped the company transform from HEOR pharmaceutical consulting to a products company focused on clinical studies and personalized medicine. Previous roles included VP of Research at FICO and multiple senior roles at SAS Institute where he initiated the SAS Marketing Optimization product.*

^{1 } Bose I, Mahapatra RK. Business data mining — a machine learning perspective. *Inf Manage*. 2001 Dec 20;39(3):211–25.

^{2 } Cui G, Wong ML, Lui H-K. Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming. *Manag Sci*. 2006 Apr 1;52(4):597–612.

^{3 } Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. 2nd ed. New York: Springer; 2009. (Springer Series in Statistics).

^{4 } Breiman L. Random forests. *Mach Learn*. 2001;45(1):5–32.

^{5 } Gelman A, Jakulin A, Pittau MG, Su Y-S. A weakly informative default prior distribution for logistic and other regression models. *Ann Appl Stat*. 2008 Dec;2(4):1360–83.

^{6 } van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. *Stat Appl Genet Mol Biol*. 2004;3(1).

^{7 } Robins JM. Marginal Structural Models versus Structural nested Models as Tools for Causal inference. In: Statistical Models in Epidemiology, the Environment, and Clinical Trials [Internet]. Springer, New York, NY; 2000 [cited 2017 Nov 6]. p. 95–133. (The IMA Volumes in Mathematics and its Applications). Available from: https://link.springer.com/chapter/10.1007/978-1-4612-1284-3_2

^{8 } Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. *Math Model*. 1986 Jan 1;7(9):1393–512.

^{9 } Wang A, Arah OA. G-Computation Demonstration in Causal Mediation Analysis. *Eur J Epidemiol*. 2015 Oct;30(10):1119–27.

^{10 } Naimi AI, Cole SR, Kennedy EH. An introduction to g methods. *Int J Epidemiol*. 2017 Apr 1;46(2):756–62.

^{11 } Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Taylor & Francis; 1984. 372 p.

^{12 } Vapnik V. Estimation of Dependences Based on Empirical Data. Springer; 2006.

^{13 } Specht DF. Probabilistic neural networks. *Neural Netw*. 1990 Jan 1;3(1):109–18.

^{14 } Fix E, Hodges J. Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties. California Univ Berkeley; 1951 Feb.

^{15 } van der Laan MJ, Polley EC, Hubbard AE. Super Learner. *Stat Appl Genet Mol Biol*. 2007;6(1).

^{16 } Burez J, Van den Poel D. CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services. *Expert Syst Appl*. 2007 Feb 1;32(2):277–88.

^{17 } LeCun Y, Bengio Y, Hinton G. Deep learning. *Nature*. 2015 May 27;521(7553):436–44.

^{18 } Kim H-C, Pang S, Je H-M, Kim D, Yang Bang S. Constructing support vector machine ensemble. *Pattern Recognit*. 2003 Dec 1;36(12):2757–67.