An Approach to Operationalizing Next Best Action in Pharmaceutical Communications and Marketing

An Approach to Operationalizing Next Best Action in Pharmaceutical Communications and Marketing
Marc-david Cohen, Ph.D., Chief Science Officer, Aktana Inc.; Amy Baer, Senior Data Scientist, Aktana Inc.; Michael Steiner, Senior Data Scientist, Aktana Inc.
Abstract:
The growth of on-line commerce has focused a lot of attention on finding algorithms and approaches to identifying the next best action (NBA) in personalized marketing and sales communications. When the Netflix challenge was launched in 2006 it started a wave of research into collaborative filtering and other approaches to recommendations. The algorithmic approaches that emerged from this challenge continue to be used in the quest to find the personalized message, its timing, the best delivery channel, and message content that is most impactful to the bottom line. However, there is a big difference in complexity between a recommendation for purchase in a retail setting and leading a health care provider (HCP) on a journey to improve pharmaceutical sales and brand loyalty.

Some of this complexity stems from the large number of decision variables, including action timing, and covariates that must be accounted for in an all-inclusive model. Because of the high dimensionality of this problem a large number of observations would be needed to build an accurate and predictive model across the entire space. Then, to use such a predictive model in practice the impact of counterfactuals of the decision variables would have to be estimated and evaluated. The high dimensionality would mean that a very large space would have to be searched to find the estimated best action.

Unfortunately, for model training real-world sales operations data are often limited, so it is likely that large segments of the decision space are not adequately represented in the training data. We propose decomposing the decision space, training separate models for different decisions, and then finding the decisions that maximize these marginal models separately and independently. Strictly speaking the resulting solution will be sub-optimal since it discounts covariance between decisions determined from these independently trained models. However, if there is little training data that would be available for the single all-encompassing model one might expect that the loss from suboptimal decisions would be minimal.

For simplicity we motivate these ideas with an example that has the objective of maximizing email opens. We discuss how this might be applied with several conditional models including ones that (1) predict next message to send, (2) timing of the message send, and (3) wording of the email or message topic.

We use a simulation to show the tradeoffs of using conditional models. The simulation, as in practice, uses a non-linear machine learning approach to train a logistic function with the objective of being able to predict which of a set of possible actions maximizes a zero-one target such as email opens.

Finally, there is additional value to training and integrating separate models into a large automated production environment by minimizing change in the execution environment. This means that
  • Models are quicker to implement, require less data, and yet can approximate the NBA;
  • Innovation can be brought to the client much more rapidly with minimal if any cost in predictive strength from using component models;
  • Additional data can be more easily brought into existing models to improve predictive accuracy on component targets.
Keywords: Next Best Action, prediction accuracy, machine learning, modeling covariance, conditional models, real-world data

Introduction
There has been a lot of discussion in the popular press about next best action (NBA). This concept evolved from market-basket analyses to identify product affinities – if you bought this product then you’d be likely to buy this other product. Much has been written about this and a lot of analytic methodologies have been developed and refined to solve this problem. The Netflix challenge in the early 2000s focused attention on the problem and resulted in significant improvements in predictive analytics, methodologies, and general approaches to predicting what people “like” or are interested in given their history of affinities and purchases.

Much of this work evolved into “recommendation engines” and other concepts that would recommend to consumers a product or service that they would likely be interested in given their history of purchases. Another concept that evolved as part of this discussion and during this period is the idea of 1-1 marketing or personalized marketing. This approach seeks to present offers to consumers that are individually tailored to them based on their historical purchasing and interest patterns.

Recently, these concepts and other related ideas have been developing within pharmaceutical brand management and sales operations with an effort to market to physicians and other health care providers (HCP) in a personalized way, namely personalized to the individual HCP, so that contact strategies and specific communications are made to the HCPs based on the history of communications and interactions they have had and on how the HCPs reacted to those encounters. If one considers all the possible types of communications (and actions such as sampling) and channels for delivery, the combinatorics of the decision problem becomes very large, and simply identifying the next best communication to deliver and over which channel to deliver that communication is difficult.

Ultimately, the purpose of all the communications is to increase sales, prescription writing, and use of the therapeutic. The goal of NBA is to inform this communication strategy and lead the HCP on a journey to optimize the sales objective. However, the scale of decision making on individual communication’s impact on sales is difficult, if not impossible, to measure. So, other measures of performance for communications should be considered. Things like did the HCP read an email sent; did the HCP seek more information after a face to face visit by a sales rep; did the HCP go to a website after opening an email; and many other indicators of HCP engagement with the brand.

Consider a framework to NBA where there are two broad components to the decision problem. The first component is an estimate of a response function that captures the relationship between all the relevant variables and the measure of performance. Relevant variables include all we know about the HCP and their practice and the history of communications to the HCP, and whether the performance measure is either discrete or continuous. The second component is a decision model based on the response function. Some of the dimensions of the response function can be considered decision variables. These are the variables under control. For a given observation of an HCP some of the dimensions are fixed and known and others, the dimensions associated with the decision variables, are the ones used to maximize the response function.

One big caveat to this approach is the assumption of causality. In some domains it is very important to identify causality as distinct from correlation. That is certainly true in this domain of marketing and is a focus of effort when making decisions at a segment level that are associated with significant budgets. However, we will sidestep that discussion in our approach and instead focus on our ability to leverage the historical data with automated modeling, associations, and continuous learning.
Next Best Action
Predicting the next best action (NBA) is a formidable task if we consider at any point in time all the possible actions that can be taken that might positively affect the outcome. For simplicity we motivate these ideas with an example that has the objective of maximizing email opens. Consider alternative actions that can be taken at the HCP level: Visit, Send an email, Send a letter, Provide a sample, Invite the HCP to a seminar, Invite the HCP to a webinar, Take any of these actions now or delay, Select the content of the discussion, email, letter, or webinar, Wording for the email header, etc.

Consider how the impact of these choices might be measured, perhaps by some immediate feedback such as whether the HCP opens the email or clicks on a link within the email, or accepts the invitation to the seminar or webinar, or provides direct feedback from a query during the visit. Each of these responses are closely tied to the alternative actions. Measures of script writing as a response are not as closely tied to the direct action, they are more cumulative in nature as a result of many interactions and are much harder to tie to any specific action. Here we focus on the direct feedback resulting from an action since our attention is on the direct 1-1 impact on the HCP.

Let’s consider the target that we want to impact is whether the HCP opens an email. The decision variables for this might be:
  • the timing of email,
  • time of day email is sent,
  • day of the week email is sent,
  • when the last visit occurred,
  • the content of the email as indicated in the email header,
  • the wording of the email header.
For example, we may ask, “Do certain email headers lead to increased open rates for certain HCPs?”. We can then tailor the decision variable accordingly by suggesting reps use these headers when sending emails to those HCPs. In addition to decision variables as predictors other variables are useful such as, the history of contacts to the HCP that included the combinations of values for these decision variables as well as covariates for the HCPs that include HCP demographic information, data on the practice, HCP preferences (say for channels and therapeutics), 360 views of their communication, and if possible insights into HCP patient population and other covariates that have been collected. Provided with enough data a single model with all these decision variables might be built and used for prediction and making decisions. It quickly becomes apparent that this is a very large decision space and would require a significant data set to adequately fit a response function which could be used to estimate.

Of course, the general goal is to increase script writing and sales performance. We assume that the actions that result in more immediate feedback, such as email opens, as a measure of HCP interest and engagement can cumulatively result in improved sales performance.

There are many decisions such as those enumerated above that impact email open likelihood. In the simulation discussion below, we explore the impact of developing separate response models for each of these decisions and then using a product of the separate response models in the decision model.

Note that many of the ideas developed here also apply to the goal of increased sales and as well can be applied when sales is a target of decision making. These will not be explored in this paper. The next section overviews some real-world data that illustrates the size and scope of this data issue.
Typical Real-World Data
A concrete example based on data we have analyzed can provide more clarity on the problem, the dimensionality, and complexity. Consider a single brand that has about 650K HCP accounts. Let’s assume that for this brand and the desired decision model, the performance is measured by an aggregate of HCP outbound communication. For this brand there are three channels, email, snail mail, and face to face meetings with the following dimensions:
  • 15 products
  • 90 potential email templates covering various brand related topics
  • 65 physical letters covering various brand related topics
  • 12 topics for delivery directly through face to face rep meetings
Each contact with an HCP can be characterized, or described, by these various dimensions. If these were all mutually exclusive in describing the contact, there would be over 1 million possible combinations. Although that is not the case since many do not make sense together, for example sending an email and a physical letter in the same contact, there are many combinations that are legitimate.

Between October 2013 and October 2018 there were approximately 30 million contacts between the approximately 3000 sales reps and the 650K HCPs. This averages out to about 3 contacts per month for each HCP. If we look at summaries of a few of these dimensions by year we can get a sense of the concentration of the data. The histograms in Figure 1 show the distribution of event categories by time and product. Note that each communication contains a message delivered in that communication.

Figure 1: Histograms of Real-World Data




In addition to the sparsity in the contact history each HCP may be characterized by various variables and grouped by segment. Often brand teams do deep analysis and use the results of that analysis to segment and control the contact strategy. While our ultimate goal centers on personalization, segmentation can be used for modeling in the event that we do not have enough historic data and due to sparsity. In our example data there are 45 HCP segment and characteristic variables. Table 1 shows a select few of these variables and their relative density within the dataset. Notice that for some of the variables there are a large number of missing values.

Table 1
Variable % Missing Distinct Values
Brand Segment 43.7% 18
Digital Preference 95.5% 2
Therapeutic Preference 91.2% 58
Therapeutic Segment 98.3% 5
Therapeutic RX Quintile 98.2% 6
Multiple Therapeutic Targets 0% 2
Brand Priority 0% 7


If we look across all these account variables there are more than 147K unique combinations. This means that on average there are 4-5 HCPs out of the 650K HCPs that share the same values for these 45 HCP segment and characteristic variables.

In addition to the channels and the HCP characteristics there are various timings over which communications can be delivered. For example, NBA not only includes the message and channel dimensions but also includes the time dimension. For the complete solution we want to consider the effect of various timings on delivering the message. For example, we would want to account for first visiting an HCP and then sending an email in comparison to first sending an email and then visiting. And, the timing between the visit and the send could vary based on contact history and the nature of the visit and the email.

The following plot shows some of the Goodman Kruskal correlations between some of the HCP characteristics. To protect the client’s data, we’ve removed the names of the selected variables from the graph. Typically, these variables describe the HCP, like gender, age, geographic location, regional population where HCP practices, HCP’s average market share for the target therapeutic, educational history, publication history, and other potentially descriptive characteristics. Unlike the correlation coefficient this statistic is not symmetric so that it captures some directionality in the correlations. Notice that there are pockets of higher correlations and that many of the variables have few values for multiple HCPs. The number of HCPs with multiple values for two variables is shown in the diagonal. The simulation that we report on following will attempt to capture some of this structure.
Simulation
As mentioned above the general approach to the decision problem is to model a response surface from the complete set of predictors to a target that is desirable to optimize. We do not focus on the nature of the target variable. Suffice it to say that it is some measure of HCP satisfaction that is highly correlated to client revenue for the target therapeutic. We assume that it is correlated to the collection of predictors. We will also not address the issue of causality vs correlation in the predictors. Although both of these topics are very important, they are not the focus of this paper and would require a much more extensive research. We and our clients assume that some set of predictors is causal and drives the target performance.

The general approach is to fit a response curve to the target. Then, based on the client understanding of causality a set of variables is identified as the decision variables. The resulting decision problem is to find the values of the decision variables that optimize the target for each observation. We fit the response curve using machine learning algorithms such as random forests, gbm, or other technique and then for each HCP fix their characteristic variables and history, and predict the response using the modeled response curve as we vary the decision variables across the range of actions for them.

The core question we explore is the cost of ignoring correlation when building a response surface model. This approach can result in a large prediction model for the response surface. Let X, Y be a set of predictors and Zε{0,1} be a binary target. Then, the response surface is:

pr(Z|X,Y) ≈ pr(Z|X)pr(Z|Y)

Figure 2: Correlations of Real-World Data


Simulation Setup
We explore this with a simulation that replicates some of the structure that we observe in real-world data; namely the sparsity and the small response likelihood in a binary setting. We will explore this across three dimensions, (X,Y,Z), where two of the dimensions (X,Y) are continuous and uncorrelated with each other and the third Z is discrete [0,1] and correlated with each of the other dimensions with correlation ranging from approximately -.08 to -.25. An example of the relationship between this simulation setup and the problem as discussed above is to consider one of the continuous dimensions, say X, as an HCP characteristic and the other, say Y, as a decision variable, for example, what time to send an email. Then, the variable Z is the outcome as to whether the email is opened or not. In practice both X and Y would be multidimensional even though Z would be binary or continuous.

We build three response models each a logistic function estimated using the gradient boosted models with the gbm() function in R.
  • a Joint model predicting pr(Z|X,Y),
  • an X (marginal) model predicting pr(Z|X), and
  • an Y (marginal) model predicting pr(Z|Y).
We then analyze the decision model comparing the performance of the models of pr(Z|X,Y) to pr(Z|X)pr(Z|Y) for various cutoffs for the decision Z = 1 and how the correlation ignoring strategy compares.

The simulation is a grossly simplified 2-dimensional model in contrast to the real-world scenarios where the model space can be 100s of dimensions and much more sparse. We have attempted to emulate the sparsity of the actual responses being fit. The 2 dimensions are both uniform [0,1] random variables with 1000 sampled points. The responses are modeled by a bivariate normal random variable with mean (.5,.5) and variance matrix . We set the target to one for points in the lower left diagonal of the sampled hypercube (X+Y<1) and for values where the sampled bivariate Normal is greater than a specified quantile of the Normal sample. A separate simulation is run and analyzed for each quantile from .1 to 1.

Figure 3: Three Samples






The three plots in the first row (Figure 3) show the samples for the .25, .5, and .75 quantiles. Note that the titles of the plots show actual probability of open, namely Σ(Target = 1)/N where N = 1000 points in the sampled hypercube. The points in red are the observations where the target is 1 and the title shows the actual probability that the target is 1 for each of the samples. The second row just shows the observations where the target is 1.
Simulation Process
The execution logic for each of the 5 simulations is as follows:
  • Sample the predictor hypercube and the target=1 points.
  • Use Generalized Gradient Boosting (gbm) for fitting three logistic regression models: pr(Z|X,Y), joint model with predictors x and y; pr(Z|X), a marginal model with predictor x; and pr(Z|Y), a marginal model with predictor y.
  • Produce various plots to compare the joint, marginal, and product of marginal prediction models fit
  • Compute the confusion matrix for the joint and product models for various cutoffs that might be used in the decision model.
  • Produce various plots to compare the decision model performance for the joint and product of marginal prediction models.
Figure 4: Predictions for the Joint Model and X Marginal Model


The plots in Figure 4 show the predictions for 2 logistic models estimated by the gbm function on the data in the left column in Figure 3 with probability of target=1 of .292. The first row shows the results for the joint model where both x and y were used to estimate the probability that the target is 1. The left plot on the first row shows the joint prediction as a function of the x dimension and the right plot shows the prediction from the joint model as a function of the y dimension. It is interesting to compare the x marginal model predictions as a function of x (lower left plot) vs the prediction as a function of y (lower right plot). As expected, the estimates of the probability of target=1 for the x marginal model show no structure when plotted as a function of y.

Figure 5: Joint Predictions pr(Z|X,Y) vs the Product of Marginal Predictions pr(Z|X)pr(Z|Y)


Figure 5 is a plot of pr(Z|X,Y) vs pr(Z|X)*pr(Z|Y). If there were no correlation between the Z and X, Y we would expect the points to line up on the diagonal because in that case pr(Z|X,Y) = pr(Z|X)pr(Z|Y). The plot shows points where Z = 1 in red. The decision model is derived from the prediction model by providing a probability cutoff value above which the decision is to send the email with the presumption that the email will be opened with probability derived from the prediction model. So, one way to compare the performance of the decision model across the two methods of prediction, one would choose a cutoff value, say .5, and then look at the statistics of performance for the Z=1 vs Z=0 for all observations above that cutoff in each dimension.

One tool commonly used is the confusion matrix and the statistics based on it such as accuracy, sensitivity, and specificity.
Simulation Results
Figure 6 shows such a comparison for two of the simulations. The plots show a comparison of the joint estimates and the product estimates across a spectrum of cutoff probabilities. Observations from the plots in this figure are consistent with what is observed across the simulations which are not shown here.

Figure 6: Comparison of Decision Model Accuracy, Sensitivity, and Specificity




  • The Accuracy of the product estimates is consistently better than that for the joint estimates. The Accuracy measure is the ratio of the sum of the predicted positives that are correctly identified as positive (Sensitivity) and the predicted negatives that are correctly identified as negative (Specificity) divided by the total population. As a result, it contains two sources of error - the sensitivity and the specificity. These represent type I and type II errors.
  • The Sensitivity of the product estimates are consistently below that of the joint estimates.
  • The Specificity of the product estimates are consistently above that of the joint estimates.
  • The differences in these measures converge as the cutoff grows to include larger proportions of the population.
These plots of accuracy, sensitivity, and specificity show that accuracy of the product estimates is higher than the joint estimates because their sensitivity is much inferior to the joint estimates. This means that a decision model based on the product estimates is likely to take action more aggressively and make many more type I errors and many fewer type II errors. So, if the decision is whether to send an email the product estimate will send more emails to people who are not going to open them than the joint estimate base decision model would. But the product model would be less likely not to send email to people who would open them than the joint model. In the context of a marketing decision this is probably desirable since the cost of over-reaching is most likely lower than the cost of under-reaching—which in the context of sending email might be spamming the target audience. If it is more important to avoid spamming, then the methodology could be applied but with reversing the interpretation of the target.

One final note concerns the choice of cutoff value. This is typically done by optimizing a loss function of the specificity and sensitivity that captures the costs associated with the different kinds of errors. The optimum of this may not be the same as the accuracy maximizing point.

The plots in Figure 7 show the differences between the performance of the joint and product estimates across the range of cutoffs for all the probability of Target=1 choices in the simulation and for the three performance measures. For each of these plots, if the difference is positive the joint estimate is “better” than the product estimate and vice versa.

Figure 7: Overall Accuracy, Sensitivity, Specificity by Cutoff






The leftmost plot shows the differences in the accuracy measure as a function of the target probability in the raw data, in other words, if the probability of Target=1 is .15 that means that 15% of the population has a target of one. The various differences are for each of the various cutoffs tested across all the target probabilities. As one would expect, as the cutoff increases to include the entire population (cutoff=1) the differences across each of the measures converge. It is interesting to note that the accuracy and specificity differences are almost monotonically changing with the cutoff value but that the sensitivity shows a change in direction as cutoffs increase.
Conclusion
Operationalizing NBA decisions is a complex issue and involves many tradeoffs. This paper attempts to focus on one small compromise that we believe makes it a more realistic and reachable goal and provides the beginnings of a divide and conquer methodology. We feel this approach helps to add NBA to an automated system designed to provide recommendations and suggestions on a real-time basis in an operational setting. Our experience is that sales operations are very sensitive to changes in an operational system and what impacts those changes might have on suggestions. In practice, if we can limit changes it will limit the cost of putting those changes in place.

This simulation just begins to explore the notion that one can use pr(Z|X)pr(Z|Y) for pr(Z|X,Y) and ignore known correlations. However, we feel there are some explorations that would help to illuminate the tradeoff that this compromise makes. These include:
  • Replications - doing independent replications of this and like-minded simulations to understand the variance associated with these results.
  • Higher dimensions - making the predictors of higher dimensions so that greater insight into the effect of sparsity on the compromise could be obtained.
  • Correlation among some of the predictors - the 2-dimensional simulation we performed were independent. This avoided multicollinearity which might have an impact on the machine learning based logistic models that resulted.
  • More advanced functions characterizing the target - what would the impact be if instead of a binary outcome variable the outcome was ordinal or continuous.
The simulation and much of the discussion does not address the fact that for NBA we are interested in multiple responses that impact overall HCP value. This is a large topic and could not be addressed in a short paper. However, the concepts developed here can be generalized to predict a single measure of performance across the multiple predictive dimensions.
About the Authors
Marc-david Cohen, Ph.D., Chief Science Officer, Aktana, is an experienced business leader with a background in operations research and statistics. At Aktana he leads the development of learning and insight generation capabilities. Previously he served as CSO at Archimedes Inc. - a Kaiser Permanente Company - and helped the company transform from HEOR pharmaceutical consulting to a products company focused on clinical studies and personalized medicine. Previous roles included VP of Research at FICO and multiple senior roles at SAS Institute where he initiated the SAS Marketing Optimization product.

Amy Baer, Senior Data Scientist, Aktana, is a biostatistician with a background in predictive analytics. At Aktana she conducts research-related projects for the machine learning team including improvement of model performance and feature interpretability. Previously, she worked at MedAmerica as a statistician focusing on preventative patient care and emergency room operations. Amy holds a master’s degree in biostatistics from California State University, East Bay and a bachelor’s degree in mathematics from The University of Chicago.

Michael Steiner, Senior Data Scientist, Aktana, is a mathematician with a background in topological data analysis. At Aktana he builds prototype models in order to improve and extend the suite of services offered, and he works on special projects for clients in a consulting role. Previously, he worked at Forio as a Data Scientist focusing on simulations. Michael holds a master’s degree in mathematics from San Francisco State University.
References
1 Altmueller, S., Haralick, R., “Approximating High Dimensional Probability Distributions,” Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), pp. 299-302, Aug 2004.
2 Chessell, M., Pugh, D., “Smarter Analytics: Driving Customer Interactions with the IBM Next Best Action Solution,” http://www.redbooks.ibm.com/abstracts/redp4888.html?Open. Aug 2013.
3 Generalized Boosted Regression Models R package. https://cran.r-project.org/web/packages/gbm/gbm.pdf.
4 Islam, M., Chowdhury, R., Briollais, L., “A Bivariate Binary Model for Testing Dependence in Outcomes,” Bulletin of the Malaysian Mathematical Sciences Society, Vol. 35 Issue 4, pp. 845-858, 2012.
5 Lee, Y., Nelder, J., “Conditional and Marginal Models: Another View,” Statistical Science, pp. 219-238, 2004.
6 Li T., Zhu S., Ogihara M., Cheng Y., “Estimating Joint Probabilities from Marginal Ones”. In: Kambayashi Y., Winiwarter W., Arikawa M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2002. Lecture Notes in Computer Science, Vol. 2454. Springer, Berlin, Heidelberg
7 Moore, C., “Pharma Marketing: Get Started on Creating Great Customer Experiences with Journey Strategies,” https://a.sfdcstatic.com/content/dam/www/ocms/assets/pdf/datasheets/pharma-insight-paper.pdf.
8 Muff, S., Held, L., Keller, L., “Marginal or conditional regression models for correlated non-normal data,” Methods in Ecology and Evolution, Aug. 2016.