Deploying Machine Learning for Commercial Analytics

Jean-Patrick Tsang, PhD and MBA (INSEAD), President of Bayser Consulting

Keywords: commercial analytics, machine learning, panoramic contextual data, feature engineering

Machine Learning is seeing unprecedented success across the board. Not a week goes by without our hearing of a new accomplishment or breakthrough. Let’s mention only two. First, AlphaGo Zero. This is a Reinforcement Learning program developed by DeepMind, a UK-based company that Google acquired in Jan 2014. Zero refers to the fact that the program starts learning from scratch, through self-play alone and without any human game data. After just three days of training, AlphaGo Zero beat by 100 games to 0 the version of AlphaGo that defeated the world champion at Go. Second, the self-driving car. As of Nov 2017, Waymo, Google’s driverless company, started running autonomous minivans around Phoenix with no humans inside to grab the wheel should something go wrong. In just a few months, passengers will be invited to climb aboard the world’s first driverless ride-hailing service. Waymo is arguably the most prominent contender but is far from being the only one. Eighteen companies are vying for a leadership position in the self-driving car market including GM, Ford, Daimler, Renault-Nissan, BMW, and Tesla.
If you are wondering about the relevance of all this to Commercial Analytics for Pharma, I have three key messages for you. First, Machine Learning is extremely relevant for Commercial Analytics and the stars are aligned.  There are a few things that we need to get right though. Second, we identified eight problems that lend themselves to Machine Learning treatment that happen to be key problems for Commercial Analytics. They are high ROI problems in that with the proper deployment of Machine Learning, we’ll be looking at very powerful solutions that may transform Commercial Analytics as we know it. They may even rewrite the agenda of the next generation of problems to tackle.  Third, we’d like to share some of the lessons we learned from doing projects and some of the pitfalls to steer clear of. In a nutshell, data acquisition and feature engineering are key and they play a larger role than algorithm selection. Also, be wary of MINA (Missing Is Now Absent) as it can doom an otherwise perfect project.
Stars are Aligned
Now is an excellent time to get started. When one is having trouble getting up to speed, it’s usually because of one of three reasons. First, poor mastery of the subject matter. Today, this can be fixed easily. There is an abundance of very good material on AI and ML ranging from the basic to the very advanced that is freely available on the web. If you are unsure of how back-propagation updates the weights of the synapses or why Stochastic Gradient Descent overshoots the local minimum or how Ridge regularization differs from Lasso, there are hundreds of sites that shed light on the matter.
Second, no good platforms to work with or they are exceedingly expensive. That’s definitely not the case here. There are several open-source platforms to choose from and you do not even have to shell out a penny.  Here are the major ones.
  • Google’s TensorFlow - probably the most popular one; it’s great for deep learning, mathematical computations, and reinforcement learning
  • Scikit-learn - a high-level framework built on top of NumPy and SciPy that supports both supervised and unsupervised learning
  • Spark MLlib - a general-purpose library that provides algorithms for most use cases and can be used with Scala, Java, Python, and R
  • Facebook’s Torch - a very friendly deep learning tool which owes its friendliness to Lua, a simple scripting language
  • Université de Montréal’s Theano - an excellent low-level library for scientific computing based on Python that is often used with more user-friendly programs such as Keras, Lasagne, and Blocks
  • UC Berkeley’s Caffe and Facebook’s Caffe2 - special-purpose deep learning frameworks that come with an abundance of pre-trained models for image analysis
  • Eclipse Deeplearning4j - a deep learning programming library written for Java that includes implementations of the Restricted Boltzmann Machine, Deep Belief Networks and Deep Auto-encoders, and more.
Third, no place to turn to when one is stuck. That’s also not true. We have GitHub and it is definitely the go-to place. It is a clearinghouse for open-source software and hosts all kinds of projects. It is truly a great resource to turn to. Odds are an answer to your question is already there.
Now that we are ready to get started, there are two basic questions we need to address. First, which algorithm to deploy for the problem at hand? Second, is the approach scalable?
The best resource to turn to when deciding which algorithm to deploy is undoubtedly Kaggle. Kaggle is a platform where companies offer prize money to solve predictive modeling problems. Its live leaderboard encourages participants to continue innovating beyond existing best practice. Competitions on the Kaggle site regularly attract over a thousand teams and individuals. It was founded in 2010 and was acquired by Google in March 2017. To date, there are more than half a million registered users from 194 countries. Heavyweight participants include IBM Watson’s Jeopardy-winning team, Google’s DeepMind, and the like. In a nutshell, Kaggle tells us what works and what does not. Techniques Kaggle winners employ time and again include Boosted Trees (XGBoost) for classification problems and CNNs (Convolutional Neural Networks) for image analysis. Ensemble techniques are also deployed to provide a boost in performance, which oftentimes is just what’s needed to snatch first place.
Actually, Kaggle was inspired by the Netflix prize. Back in 2006, Netflix was renting out discs of movies and TV shows and needed to improve the accuracy of its movie recommendations. Netflix offered $1 million to anyone who could improve by 10% the predictive accuracy of Cinematch, its recommendation engine. That was an instant hit and people went crazy. Tens of thousands of participants started downloading the data, building models, and uploading their predictions. This crowdsourcing initiative was a godsend for Machine Learning as it funneled the energy of tens of thousands of people into solving the predictive problem and improving tools and techniques in the process. The competition went on for three years and the $1 million bounty was finally awarded in September 2009 to the BellKor’s Pragmatic Chaos team.
The Netflix problem reminded the community of the relevance of matrix decomposition. The people-movie rating matrix, which by the way is a sparse matrix as a person only rates a fraction of available movies, can be expressed as the product of two matrices: a people preference profile matrix and a movie profile matrix. Expressed this way, one can more easily infer the rating of a person for a yet unrated movie by applying the person’s preference profile to the movie profile of the yet unrated movie, using a simple dot product.  This  line of investigation led to several improvements in SVD (Singular Value Decomposition), the technique of choice for matrix decomposition, and resulted in more sophisticated versions of the SVD including the asymmetric SVD and SVD++.
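To make the decomposition concrete, here is a minimal sketch in plain Python. It is not the method of the Netflix winners: it fits the two profile matrices with simple stochastic gradient descent on the known ratings only, and the ratings, the number of latent factors, and the learning settings are all made-up assumptions for illustration.

```python
import random

def predict(P, Q, person, movie):
    """Predicted rating = dot product of a person profile and a movie profile."""
    return sum(p * q for p, q in zip(P[person], Q[movie]))

def factorize(ratings, n_people, n_movies, k=2, lr=0.02, reg=0.01, epochs=500, seed=0):
    """Learn person profiles P and movie profiles Q from the known ratings only."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_people)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_movies)]
    for _ in range(epochs):
        for (person, movie), rating in ratings.items():
            err = rating - predict(P, Q, person, movie)
            for f in range(k):
                p, q = P[person][f], Q[movie][f]
                # Gradient step on the squared error, with a small L2 penalty
                P[person][f] += lr * (err * q - reg * p)
                Q[movie][f] += lr * (err * p - reg * q)
    return P, Q

def rmse(P, Q, ratings):
    """Root mean squared error over the known ratings."""
    sq = [(r - predict(P, Q, u, m)) ** 2 for (u, m), r in ratings.items()]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical sparse ratings: (person, movie) -> rating on a 1-5 scale
ratings = {
    (0, 0): 5, (0, 1): 4, (0, 2): 1,
    (1, 0): 4, (1, 1): 5, (1, 3): 1,
    (2, 0): 1, (2, 2): 5, (2, 3): 4,
    (3, 1): 1, (3, 2): 4, (3, 3): 5,
}
P, Q = factorize(ratings, n_people=4, n_movies=4)
print("fit error on known ratings:", round(rmse(P, Q, ratings), 3))
print("person 0, unrated movie 3:", round(predict(P, Q, 0, 3), 2))
```

Person 0 never rated movie 3, yet the dot product of the two learned profiles still yields a prediction, which is the whole point of the decomposition.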
What about the scalability of the Machine Learning approach? For starters, many of the algorithms in the open-source platforms have already been deployed on very large data sets and scalability has not been an issue.  In the unlikely event that scalability is an issue, there are two additional avenues one can pursue. One is to employ a cloud-based solution such as AWS (Amazon Web Services) which leverages parallel computing. The other is to deploy specialized hardware. NVIDIA just launched its Titan V graphics card which is explicitly designed for deep learning. The GPU card can be installed on a regular PC and costs about $3,000 as of this writing.  In terms of performance, it is nine times faster than the previous generation at over 100 Teraflops with 12 GB of high-bandwidth memory.
A significant perk of getting involved in Machine Learning now is a large and active community. This means a constant supply of new tools and techniques. Let’s mention the most notable ones.
Word2Vec is a technique that comes from Tomas Mikolov at Google. It consists of representing a word by a vector, and that vector is inferred from the words that tend to appear near the word of interest. As a result, similar words such as “engine” and “motor” will have similar vector representations. What’s interesting is that we can use vector additions and subtractions to make semantic inferences. For instance, what is vector(Paris) - vector(France) + vector(Italy)? Well, it’s vector(Rome). In the same vein, if you guess that vector(Woman) - vector(Man) + vector(King) = vector(Queen), you’d be correct. This technique is just what we need to convey the meaning of a zip code: its population size, its socioeconomic profile, the presence of managed care, the influence of IDN’s and the like.
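The vector arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-made for the purpose (roughly royalty, maleness, femaleness) and are not trained Word2Vec embeddings, which typically have hundreds of dimensions. The sketch only shows the mechanics of the analogy lookup.

```python
import math

# Hand-made vectors, purely illustrative: (royalty, maleness, femaleness)
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("woman", "man", "king", vectors)
print(result)
```

The same lookup would apply to zip codes once each zip code has a vector of its own.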
More recently, Ian Goodfellow, currently with Google, came up with a novel architecture, the GAN (Generative Adversarial Network), which Yann LeCun, a prominent figure in deep learning, singles out as the most interesting idea in the last 10 years. The GAN consists of two neural nets that work against each other. One is the forger. It generates a forged copy of the real McCoy and tries to pass it off as the real thing. The other is the examiner and its job is to tell the fake apart from the real one. The system has learned when the forger produces fakes that are so good that the examiner can no longer call them out. This technique has been successfully deployed to create music, poems, prose, paintings, and even pictures of celebrities that do not exist.
Then there is the recent proposal of the capsule network from the legendary Geoff Hinton, now a VP with Google. The capsule extends the neuron by having it output not a scalar but a vector to encode richer information. The capsule network also uses a dynamic routing mechanism to move information between capsules. The capsule network is meant to replace the hugely successful convolutional neural network. It captures the hierarchical relationships between object parts, which slashes the error rate of the convolutional neural network by a whopping 45%. The capsule network needs to see far fewer pictures to accomplish the same image recognition task. Also, flipping the picture upside down or presenting it from an angle does not bother the capsule network at all. The convolutional neural network, by contrast, is completely taken aback.
Figure 1: Examples of Recent Breakthroughs in Machine Learning

Another technique worth mentioning has to do with visualization. We need 2D visualization because we cannot see what is in n-dimensional space. The task at hand is to preserve neighborhood locality, which means that points that are close to each other in n-dimensional space need to be close to each other in the projected 2D space. As for points that are far apart in n-dimensional space, we do not really care how far apart they are in 2D provided that they are not too close to each other. A new technique has emerged to accomplish this task and it is UMAP (Uniform Manifold Approximation and Projection). The claim is that it does better than the very successful t-SNE, a technique developed by Laurens van der Maaten and Geoff Hinton. UMAP uses a fuzzy topological structure to model the information in n-dimensional space.
Another advantage of jumping on the Machine Learning bandwagon is that your programs may show improvement in performance with little intervention on your part.  That’s because the platform where your programs reside is constantly updating its functionalities to catch up with the latest advancements. In TensorFlow, for instance,  if you replace the keyword “GradientDescentOptimizer” with “AdamOptimizer” when making a call to the optimizer, the training time of your model will be much shorter. What’s remarkable is this benign change in keyword belies a tremendous amount of work punctuated with a string of breakthroughs. Let’s take a closer look.
Optimization is about taking a series of steps from an initial spot to land on the optimum. Needless to say, the smaller the number of steps, the faster the algorithm. In plain-vanilla stochastic gradient descent, the size of the step is proportional to the slope, which means that on a plateau, the step size is very small and the algorithm very slow. If, however, we remembered the steepness of the slope we just rode down, we could use that momentum to move forward at a much faster clip. In other words, we could have the slope determine not the speed but the acceleration. Of course, we need to add some friction to ensure we do not exceed a terminal velocity, otherwise we’ll zoom past the optimum. That’s the idea behind momentum optimization. Interestingly, we can do better than that. Instead of taking the slope at the point where we currently are, we can take the slope at a point a little further away in the direction of the momentum. This idea works well because in general the momentum points in the direction of the optimum. This improvement is known as the Nesterov Accelerated Gradient.
Yet another strategy consists of fiddling with the learning rate. The learning rate controls the size of the step. Too small a learning rate and the algorithm takes forever. Too large a learning rate and the algorithm keeps overshooting the optimum and may never settle on it. AdaGrad uses the fact that the slope along one dimension may be steeper than the slope along another. It applies a decay factor to the learning rate and does so in such a way that the learning rate decays faster along dimensions of steep slope and slower along dimensions of gentle slope. Yet another improvement comes from RMSProp. AdaGrad accumulates the slope history of all the points that were visited, which can shrink the learning rate so much that the algorithm stalls before it reaches the optimum. RMSProp, by contrast, only remembers the most recent ones. This keeps the steps from vanishing and results in a speedup while not overshooting the optimum.
Finally, Adam, which stands for Adaptive Moment Estimation, leverages all these ideas and combines momentum optimization with RMSProp. It is arguably the best general-purpose optimization algorithm out there to date.
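To see what hides behind the change of keyword, here is a minimal sketch of three of the update rules written from scratch in plain Python. It is not TensorFlow code. The test function (a bowl that is gentle along x and steep along y), the starting point, and all the hyperparameters are assumptions chosen for illustration, and the gradient is exact rather than stochastic.

```python
import math

A, B = 1.0, 50.0  # assumed curvatures: gentle along x, steep along y

def loss(p):
    return 0.5 * (A * p[0] ** 2 + B * p[1] ** 2)

def grad(p):
    return [A * p[0], B * p[1]]

def gradient_descent(p, lr=0.01, steps=100):
    """Plain gradient descent: the step is proportional to the slope."""
    p = list(p)
    for _ in range(steps):
        p = [pi - lr * gi for pi, gi in zip(p, grad(p))]
    return p

def momentum(p, lr=0.01, beta=0.9, steps=100):
    """Momentum: the slope changes the velocity; beta acts as friction."""
    p, v = list(p), [0.0, 0.0]
    for _ in range(steps):
        v = [beta * vi - lr * gi for vi, gi in zip(v, grad(p))]
        p = [pi + vi for pi, vi in zip(p, v)]
    return p

def adam(p, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: momentum plus an RMSProp-style per-dimension scaling of the step."""
    p, m, s = list(p), [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(p)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        s_hat = [si / (1 - beta2 ** t) for si in s]
        p = [pi - lr * mh / (math.sqrt(sh) + eps)
             for pi, mh, sh in zip(p, m_hat, s_hat)]
    return p

start = [5.0, 5.0]
for name, optimizer in [("gradient descent", gradient_descent),
                        ("momentum", momentum),
                        ("adam", adam)]:
    print(name, loss(optimizer(start)))
```

On this particular bowl, momentum ends far closer to the optimum than plain gradient descent for the same number of steps. How the optimizers rank in general depends on the function and on the settings.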
Eight Key Problems that Lend Themselves to Machine Learning
Below are eight problems that are central to Commercial Analytics and lend themselves to Machine Learning. Indeed, they all satisfy the three conditions for Machine Learning. First, there is a good amount of data. Second, a pattern exists and it’s not random. Third, the alternative rule-based approach to capture the pattern becomes quickly unwieldy, putting a lid on further improvement. What’s more, we can expect to see very powerful solutions come out of this new approach. They may even transform Commercial Analytics as we know it.
Figure 2: Key Problems that Lend Themselves to Machine Learning
  1. Patient Identification – Which patients are most likely to use our drug? This is a key question in Rare Diseases and Oncology where patients are few and the cost of therapy per patient per year is very high. We already have a fair amount of data today: Syndicated Claims data, EMR data, SP/SD data, and GPO data. Syndicated Claims data is a great resource as it describes the sequence of interactions of the patient with the healthcare system including doctor visits, lab tests and results, drugs prescribed, drugs administered in the office, surgical and other procedures, hospitalizations, and the like. EMR data is also a great resource even if it does not identify the physician. It contains much richer data on the patient including line of therapy, lab results, vitals, family history, hospitalizations, physician notes, and the like. SP data is relevant because it provides a more complete view of our drug than syndicated data sources do. Also, it provides us with a yardstick to estimate the capture rate of competitive drugs in the syndicated data sources. Finally, GPO data informs us of drug usage in real time and can be used as site alerts for our reps to act upon.

  2. Physician Identification – Which physicians will prescribe our drug? This question is always relevant since more prescribing physicians means more revenue, regardless of the stage of the drug in the product life cycle.  Several factors determine if a physician will prescribe or not.  First, the physician must have patients in need of the product. Since the number of eligible patients goes up and down over time, the physician may come across as fickle, making it difficult to predict who will prescribe at a given time. Next, the patient must be able to afford the out-of-pocket costs and that is in part a function of the insurance plan of the patient. Then, the physician must not dislike the drug, which is shaped by past experience, habits, and profile. This is captured by the residency program, hospital affiliations, involvement in clinical trials, speaking engagements, KOL status, consulting work done on behalf of pharma companies, where the physician is on the innovator-laggard spectrum, volume of patient referrals, and so on. Last, the physician may not respond to the promotional message unless it is delivered through the right channel for that physician. By the way, another complicating factor has to do with the tacit ROI assumption. We do not want to identify all physicians that will prescribe, only physicians that will prescribe within a promotional budget. At any rate, relevant databases include patient level data, formulary access, physician profile, and channel preference.

  3. Promo Sequencing – Once we have established which physicians to target, the next problem to address is execution. Take Dr. John Smith. Which of the following two sequences is more impactful: (1) C, C, NP, S, E, L or (2) L, C, NP, C, E, S where C stands for Call with sample, NP for No Promotion, S for Sample, E for Email and L for Lunch? Actually, why limit ourselves to only those two sequences as there may be a third sequence of the same or lower cost that may be more effective? More generally, what is the optimal sequence for each physician given a promotional spend? There is one type of Machine Learning that works well for this type of problem and it is Reinforcement Learning. Reinforcement Learning sits in between Supervised and Unsupervised learning. In Supervised learning, there is a label or class for each example and our task is to find the label or class of a new example. In Unsupervised learning, there is no such thing as a label or class. There are only examples and they need to be clustered along similarities that are to be uncovered from the data. In Reinforcement Learning, there is no label for each example either, but the algorithm takes a sequence of actions and learns from the reward it eventually receives, which here would be the prescribing response of the physician to the sequence of promotions. Parenthetically, Reinforcement Learning is the workhorse algorithm behind AlphaGo, AlphaGo Zero, and the self-driving car. It’s what Google was after when it shelled out $500 million in Jan 2014 for DeepMind.

  4. Patient Response to Drug – Which drug will a patient respond to or show better response to? The converse is just as important: Which drug will a patient not respond to, not tolerate, or have an adverse event to? Either way, the underlying question is the same. Is there a patient profile for each type of response: Respond well, Respond, Do not respond, Do not tolerate, or Experience Adverse event.  Claims data is helpful as it describes the drugs the patient has been on including combination and concomitant therapies, diagnoses, surgeries and procedures, lab tests, hospitalizations, and the like.  The data is limited in that its description of the patient profile does not go beyond age, gender, ethnicity, and geography. EMR data offers a richer profile of the patient and is a great data asset to leverage. What’s more, it has physician notes which may come in handy. The best data source though is arguably clinical trials data.

  5. Disease Identification from Claims Data – Is this asthma or COPD? Type 1 or Type 2 diabetes? Bipolar or depression? Metastatic or not metastatic cancer? This question comes up whenever the drug has multiple indications or is used off-label. The diagnosis code can help resolve the matter but has its limitations. For starters, the diagnosis code may not be present in the claim.  Also, the claim may indicate a made-up diagnosis to ensure that the patient gets the drug. This administrative workaround is employed when the Payer will only reimburse the drug for a specific indication and that’s not the indication the physician had in mind. There are two business reasons that motivate the question. One has to do with Incentive Compensation. The typical scenario is the drug just got approved for a second or third indication. The sales force needs to direct its effort toward the newest indication, and, to that end, the pharma company rolls out an Incentive Compensation plan that only pays Reps for Rx’s written for the new indication. The other reason is profit sharing, typically, between a startup that owns the molecule and a big pharma company that has an army of reps to promote the drug. Since only one company does the promotion, a natural arrangement is to split sales based on indication.

  6. Line of Therapy Determination – This question comes up when we use claims data to figure out the patient journey, and that’s because claims data does not indicate when a line of therapy ends and another starts. It has to be inferred. This question also comes up with GPO data despite the fact that the GPO data indicates line of therapy. What the GPO calls first line of therapy may not be first line at all, simply the first time the GPO services the patient. The EMR is arguably the best source of line of therapy information although physicians do not always agree with each other, and this shows in the EMR data. Overall, there is consensus for the most part. Another approach is to get medical experts in a room, show them several examples, and have them articulate the rules that define lines of therapy. This may lead to heated debates but they’ll get the job done. Now, you do not know how much of these rules is shared by the larger medical community and how much is specific to your handpicked experts. That’s another reason Machine Learning is so appealing. As for the business questions that require a better understanding of lines of therapy, here are the common ones. What is the market share of our drug within a line of therapy? How fast do patients move through the different lines of therapy? If our drug is used in second line, who are the patients in first line that are most likely to move to second line and stand to benefit from our drug?

  7. Market Access – It is well known that about half of formulary changes have no impact whatsoever on the prescribing behavior of physicians, which means that the other half does. This unleashes a series of questions. What types of formulary changes are material: changes in tier that lead to a significant difference in co-pay, Prior Authorization, Step Therapy, NDC Block, Quantity Limit, etc.? Who are the Payers that are enforcing those changes and in which MSA’s (Metropolitan Statistical Areas)? In the traditional analytical approach, when we measure the impact of a change, we have to zero in on one change and assign the impact solely to that change. However, if we deploy a Machine Learning approach, the algorithm may factor in not only the change of interest but also changes that happened before, at the same time, and after the change of interest. By bringing in the context, the algorithm may more accurately predict the impact of the change we are contemplating through contracting with the Payer. Also, the Machine Learning approach will pick up spillover should there be spillover as it will be looking at the larger picture. The relevant data sources include patient-level data, physician-level prescriptions, and formulary changes.

  8. Shipment Optimization at the SP – SP’s face a major problem and that’s costing pharma companies a lot. Indeed, SP’s need to get approval from the Payer before they can ship the product to the patient. This approval process is very slow. On average, the time between writing of the prescription and shipment of the drug is in excess of 30 days. Patients cannot wait that long, so many abandon the prescription or end up using a different drug, resulting in significant loss in sales. Now, if the SP could predict which requests the Payer will grant, the SP could skip the wait and ship the drug right away.  That would solve both the abandonment and switch-away problem. Why not use a Machine Learning algorithm to sort out which requests will be approved and which requests will be rejected? For starters, the SP has lots of data regarding which requests were approved, rejected, approved after the rejection is overturned, and rejected for good. For each of these cases, the SP has information on the patient, the physician, the payer, the insurance plan, the prescription, and so on. Of course, there will be false positives. From time to time, the SP will be left holding the bag. It would have shipped the drug to the patient and the Payer would subsequently deny reimbursement. This prompts us to ask if this early shipment strategy will work. To be sure, it will if the accuracy of the algorithm is such that the new incremental sales dwarf the losses incurred by the false positives. To play it safe, the SP could choose to ship early only to patients where the probability of reimbursement is extremely high.
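The trade-off in the early shipment strategy of problem 8 can be written down as a simple expected-value calculation. The sketch below is only an illustration under stated assumptions: the margin earned on a reimbursed shipment, the cost the SP absorbs when the Payer denies reimbursement, and the abandonment rate of patients who have to wait are all hypothetical numbers, and the probability of approval is assumed to come from a classifier trained on the SP’s history of requests.

```python
def expected_gain(p_approve, margin, drug_cost, p_abandon):
    """Expected incremental profit of shipping now rather than waiting for the Payer."""
    # Ship now: earn the margin if approved, absorb the drug cost if denied
    ship_now = p_approve * margin - (1 - p_approve) * drug_cost
    # Wait: earn the margin only if approved and the patient has not abandoned
    wait = p_approve * (1 - p_abandon) * margin
    return ship_now - wait

def break_even_probability(margin, drug_cost, p_abandon):
    """Approval probability above which early shipment pays off on average."""
    return drug_cost / (drug_cost + margin * p_abandon)

def should_ship_early(p_approve, margin, drug_cost, p_abandon, safety_floor=0.0):
    """Ship early only if it pays off and the probability clears an optional floor."""
    gain = expected_gain(p_approve, margin, drug_cost, p_abandon)
    return p_approve >= safety_floor and gain > 0

# Hypothetical economics for one prescription
MARGIN, DRUG_COST, P_ABANDON = 2000.0, 1000.0, 0.3
print(break_even_probability(MARGIN, DRUG_COST, P_ABANDON))
for p in (0.5, 0.9, 0.99):
    print(p, should_ship_early(p, MARGIN, DRUG_COST, P_ABANDON))
```

Setting safety_floor to a high value, say 0.95, is one way to implement the play-it-safe option of shipping early only when reimbursement is extremely likely.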
Lessons from Machine Learning Projects
What have we learned from the Machine Learning projects we’ve done? Four things. First, data is king. You will not get very far no matter how hard you try if you do not have the right data. Invest in getting the best data for the job. Second, do not underestimate feature engineering. Feature engineering unpacks information that is already available in the data. By making explicit what is implicit, it increases the predictive power of the classifier. This point is not fully appreciated though. That’s because it is very tempting to embrace the romantic belief that if the information is in the data, somehow the algorithm will ferret it out. We wish that were true. Third, the algorithm. There are potentially several algorithms one may deploy for the task. There is no one algorithm that is good for all instances of a problem, otherwise there would be just one. Be open to the possibility that the best algorithm may not be your favorite and can even be one that you consider sub-par. In sum, explore and only then pick the algorithm. Fourth, beware of MINA! That’s our acronym for Missing Is Now Absent. We’ll explain why it is so treacherous.
A. Data is King
Data is crucial because it is at the heart of how Machine Learning operates. In an expert system, for instance, we impart knowledge to the system by defining if-then rules that the system follows to draw inferences or take action when presented with a new situation. In machine learning, by contrast, it is up to the system to figure out the rules it needs to deploy when presented with a new situation. That’s why it needs to see a lot of data. Obviously, the more data the better. Here is an example.

The task at hand here is to predict the prescribing behavior of physicians given their profile: age and gender, school attended, residency program, size of group practice, privileges in reputable hospitals, allegiance to pharma companies, indifference to drug pricing, role in patient referrals, star power as measured by paid-for trips, etc. We used boosted trees and got the AUC (Area Under the Curve) to a very respectable 0.8.

Now, we all know that the Rx behavior of the physician is also contingent upon the behavior of the patients.  For sure, the physician needs to put pen to paper but unless the patient hands over her money to the pharmacist, the prescription is not filled.  It dawned on us that what was missing is a database that describes patient behavior at an aggregate level, which led us to develop a Panoramic Contextual database. It captures a whole array of dynamics that influence the prescription filling behavior of patients at the zip level and higher. They include:
  1. Leading indicators of disease (cancer, cardiovascular, asthma, arthritis, mental health, COPD, CKD, etc.)
  2. Incidence of Cancer (breast, cervix, leukemia, NHL, pancreas, prostate, bladder, thyroid, etc.)
  3. Exercise and fitness level (Fitbit data, fruit and vegetable consumption, etc.)
  4. Habits (hours of TV watching, soft drink consumption, smoking, e-cigarette, binge drinking, etc.)
  5. Health Awareness (PAP smear, dentist visits, loss of teeth, etc.)
  6. Education level (high school, associate degree, college, etc.)
  7. Use of digital devices (computer, internet, etc.)
  8. Taxes (gross income, taxable income, expected refund, etc.)
  9. Crime (armed robbery, burglary, rape, arson, embezzlement, larceny, etc.)
  10. Pollution (SO4, SO2, NO3, HNO3, NH4,  Mg, Na, Ca, K, Cl, etc.)
  11. Climate (UV Exposure, Precipitation)
  12. Political Leaning (presidential voting results)
  13. Business Presence (number and size of employers)
  14. Insurance coverage (e.g., Medicare Enrollment)
  15. Road Traffic and Commuter Stress index

The enriched model now has access to both the profile of the physician and the aggregate dynamics of the patients of the physician to predict the prescribing behavior of the physician. We kept the Boosted Trees just as before and saw the AUC zoom past 0.9.  Such a significant boost in performance is compelling evidence that it is worth investing in the data.
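For readers less familiar with the metric, the AUC can be read as the probability that the model scores a randomly chosen prescriber higher than a randomly chosen non-prescriber. The sketch below computes it directly from that definition. The labels and scores are made up for illustration and are not the results of the project described above.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count for half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = prescribes, 0 = does not prescribe
labels         = [1,   1,   1,   0,   0,   0]
scores_base    = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]  # made-up scores, baseline model
scores_richer  = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # made-up scores, enriched model

print(auc(labels, scores_base))
print(auc(labels, scores_richer))
```

An AUC of 0.5 means the model ranks no better than chance and 1.0 means every prescriber is ranked above every non-prescriber, which puts the move from 0.8 to above 0.9 in perspective.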
B. Feature Engineering
You have identified and leveraged all the relevant data assets you can lay your hands on. And still, the predictive power of your model lags behind. Somehow the model is not hitting on all cylinders. What’s wrong? Well, there may be an issue with feature engineering. 
Feature engineering is about making explicit what is implicit in the data. It unpacks information that is already available in the data through the creation of new variables from existing variables. Here is a colorful example from Kaggle. In one of the competitions, the task was to predict which car has the highest resale value. At first blush, anything could be a predictor: make and model, year, price of new, mileage, horsepower, weight, height, color, diameter of wheels, built-in GPS, AWD, and the list goes on. It turns out that the best predictor is the color of the car, but with a twist. Indeed, it has to be an unusual color for that type of car. If all medium-size sedans are, say, white, then yellow would do it. Rationale? People who purchase cars of unusual colors tend to be car buffs and they take very good care of their cars. As a result, the car is in such good condition that it fetches a handsome price at resale. For the record, this unusual color feature won first place. Note that this feature, namely, unusual color for the type of car, is akin to a standard deviation relative to a subset of the database.
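Here is a minimal sketch of how such an unusual-color feature could be derived. It is our own illustration, not the code of the winning Kaggle entry: rarity is defined, by assumption, as one minus the share of cars in the same segment that have the same color, and the small inventory is made up.

```python
from collections import Counter

def color_rarity(cars):
    """For each car, 1 - share of cars in its segment that have the same color."""
    per_segment = Counter(car["segment"] for car in cars)
    per_segment_color = Counter((car["segment"], car["color"]) for car in cars)
    return [
        1 - per_segment_color[(car["segment"], car["color"])] / per_segment[car["segment"]]
        for car in cars
    ]

# Hypothetical inventory
cars = [
    {"segment": "sedan", "color": "white"},
    {"segment": "sedan", "color": "white"},
    {"segment": "sedan", "color": "white"},
    {"segment": "sedan", "color": "white"},
    {"segment": "sedan", "color": "yellow"},
    {"segment": "sports", "color": "yellow"},
    {"segment": "sports", "color": "yellow"},
]
for car, rarity in zip(cars, color_rarity(cars)):
    print(car["segment"], car["color"], round(rarity, 2))
```

Yellow scores high for a sedan but zero for a sports car in this toy inventory, which is exactly the relative-to-a-subset twist: the raw color column alone does not carry that information.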
Let’s go back to the problem of predicting the prescribing behavior of physicians, and discuss a few feature engineering examples.

Say we are looking at an expensive drug. We’d like to have a predictor variable that captures the insensitivity of the physician to drug pricing. To that end, we can look at all the drugs the physician writes, rank order them by price, and look at, say, the 80th percentile. If that price is high, we can conclude that the physician is insensitive to drug prices.  Another approach is to look at the share of branded drugs relative to generics.
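Both price-insensitivity features can be sketched in a few lines. The prescription records, the field names, and the choice of a nearest-rank percentile are assumptions made for illustration.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def price_features(prescriptions, q=80):
    """Price at the q-th percentile and share of branded scripts for one physician."""
    prices = [rx["price"] for rx in prescriptions]
    branded = sum(1 for rx in prescriptions if rx["branded"])
    return {
        "price_p%d" % q: percentile(prices, q),
        "branded_share": branded / len(prescriptions),
    }

# Hypothetical prescriptions written by one physician
prescriptions = [
    {"price": 10, "branded": False}, {"price": 15, "branded": False},
    {"price": 20, "branded": False}, {"price": 30, "branded": False},
    {"price": 40, "branded": False}, {"price": 60, "branded": True},
    {"price": 80, "branded": True},  {"price": 120, "branded": True},
    {"price": 900, "branded": True}, {"price": 1500, "branded": True},
]
features = price_features(prescriptions)
print(features)
```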

The reluctance of a physician to prescribe a drug may have to do with the physician’s financial involvement with other pharma companies, which as we know, is described in the Open Payments database (Sunshine Act). By looking at payments a physician receives from pharma companies, we can develop an allegiance index that indicates if the physician is strongly tied to one company or is open to developing new relationships with other companies.
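One possible way to define such an allegiance index, offered purely as an illustration, is the concentration of payments across companies: the sum of the squared shares of each company in the total payments the physician received. The company names and amounts below are hypothetical.

```python
def allegiance_index(payments):
    """Sum of squared payment shares by company; 1.0 means a single paying company."""
    total = sum(payments.values())
    if total == 0:
        return 0.0  # no payments at all, so no allegiance to speak of
    return sum((amount / total) ** 2 for amount in payments.values())

# Hypothetical yearly payments by company for two physicians
tied_to_one = {"Company A": 9000, "Company B": 1000}
spread_out = {"Company A": 2500, "Company B": 2500, "Company C": 2500, "Company D": 2500}

print(allegiance_index(tied_to_one))
print(allegiance_index(spread_out))
```

The index ranges from 1 divided by the number of paying companies, when payments are evenly spread, up to 1 when all payments come from a single company.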
It’s always helpful to know who the sought-after physicians are.  One way to do so is to look at the number of trips a physician takes on pharma’s dime, and even at a breakdown of these trips by in-town, domestic, and international.  Looking at year-on-year changes, we can also define features that describe how the star power of the physician is trending: rising, falling, steady, or wobbly.
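The trend feature can be derived from yearly trip counts along these lines. The 10% tolerance and the exact rules that separate the four labels are assumptions chosen for illustration; in practice they would be tuned to the data.

```python
def trend_label(yearly_counts, tolerance=0.10):
    """Label a series of yearly counts as rising, falling, steady, or wobbly."""
    # Relative year-on-year changes; max(previous, 1) avoids division by zero.
    changes = [
        (current - previous) / max(previous, 1)
        for previous, current in zip(yearly_counts, yearly_counts[1:])
    ]
    if all(abs(change) <= tolerance for change in changes):
        return "steady"
    if all(change >= -tolerance for change in changes):
        return "rising"
    if all(change <= tolerance for change in changes):
        return "falling"
    return "wobbly"

# Toy trip counts over four consecutive years.
labels = [trend_label(series) for series in ([2, 4, 7, 9], [9, 6, 4, 2],
                                             [5, 5, 5, 5], [3, 8, 2, 9])]
```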
Another great source of data for feature engineering is patient referrals. Looking at the data as a graph where nodes represent physicians and arcs represent referrals between physicians, we can establish how well a physician is connected to other physicians.  Indeed, there is a whole host of centrality measures that we can deploy including degree, PageRank, eigenvector, closeness, betweenness, etc.
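Two of these centrality measures, in-degree and PageRank, can be sketched in plain Python as follows. The referral list, the physician identifiers, and the damping factor of 0.85 are assumptions for illustration; a graph library would normally be used on real data.

```python
def in_degree(referrals):
    """Count incoming referrals for every physician in the edge list."""
    counts = {node: 0 for edge in referrals for node in edge}
    for _, target in referrals:
        counts[target] += 1
    return counts

def pagerank(referrals, damping=0.85, iterations=50):
    """PageRank by power iteration; rank held by nodes with no outgoing arcs is spread evenly."""
    nodes = sorted({node for edge in referrals for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in referrals:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Toy referral arcs (referring physician, receiving physician).
referrals = [("pcp1", "spec1"), ("pcp2", "spec1"), ("pcp3", "spec1"), ("pcp3", "spec2")]
degrees = in_degree(referrals)
ranks = pagerank(referrals)
best_connected = max(ranks, key=ranks.get)
```

Each measure becomes one more column in the physician feature table, alongside the price and payment features.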

C. Algorithm
The choice of the algorithm should only be of concern once we are done with data acquisition and feature engineering.  In other words, we are fully satisfied that we are deploying the best data assets for the job. Also, we have leveraged our domain expertise to the full and have made explicit all the key features that the model may need to do its job.  Only then should we turn our attention to algorithm selection.

There are indeed quite a few algorithms to choose from. If we had to pick one right away, we’d probably start with Boosted Trees. In recent years, Boosted Trees have won more Kaggle competitions than any other technique. Before that, the reigning king was Random Forests, and that’s a good choice too. Before that, it was SVMs (Support Vector Machines), and that’s not a bad choice either.

The fact of the matter is that each algorithm splits the n-dimensional feature space differently as it goes about separating the subjects to classify. Since the problem at hand distributes those subjects in a very particular configuration in space, one algorithm is bound to work better than the others. The issue is that this algorithm may not be your favorite one, say, the multilayer perceptron (MLP). It may even be an algorithm that you consider to be inferior (e.g., Naive Bayes) or not sophisticated enough (e.g., logistic regression) or of a different flavor than the one you are familiar with (e.g., kernelized SVM with radial basis functions). When that’s the case, you will have missed the winning algorithm.
A study conducted recently at UPenn by Olson et al. compared the performance of 13 algorithms on 165 publicly available biomedical classification problems. Here is the finding. The top three algorithms are Boosted Trees, Random Forests, and SVMs. The bottom three algorithms are variations on Naive Bayes: Bernoulli, Gaussian, and Multinomial.  Also, each of the 13 algorithms came out on top for at least one of the 165 problems, which means that even the “worst” algorithm (based on overall ranking) turned out to be the best for some problem. What if that were the problem you were solving? Since you may not know ahead of time which algorithm is going to be the winner, a good policy may be to drop your prejudice and give all of them a chance.

As for the problems we worked on, and we did work on quite a few, there was always an algorithm that did better than the others. However, not by much. When evaluating a model, we follow a procedure known as n-fold cross-validation. Here is how it works. Say we are looking at 10,000 subjects and n is 10. We first pull out the first 1,000 subjects (1 to 1,000) and train the model on the remaining 9,000 subjects. We test the model on the 1,000 subjects that we pulled and that’s one score. Next, we pull out the next 1,000 subjects (1,001 to 2,000) and train the model on the remaining 9,000 subjects. We test the model on the 1,000 subjects that we pulled and that’s the second score. We repeat this process 10 times and take the worst of the 10 scores to be the score of the algorithm. Here’s what we observed. For a couple of folds, and sometimes several, the fold score of a runner-up algorithm is better than some of the fold scores of the winning algorithm.
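The procedure just described can be sketched as follows. The toy data, the threshold "model", and the function names are assumptions made to keep the example self-contained; any training and scoring functions can be plugged in. The sketch assumes the number of subjects is divisible by the number of folds and, as in the text, keeps the worst fold score (the average across folds is another commonly reported summary).

```python
def cross_validate(subjects, labels, n_folds, train, score):
    """Hold out each contiguous fold in turn, train on the rest, and score the held-out fold."""
    fold_size = len(subjects) // n_folds
    fold_scores = []
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        test_x, test_y = subjects[start:stop], labels[start:stop]
        train_x = subjects[:start] + subjects[stop:]
        train_y = labels[:start] + labels[stop:]
        model = train(train_x, train_y)
        fold_scores.append(score(model, test_x, test_y))
    return fold_scores

def train_threshold(xs, ys):
    """Toy model: a cut-off halfway between the mean of each class."""
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def accuracy(threshold, xs, ys):
    """Share of subjects whose predicted class (x above threshold) matches the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# 10,000 toy subjects with one feature; the label is 1 when the feature is 50 or more.
subjects = [i % 100 for i in range(10_000)]
labels = [1 if x >= 50 else 0 for x in subjects]

fold_scores = cross_validate(subjects, labels, 10, train_threshold, accuracy)
algorithm_score = min(fold_scores)  # worst fold, as in the procedure above
```

Keeping the individual fold scores, not just the summary, is what makes it possible to compare algorithms fold by fold.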

Of course, the ideal is to identify the best algorithm for the job. The truth of the matter is that even if you miss and pick the second or third algorithm, things are not that bad. What this suggests is that you may be better off investing more time and energy in data deployment and feature engineering than sweating over algorithm selection.

D.  Beware of MINA
MINA is an acronym we coined for “Missing Is Now Absent” to refer to a phenomenon that can wreak havoc in Machine Learning models. It is best explained with examples.
Figure 3: Problems Arise When What Was Missing in the Data Is Now Absent in the Real World

You observe over time that when a colleague is sluggish before lunch, the colleague is energetic in the afternoon. The next time you see a sluggish colleague in the morning, you predict that the colleague will be energetic in the afternoon. And you are right. One day, to your surprise, all your sluggish colleagues look drowsy in the afternoon. What happened? You discover that the coffee machine is broken and understand that the afternoon source of energy has been disrupted. Here’s the point. The fact that the coffee machine is missing in your mental model is immaterial so long as it is there in real life. Problems start the day the coffee machine is absent in real life. Indeed, things take a different turn and you cannot explain why.

Here’s another example from the triage of pneumonia patients in the ER. The data suggests that patients who have pneumonia and asthma do extremely well and patients who have pneumonia but not asthma do just fine. As a result, the recommendation of the machine learning algorithm is to de-prioritize patients who have asthma.  That’s actually a very bad idea. The reason patients who have pneumonia and asthma do extremely well is that they are high-risk patients and, as a result, are given special care. What’s causing the great outcome is the care, not the asthma. The algorithm’s recommendation misses the point and suggests getting rid of the special care. That’s because care has been missing in the data all along.
In both cases, something momentous happened. The cause has disappeared in real life (coffee machine, special care) along with its implications. But to the data, nothing has changed. The model does not know about the change since the cause was never captured.  As a result, the algorithm makes the same prediction as before, but this time it is off. (Figure 3)
The fix? Explain the prediction. Why are my colleagues so full of energy in the afternoon? Why do the pneumonia and asthma patients do so well? If the subject-matter expert cannot explain the recommendation based on a description of the situation as captured in the data, something important is missing (coffee machine, special care). In that case, we should refrain from following the recommendations of the Machine Learning model. Better safe than sorry.
As discussed throughout this paper, now is a great time to get started with Machine Learning. The field is making progress by leaps and bounds. There is a vibrant community of practitioners across virtually all verticals. There are several open-source platforms to choose from and countless resources to turn to. There are also cloud-based solutions and specialized hardware should you require serious scalability.  What’s more, it’s still early morning on pharma’s clock and there are countless opportunities to seize.
Be ready for challenges. If what you read makes you feel you are lagging behind, take the write-up with a pinch of salt.  Many who write about Machine Learning are not practitioners and have not wrestled with the myriad of problems that bedevil the task. So, they naturally paint a rosy picture and even though they do not mean to mislead, they do.  Think about it. Who, apart from the practitioner, wants to hear an exposé of challenges and nuances that can only blur an otherwise perfect picture?
About the Author
Jean-Patrick Tsang is the Founder and President of Bayser, a Chicago-based consulting firm dedicated to pharmaceutical sales and marketing. JP has worked on 250+ projects to date including ROI optimization, data strategy, and segmentation & targeting. For the last few years, JP and his team have been focusing on Predictive Analytics using Machine Learning. JP publishes and gives talks on a regular basis and runs one-day classes on various subjects related to data and analysis. In a previous life, JP deployed Artificial Intelligence to automate the design of payloads for satellites and was the adviser of two PhD students.

JP holds a Ph.D. in Artificial Intelligence from Grenoble University and an MBA from INSEAD in France. He was also the Recipient of the PMSA Lifetime Achievement Award in 2015. He can be reached at (847) 920-1000.