Phrasing business questions properly for data science can be tricky. Incorporating time into the way questions are posed is a helpful way to get precise about business metrics.
A current state-of-the-art automated machine learning platform consumes a dataset and takes care of:
- Deciding what kind of problem you are dealing with (regression, classification, clustering, time-series analysis, etc.)
- Deciding what algorithms are relevant for that problem type
- Training and tuning those algorithms
- Evaluating how well those algorithms perform
- Putting the best performing algorithm into production
With so much of the process automated, you might be wondering: what is left for the user to do? The answer is simple: the user needs to ask the business question.
Data scientists, like all scientists and mathematicians, are familiar with the importance of posing questions in a precise manner. In business, however, language may be vague or ambiguous. It is not uncommon for conversations to take place around, say, customer lifetime value (CLV), without first providing an exact mathematical definition of CLV. Such a definition might not exist in some organizations or, more confusingly, several inconsistent definitions might be in use. As in everyday life, a lot can be accomplished in business with imprecise conversations because people are masters of disambiguation, inferring from context, and filling in the blanks.
Conversely, when using machine learning to answer business questions we are solving a math problem which, like any math problem, must be stated precisely. When experienced data scientists are tackling such projects, this is taken for granted. Currently, however, practical machine learning is being revolutionized with high-level tools that increasingly automate data preparation and model production.
This revolution has vastly expanded and will continue to expand the circle of machine learning practitioners. Thus, the rise of the Citizen Data Scientist: traditional business analysts, data engineers, and even executives are now leveraging high-level libraries or automated machine learning platforms to consume their data, train machine learning models, and put them into production. This has allowed companies to either dramatically scale their existing data science teams or easily (and cheaply) start a data science practice from scratch.
While having technology that allows non-experts to easily create and deploy machine learning models is a fantastic and powerful thing, we must, as with any powerful technology, use it with care.
Machine learning algorithms don’t know the business question you are trying to answer; they simply find patterns in the data you feed them and make predictions based on those patterns. It is the business user’s responsibility to define a question precisely and to aggregate a dataset that supports that question. While helping customers adopt our automated machine learning platform, I have seen enough of their business analysts evolve into citizen data scientists to identify common pitfalls in that journey. These pitfalls are always related to the way business questions are stated: they are not stated precisely enough.
Furthermore, I have noticed that, by far, the most common way in which this question-posing is imprecise is related to not incorporating time into the question. By incorporating a time-frame into the way business questions are asked, we are more likely to ask precise questions that machine learning can shed light on. Here I give practical advice on how to do this. Instead of being general and abstract, I will give a couple of concrete examples. The hope is that this provides a good sense of how to nail this for other problems down the road.
Customer Lifetime Value: A Regression Example
A common business concern of our partners is increasing their customer lifetime value. This is a perfect example of a business problem that machine learning can shed light on. At first glance, we might reason that a machine learning model trained on historic customers might learn to predict customer lifetime value and give a sense of what features drive that prediction. We might define CLV as the total amount of money a person brings in during their history as a customer. We could then put together a spreadsheet with historic information from all customers (past and present), something like this:
In this dataset, we have one row for each customer, and each column gives a relevant feature describing that customer: their customer ID, gender, age, the date when they became a customer, their zip code, the number of purchases they have made, and their total monetary spend. We could define the total monetary spend of a customer (Total_Spend) as CLV, feed this dataset to an automated machine learning platform, and have it learn to predict Total_Spend from these examples. When new customers are acquired in the future we could use that trained algorithm to predict Total_Spend and get a sense of how much monetary value they will provide during their customer life.
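As a minimal sketch in pandas, the naive dataset above could be assembled by aggregating a raw purchase log per customer and joining it to a profile table. The table layouts and values here are purely illustrative; only the column names come from the description above:

```python
import pandas as pd

# Illustrative purchase log (hypothetical data): one row per purchase.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Amount": [50.0, 30.0, 40.0, 25.0],
})

# Customer profile table, using the column names from the text.
profiles = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "Age": [34, 27, 41],
    "Start_Date": pd.to_datetime(["2020-01-10", "2021-06-01", "2021-08-20"]),
    "Zip": ["02139", "10001", "94110"],
})

# Naive aggregation: every customer is included, regardless of tenure.
agg = transactions.groupby("CustomerID").agg(
    NBR_Purchases=("Amount", "size"),
    Total_Spend=("Amount", "sum"),
).reset_index()

dataset = profiles.merge(agg, on="CustomerID")
print(dataset)
```

Note that nothing in this aggregation accounts for how long each customer has been around, which is exactly the pitfall discussed next.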
There are several problems with this approach. Think about it: the dataset we put together for this use case may include people who have been customers for one day, one month, or one year. The Total_Spend column in this dataset does not reflect the total money a customer will provide during their lifetime; it is the total money they have provided to date. Also, a customer that is one day old might have the characteristics of a stellar customer (maybe it’s winter and they live in a region where people of their age and gender tend to buy a lot of our products when it’s cold), but because they just became a customer yesterday, they have only made one purchase and have not yet spent much money. By including them in the training dataset, we are incorrectly teaching our machine learning algorithm that they are the type of customer who does not bring in much money.
We might have a one-month-old customer that has been ordering products 3 times a week, totaling 12 purchases. Another customer that has been around for a year and purchasing once per month might have spent the same amount of money. Our machine learning algorithm would put these two customers on equal footing in terms of CLV, when in reality the one-month-old customer might be significantly more valuable in the long run.
To avoid these pitfalls, we need to get precise about how we define CLV and about how we prepare a dataset for the problem. A good way to do this is to think about incorporating time into our definition. We could, for example, choose to define first-year value (FYV) as the total money a customer spends in their first year as a customer. We could then decide to use a customer’s behavior during their first 3 months, say, as features to predict their total spend over their first year. FYV is a precise definition of a metric of interest that incorporates time (we only look at a one-year timeframe for each customer). The advantage of creating such a precisely defined metric is that it puts all examples from our training dataset on equal footing. Note that since we are now looking at the total money people spent during their first year as customers, we must limit our training dataset to customers that have been around for at least one year. Now we can prepare a dataset that looks like this:
Here, each row represents a customer that has been around for at least a year. The columns include features that describe the customer at the moment they were acquired (CustomerID, Gender, Age, Start_Date, Zip) as well as features that represent the customer’s activity during a chosen timeframe, such as the number of purchases they made in their first 3 months (NBR_Purchases_3mths) and their total monetary spend in their first 3 months (Total_Spend_3mths). The target column (Total_Spend_1yr) represents the total money they spent in their first year; this is what we call first-year value (FYV) and what we will teach our machine learning algorithm to predict.
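This preparation can be sketched in pandas. The sketch assumes the raw data is a purchase log plus a customer table; all names and values are illustrative, and 90-day/365-day cutoffs stand in for the 3-month and 1-year windows:

```python
import pandas as pd

# Illustrative purchase log (hypothetical data): one row per purchase.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "Purchase_Date": pd.to_datetime(
        ["2020-01-10", "2020-03-02", "2020-08-15",
         "2020-02-01", "2020-02-20", "2021-06-01"]),
    "Amount": [50.0, 30.0, 20.0, 40.0, 60.0, 25.0],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Start_Date": pd.to_datetime(["2020-01-10", "2020-02-01", "2021-06-01"]),
})

today = pd.Timestamp("2021-09-01")

# Keep only customers with at least one full year of history.
eligible = customers[customers["Start_Date"] <= today - pd.DateOffset(years=1)]

df = transactions.merge(eligible, on="CustomerID", how="inner")
df["age_days"] = (df["Purchase_Date"] - df["Start_Date"]).dt.days

# First-3-months behavior becomes the features...
first_3m = df[df["age_days"] < 90]
features = first_3m.groupby("CustomerID").agg(
    NBR_Purchases_3mths=("Amount", "size"),
    Total_Spend_3mths=("Amount", "sum"),
)

# ...and first-year spend (FYV) becomes the target.
first_yr = df[df["age_days"] < 365]
target = first_yr.groupby("CustomerID")["Amount"].sum().rename("Total_Spend_1yr")

training = features.join(target).reset_index()
print(training)
```

Because of the eligibility filter, a customer acquired less than a year ago simply never makes it into the training set, so every row is measured over the same one-year window.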
Notice how we are now asking a very precise question that is defined within a timeframe. We are predicting how much money a customer will bring in during their first year, based on their behavior during their first 3 months. It is up to a business analyst to use their valuable domain expertise and pick time frames that make sense for their industry. Knowing what business question to ask, being able to frame it in this precise way, and aggregating a training dataset where all examples are on equal footing (where we are, so to speak, comparing apples to apples) is the valuable contribution a citizen data scientist brings to the table.
Customer Retention: A Classification Example
Customer retention is another common use case that is a good candidate for machine learning. For a company that offers a subscription-based model, for example, we might go back and label all past and current customers as having either canceled their subscription (“churned”) or not. We could put together a spreadsheet that looks like this:
Here, each row would represent a unique customer and the columns would represent different features describing that customer. The last column could be our target: a binary column specifying if the customer has canceled their subscription (Yes or No). We could train a machine learning algorithm on this dataset to predict if any given customer will churn.
Again, this approach has problems similar to those we discussed in the regression example. First of all, this dataset may include customers with wildly different tenure lengths. We are comparing apples to oranges by comparing new and old customers, and for the customers that have not canceled, we have no information about whether or not they will cancel down the road. A newly acquired customer may have all the characteristics of a terrible customer (maybe we know that males in their twenties who don’t buy much in their first month tend to cancel their subscription soon after), but because they are fairly new and haven’t canceled yet, we are training our machine learning algorithm to associate those characteristics with a good customer who has not canceled.
As before, the way to avoid these pitfalls is to get precise about how we define churn and about how we prepare a dataset for the problem. Let’s consider incorporating time into the question. We could choose to study which customers are going to cancel their services within their first 6 months. We may, for example, use their behavior during their first customer month to predict whether or not they will churn within the first 6 months. Now we have a precise way of defining customer churn, a way that incorporates a timeframe. We could aggregate a dataset somewhat like this:
Here each row represents a customer, but now we only include customers that have been around for at least 6 months, and for all of them we use their number of purchases and total spend during the first month to predict whether or not they churned within their first 6 months. For the purposes of this question, it is irrelevant whether or not they churn after their first 6 months; our target column only tells us whether or not they canceled their subscription in their first 6 months. Now we have a training dataset where all rows are on equal footing and in which we are comparing apples to apples. Once we train a model on this dataset, we can take any new customer that has been around for at least one month and use their first-month behavior with our trained model to predict whether or not they will churn during their first six months.
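The labeling logic can be sketched in pandas along the same lines. The columns and values are hypothetical; Cancel_Date is assumed to be empty (NaT) for customers who never canceled:

```python
import pandas as pd

# Hypothetical customer records: subscription start, optional cancel date,
# and first-month behavior features.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Start_Date": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2021-08-01"]),
    "Cancel_Date": pd.to_datetime(
        ["2020-04-15", pd.NaT, "2021-01-10", pd.NaT]),
    "NBR_Purchases_1mth": [1, 6, 4, 2],
    "Total_Spend_1mth": [10.0, 120.0, 55.0, 30.0],
})

today = pd.Timestamp("2021-09-01")

# Only customers whose first 6 months are fully observed can be labeled.
observed = customers[
    customers["Start_Date"] <= today - pd.DateOffset(months=6)
].copy()

# Churned = canceled within 6 months of Start_Date. A cancellation after
# that window (customer 3 here) counts as "No" for this question.
window_end = observed["Start_Date"] + pd.DateOffset(months=6)
observed["Churned_6mths"] = (
    observed["Cancel_Date"].notna() & (observed["Cancel_Date"] < window_end)
)
print(observed[["CustomerID", "NBR_Purchases_1mth",
                "Total_Spend_1mth", "Churned_6mths"]])
```

Note how the 6-month window does double duty: it filters out customers we cannot yet label, and it turns an open-ended "will they ever churn?" question into a well-defined binary target.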
Getting a sense of how to ask business questions in a precise and appropriate way so that they can be tackled by machine learning is something that comes with practice, but seeing examples of how to do this in good and bad ways is helpful when getting started in machine learning for business applications. If you are unsure about how to frame your business questions for machine learning, consider incorporating a time frame into the definition of your business metrics; this strategy often goes a long way. Happy model building!