How I Used a Market Basket Analysis to Get a Job Offer

Time to read:

28 minutes

During a Senior Data Analyst job interview I was asked to carry out a take home data analysis task and present my findings…

The task: identify accounts that were at risk of cancelling their subscription. Or, in other words, churn analysis.

I used predictive analytics to solve this. I carried out a bit of machine learning called association rules mining. Or market basket analysis.

This was my first commercial opportunity to carry out predictive analytics.

There were some challenges along the way. Encoding challenges. I had to first turn continuous data into categorical data…

…before I could one-hot encode the data. Association rules mining only works on binary data.

Once the data was ready, I hit another challenge: tweaking the number of columns and the analysis parameters so the run wasn’t too computationally expensive.

…And once I had my results, I had to create a data story to present my findings. To provide business recommendations.

Read on to find out how association rules mining (or market basket analysis) works.

Why it’s a good choice for churn analysis. How I overcame the challenges of encoding and computation time…

…and how it led to a job offer. Which I turned down.

I did some data analytics for a job interview. A bit of predictive analytics. It was for a Senior Data Analyst job interview. Specifically, I did a market basket analysis (or association rule mining or association rule learning – they mean the same thing). It’s machine learning. And I then had to present my findings.

Yeah. It was a lot.

The goal was to identify why some customers had cancelled their subscription (or churned); who they were; and to provide recommendations on how to prevent them from churning.

After presenting, I decided to write this up. Not just because I like the sound of my voice, but because:

  • Hey, this was cool. A bit of predictive analytics, and a bit of machine learning? Of course, this was going on my portfolio.
  • I’m not a Data Scientist. I’m just a guy who works with data (I was a Data Journalist for five years). If I can learn this, and you work with data on a day-to-day basis, you can probably learn how to do this too.
  • I ran into a few encoding speed bumps. Writing about it helps me learn from this. Maybe someone else can learn from it too.
  • Data can come across as mysterious or magical. By breaking this project down and explaining it, I hope I can make it less scary to some. More accessible.

In this visual data story, you’ll learn:

  • What market basket analysis is, and why it’s one of the go-to predictive analytics solutions for business data problems.
  • How to encode your data for market basket analysis.
  • How to carry out market basket analysis in Python (using the mlxtend library).

Before I go on, I mentioned I did this for a job interview. You may wonder… did he get the job? What were the results of this gargantuan task?

Not only did the analysis identify how to recover $1 million from $5.6 million in lost revenue[1], but it also led to a job offer (despite my bad performance in the second stage of the interview[2]). I ended up turning down the offer in favour of another job role in an industry more in line with my career direction. However, I believe this presentation task (and the analysis carried out) was what helped lead to an offer.

The Customer Churn Case Study Brief

I wrote the brief in detail, in case you’re interested in that sorta thing. If you want to skip straight to the problem solving, jump ahead to “How I Approached the Problem – Churn Analysis”.

ERS provide access to Scientific, Technical and Medical research and insights through their platform. They’ve experienced rapid growth since their formation in 2012, with approximately $40 million in subscription revenue recorded in 2016.

ERS was recently acquired (2017) by TRS, a larger provider of research insights.

ERS growth between 2015 and 2016 was quite high, growing 28% year on year. During the acquisition process, the existing ERS management team were asked to conduct a business review to identify opportunities or risks to continued revenue growth.

During the review, the Head of Sales highlighted an alarming issue. He produced the graph below:

The management team were alarmed at the revenue they lost in 2016 through lapsed/churned accounts. With new business compensating so well for this revenue, they hadn’t realised the loss on existing accounts was so large, and were concerned this could be a major risk to future revenue growth.

Every customer receives the same treatment when they renew each year. The management team want to understand how to combat this attrition risk.

As the leading analyst within the business, you’ve been approached by the Head of Sales to help answer this question for the management team:

How can we identify accounts at risk of cancelling their subscription?

Customer Churn Data Provided

Through your contacts in the data team, you’ve been able to source five datasets to help answer the question. They cover a random sample of 10,000 ERS accounts between 2015 and 2016.

  • ACCOUNTS.csv – high-level attributes for each account, including the year the customer subscribed
  • NPS.csv – Net Promoter Score (NPS) survey results for account holders
  • REVENUE.csv – which subscriptions each account holds, and for how much (there are three subscription options)
  • USAGE.csv – number of unique usage sessions for each subscription type
  • USERS.csv – total number of unique users associated with each account per subscription year

Also:

  • There are no duplicates
  • The data is clean – there are no data errors or inaccuracies – but there are NA values.

How I Approached the Problem – Churn Analysis

I chose to approach this as a churn analysis problem.

In a nutshell:

  • Here are the characteristics of all our customers/employees.
  • Here are those who churned.
  • Find out what those who churned share in common.
  • Provide recommendations on how we can stop them from churning.

And that’s lunch.

This requires predictive analytics – the use of data to predict future trends.[3] In this case, looking at a subset of a population and providing scenarios on how that set would behave, with probabilities attached.

Because the output required was categorical, rather than continuous data, I decided a market basket analysis approach would be best.[4]

(Don’t know the difference between continuous and categorical or discrete data? Don’t worry. I’ll explain later.)

What is Market Basket Analysis or Association Rules Mining?

Imagine we run a grocery store.

And we want to find out which items are frequently bought together.

And we have data on the baskets of items that customers have bought.

A market basket analysis, or association rules mining, looks at all these baskets and identifies the items that are frequently bought together.

The higher the lift, the more likely these goods are to be bought together.

A basket with a lift of 3 is three times more likely to be bought together than chance alone would suggest.
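To make lift concrete, here’s a quick made-up example: if 10% of baskets contain wine and 20% contain cheese, then by chance alone we’d expect 10% × 20% = 2% of baskets to contain both. If 6% of baskets actually contain both, the lift of {wine, cheese} is 6% ÷ 2% = 3.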

We can also ask the question “what often gets bought with 𝑥?”, where 𝑥 is a single item, e.g. wine.

We can then see the basket of goods associated with wine or 𝑥.

Here’s the cool part. 𝑥 can be anything. In our example it doesn’t have to be a food item.

It can be a feature of your customers. E.g. repeat customers.

Or customers who spend above £50.

Or customers who never come again.

For this case study, we’re going to investigate the features of customers who churn. Those who cancel their subscription to the scientific publishing platform.

If you want to learn more about market basket analysis or association rules mining, check out the resources in the footnotes.

The “association rules” or the series of factors are those things which the churned group will hold in common.[5]

For our dataset, this could be the country they’re based in, their annual revenue, and the number of employees.

For instance, if we found out the following set of association rules has a lift of 2.1x:

  • Country: Germany
  • Annual Revenue Category: 500m-1bn
  • Number of employees: 1001-5000

Then we know that customers from Germany, who make $500m to $1bn in annual revenue and have between 1001 and 5000 employees, are 2.1 times more likely to churn.

A rule is simply “is this factor true or false?”. So is this subscriber from Germany? True or false.[6]

I’ll use “association rules”, “series of factors”, and “series of variables” interchangeably to mean a set of factors that may or may not lead to churn.

If you see any of these words, just think: stuff the group share in common.


Market Basket Analysis Challenges to Overcome

Of course, it wasn’t smooth sailing.

In fact, there was one moment I thought I’d be done in an hour, only to realise I had to start from scratch. It took many more hours – a couple of days, in fact – to finish.

(A few swear words were also said).

Here are challenges I faced:

  • Doing it outside of Microsoft Excel so I could get a list of association rules, instead of just the rule with maximum lift.
  • Using the mlxtend library in Python, after failing to make another library work.
  • One-hot encoding my data, after realising it wasn’t in the correct format for a market-basket analysis.
  • Discovering I had to turn the continuous data into categorical data before I one-hot encoded it (unless I wanted the computation time to take forever).

I overcame them in this order:

  1. Found out how to do a market basket analysis in Python using a few libraries.
  2. Abandoned one library (pycaret), which promised to be easier, but just plain didn’t work. Figured out how to use mlxtend instead.
  3. Turned my continuous data into discrete data using pandas.
  4. One-hot encoded my now discrete data using pandas.

Carrying out a market basket analysis outside of Excel

You can carry out market basket analysis in Microsoft Excel using the Solver function. David Langer has an excellent video tutorial here.

However, using Microsoft Excel with Solver only provides you with one series of factors (or one association rule) – the one with maximum lift. I wanted more than one rule – a list of association rules, ordered by lift.

This would help me better understand what is driving churn. If one factor appears multiple times, that implies it has a large impact. It could also help identify several cohorts of customers that are at risk.

I would need to leave the safe shores of Microsoft Excel…

Market Basket Analysis (or Association Rules Mining) with the Mlxtend Library

I kept coming across tutorials for creating a list of association rules in R; however, I can’t program in R.

But…I can program in Python.

Reading this tutorial on DataCamp, I found a method to carry out a market basket analysis with Python. The tutorial uses pycaret, a user-friendly library intended for those who aren’t data scientists. It does a lot of the heavy lifting, e.g. auto-encoding the data.

However, the latest version of pycaret has deprecated association rules mining.[7] I tried to use a previous version of pycaret, but ran into dependencies that weren’t compatible with my machine.[8] In short: dependency hell. Move on.

A few searches led me to believe that while mlxtend isn’t as user-friendly as pycaret, it’s more popular, and thus has more documentation and support. It would also provide me with what I needed – a series of association rules with different levels of lift.

I knew the library I was going to use. I’d joined the data in Microsoft’s Power Query.[9] Time to get going, right?

Wrong.

See – there’s one little problem with carrying out association rules mining. One thing which pycaret would’ve done for me in the background.

Remember when I wrote this earlier:

A rule is simply “is this factor true or false?”. So is this subscriber from Germany? True or false.

Yeaaaaaaah. So, about that. My data didn’t exactly come in the form:

Country-Germany | Country-USA
True            | False
False           | True
False           | False

It came in the more common-sense form:

ID    | Country
23553 | Germany
23554 | USA
23555 | France

But to carry out an association rules mining analysis, the columns must be in true or false form only.

I had my data prep work cut out for me.

Turning Continuous Data into Categorical Data using qcut in Pandas

Casting[10] my data into a true or false form is called one-hot encoding. I could have fast-forwarded to this step and one-hot encoded my data straight away, but then I’d have run into the following problem:

One-hot encoding turns your data into a binary true or false form.

If you do this with continuous data without first binning it (turning it categorical), you run into the following problem…

For example, say we have the heights of 10 students.

If we one-hot encode this, we get quite a few columns (nine, because two people share the same height).

That’s just for one category or factor.

This project is looking at over ten columns or factors.

Association rules mining is also computationally expensive. Computation grows exponentially with the number of columns – in the worst case, n true/false columns means 2^n - 1 possible combinations to evaluate.

If we instead bin our heights – for example into low, medium, and high…

We get three columns instead of the nine from earlier.

Which means not just less computation time, but easier-to-digest insights.

I wanted to bin my continuous data into some highly sophisticated levels:

  • Low
  • Medium
  • High

Totally sophisticated.

After some reading, I used pandas.qcut to create bins in my continuous data.

qcut[11] bins a data series into a chosen number of quantiles. This worked perfectly with my super sophisticated levels of Low, Medium, and High.

I also included one more level – ‘Zero or Missing’. If data is missing or zero, that’s important info. It can tell us, for instance, that an account was a customer in 2015 but not in 2016, and is therefore a lapsed/churned customer.
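If you’d like a feel for what that looks like in code, here’s a minimal sketch. The column name usage_2016 is a made-up stand-in for one of the continuous columns, not the real schema:

    import pandas as pd

    # Hypothetical continuous column – e.g. usage sessions per account in 2016
    df = pd.DataFrame({"usage_2016": [0, 3, 12, 45, None, 7, 88, 2, 19, 0]})

    # Bin the non-zero, non-missing values into three quantiles: Low / Medium / High
    binned = pd.qcut(df["usage_2016"].where(df["usage_2016"] > 0),
                     q=3, labels=["Low", "Medium", "High"])

    # Anything that was zero or missing gets its own fourth category
    df["usage_2016_binned"] = (binned.cat.add_categories("Zero_or_Missing")
                                     .fillna("Zero_or_Missing"))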

First we take our continuous data…

And we categorise (or bin) it into four levels. We do this with qcut in pandas.

And then we’re left…

With our categorised (or binned) data.

Want to see the code? Well, I’m keeping this article code light, because:

  1. Stephen Hawking’s editor told him to remove the equations from A Brief History of Time, because “every equation will halve the market” – people switch off when they see code.[12]
  2. I already have a Jupyter notebook write-up that you can read on my GitHub, complete with a table of contents. You can jump to the part where I use qcut there.
  3. I’d really like to move on to the next fun step.

One-Hot Encoding – or Turning Categorical Data into True or False Binary Data

Now that our data is categorised, we can one-hot encode it.

Here we cast (or widen) our dataset. There is a new column for each category. And we mark each piece of data as true where it falls into the category.

Look at the Binned_Zero_or_Missing values. Look at where they are true.

We do this for each category – notice where Binned_Low values are.

And Binned_Medium.

And, finally, Binned_High.

Where a cell isn’t true, we mark it as false.

This is how we one-hot encode our data. How we turn our categorical data into binary data. Into true or false. One or zero.

We can remove our original column now. Our script has all the information it needs to run a market basket analysis.

We’ve turned our lovely continuous data into four categories:

  • Zero or missing
  • Low
  • Medium
  • High

Now it’s time for the final step before we can carry out our association rules mining: one-hot encoding these categories into a series of columns with True or False as the values.

We can do this with the pandas get_dummies() function. By default, get_dummies only encodes string (object) or categorical columns, so I made sure all my numbers were stored as strings before running it.
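A minimal sketch of that step, using an illustrative column rather than the real dataset:

    import pandas as pd

    # Illustrative binned column – not the actual data
    df = pd.DataFrame({"Usage_2016_Binned": ["Low", "High", "Zero_or_Missing", "Medium", "Low"]})

    # get_dummies one-hot encodes the string column into one True/False column per category
    one_hot = pd.get_dummies(df, dtype=bool)

    print(one_hot.columns.tolist())
    # ['Usage_2016_Binned_High', 'Usage_2016_Binned_Low',
    #  'Usage_2016_Binned_Medium', 'Usage_2016_Binned_Zero_or_Missing']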

Executing the Market Basket Analysis or Association Rules Learning

Now our data is encoded properly, we can finally – finally – do our analysis. The actual data work that gives us our insight. Everything before was getting our ingredients ready before we cook the main meal.

But, hold your horses. We don’t just run mlxtend on all our data. Association rules learning is computationally expensive. In fact, it’s exponentially expensive.

The more columns we have, the more it costs to compute.

Which makes sense: the algorithm is evaluating each possible combination of factors and working out its lift.

The difference between running the right or wrong set of data can be 15 minutes versus a couple of hours on a MacBook Pro.[13]

The adage “garbage in, garbage out” also holds here – if you don’t filter your columns right, you’ll just get a bunch of association rules which make you go “no sh*t Sherlock”.

The first time I ran this analysis, I forgot to filter out the relevant columns. Not only did it take 45 minutes, but I got 50 or so rules which associated the “zero or missing” columns with the lapsed customers. This told me nothing. Why? Those columns were what defined a lapsed customer.

I dropped all those columns and began my analysis.
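Here’s a sketch of that pruning step. The column names are hypothetical – the point is to keep a single churn flag and drop the other columns that are true by definition for a lapsed account:

    import pandas as pd

    # Stand-in for the full one-hot encoded dataset (illustrative columns only)
    one_hot = pd.DataFrame({
        "Revenue_2016_Zero_or_Missing": [True, False, True],   # doubles as the churn flag here
        "Usage_2016_Zero_or_Missing":   [True, False, True],
        "Users_2016_Zero_or_Missing":   [True, False, False],
        "Country_Germany":              [False, True, True],
    })

    # Keep the churn flag, drop the other "Zero_or_Missing" columns –
    # they only restate the fact that the account has lapsed
    churn_flag = "Revenue_2016_Zero_or_Missing"
    tautological = [c for c in one_hot.columns
                    if "Zero_or_Missing" in c and c != churn_flag]
    pruned = one_hot.drop(columns=tautological)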

Using Parameters and Thresholds to Guide the Machine Learning

Our objective is to find the variables, or the series of factors, that when associated with each other, will likely lead to a churned customer.

Not to get too mathsy[14], but three parameters identify the strength of an association rule:

  • Support
  • Confidence
  • Lift

Support is the probability of the event occurring in our dataset.

Confidence is a tricky one to explain. It’s a measure of how likely this series of factors is to occur together.[15]

Lift is what we’re after.
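For reference, the textbook definitions for a rule A → B are roughly:

    support(A → B)    = P(A and B)
    confidence(A → B) = P(B | A) = support(A → B) / support(A)
    lift(A → B)       = confidence(A → B) / support(B)

A lift above 1 means the combination occurs more often than it would if the factors were independent.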

With mlxtend we can define the thresholds these parameters must meet. The higher we set the thresholds, the fewer rules we’ll get, but the higher quality they’ll be.

Higher thresholds also shorten our computation time.

For statistical significance, we can (loosely) think about support like a p-value, so 5% will be our absolute minimum. I initially set this to 10% and received almost no rules, so I dropped it to 5%.

For the confidence, I initially set that to 60%, and got no rules. I dropped it to 50% and got three rules (all with a lift above 2). Enough to give me the insights I needed for this case study.

For fun, I ran it again with a confidence level of 30%. The other rules that came up provided some context, but not enough to provide business recommendations. 

Finally, I exported the results as a CSV.
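Here’s a minimal sketch of this stage with mlxtend. The thresholds match the ones above; the churn flag column name is hypothetical, and pruned stands for the one-hot encoded, filtered True/False DataFrame from the earlier steps:

    from mlxtend.frequent_patterns import apriori, association_rules

    # pruned: the one-hot encoded, pruned True/False DataFrame prepared earlier
    frequent_itemsets = apriori(pruned, min_support=0.05, use_colnames=True)

    # Generate rules, keeping those with confidence of at least 50%
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

    # Keep only the rules that predict the (hypothetical) churn flag, sorted by lift
    churn_flag = "Revenue_2016_Zero_or_Missing"
    churn_rules = (rules[rules["consequents"].apply(lambda c: c == frozenset([churn_flag]))]
                   .sort_values("lift", ascending=False))

    churn_rules.to_csv("churn_association_rules.csv", index=False)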

The Results of My Market Basket Analysis (and telling the data story)

All that work…all that effort…but what were the results? What were the insights?

Older accounts, with fewer than four users, and with a missing NPS score, were 3.12x more likely to churn.

What’s great about this result is that we can see how important the age of the account is. In all three cohorts, customers who joined in 2012 are likely to churn.

Job done right? Just send off this chart, and we can tick off this interview task?

Wrong.

Because, even if you have your insight, the best data people know it’s about the story you tell. That’s how stakeholders come away with actionable next steps.

My task was to create a presentation. Before I revealed my findings, I had to tell a story.

Building the Data Story

For our insights to have maximum impact, we need to build a story leading to the conclusion.

Now that I knew where to look (age of accounts, NPS scores, and number of users), I could quickly do some analysis to help build a credible story.

On the morning of the interview, I built out my presentation in PowerPoint and quickly ran some analysis[16] using pivot tables in Excel. It took me less than half an hour. (Deadlines are powerful things.)

If I’d done this in the beginning, without knowing where to look, it would have taken me hours (possibly more than two days) to find these trends.

Predictive analytics isn’t always the answer, and it requires a certain skillset to carry out, but it can help sharpen the questions you ask of your data.

I created the following charts, comparing 2015 and 2016 customers, and lapsed and still subscribed customers:

  1. First, I established context: how much were these churned customers worth? Answer: $297,000, or 18% of revenue.
  2. Second, I looked at churn by number of users: the lower the user count, the more the churn.
  3. Third, I explored revenue by account age: older accounts brought in less revenue.
  4. Finally, I hinted at the scale of the issue: 14% of accounts were in the oldest cohort.

This data story helped build towards our main conclusion: old accounts were the most likely to churn.

With that data story, we could create the presentation.

Presenting My Market Basket Analysis

The task called for no more than eight slides of content, including my methodology.

The entire purpose of this task, this case study, was to provide business recommendations.

No one cares about how amazing your data stack is or the tools you used. All they care about is:

  • What is the problem?
  • How big is the problem?
  • What is your recommendation/what are the takeaways?
  • Is it grounded in data?

With that, I outlined my presentation:

  1. Introduction
  2. The problem and my approach
  3. Context (the data story in the previous section)
  4. Findings
  5. Recommendations
  6. Methodology (I)
  7. Methodology (II)

You can view the presentation below:

My recommendations boiled down to:

  • Provide more support and attention to older accounts.
  • Make sure the needs of accounts with 1-4 users are being met. E.g. ensure they aren’t overpaying.
  • Encourage customers to do NPS surveys, as there’s a correlation between missing NPS surveys and churned accounts.[17]

Throughout the presentation, I made sure to bring all my findings back to revenue. How much were these customers worth? If they acted on the recommendations, how much revenue could they save?

If half of the accounts identified at risk are prevented from churning, TRS would save $850,000 to $1 million.

And finally, in the methodology section, I discussed my approach in a way the nerds would appreciate.

The Business Result of My Market Basket Analysis

What was the result of my presentation?

Despite my poor second interview performance, this case study and presentation were a hit. I received a job offer 24 hours later.

As I mentioned, I chose to go with another job offer, as it was better aligned with my career direction.

Honestly? This offer was a surprise. I thought I’d blown my chances after my second interview. I think this presentation was strong enough to recover my application.

Lessons Learned from My Association Rules Mining Adventure

This was one of the first pieces of predictive analytics I’d done in Python. I’d carried them out in training and in exercises from textbooks, but I had yet to carry one out in a business setting.

I spent too much time trying to make pycaret work with outdated dependencies on my machine. In hindsight, I should have cut my losses earlier and started figuring out how to work with mlxtend.

I was reminded of the importance of filtering and pruning your data before carrying out machine learning. My first execution resulted in over one hundred rules, which were tautological and garbage.

Finally, I learned the importance of transforming and encoding your data correctly for your project. This was my first time one-hot encoding data.

If I Were to Do Another Association Rules Learning or Predictive Analytics Project

My biggest mistake (other than wasting time trying to make pycaret work) was running my association rules mining script on “tautological” columns. As mentioned earlier, these created garbage output and lengthened my computation time.

If I were to carry out this exercise again, I would spend a bit of time thinking forward: which columns will be output in my one-hot encoded data? Which of these, by definition, means the user has churned? Mindful of these, I would exclude them at an appropriate stage in the data prep process.

Outro: The Data Iceberg

There’s something I’m going to call “the data iceberg”.

[Image: the data iceberg]

All that stakeholders and your audience see are the visualisations and the insights.

Maybe they’re aware of some of the underlying processes to get there. But, in my experience, most of them are not.

Some of the things they’re not aware of:

  • The data gathering process. The arm wrestling you did with stakeholders to get that data.
  • The hardware & software limitations you faced during the process, forcing you to make trade-offs.
  • An encoding problem that you didn’t foresee, which can be so complicated to explain to a non-technical stakeholder that it feels like you’re saying, “I can’t do the magic thing because the magic other thing isn’t in the magic way I need it to be.”
  • That email you had to chase up, to get an answer to the question “when you say 2021 data, is that from January 1st, or from the start of the financial year?”, which no one replied to, because they thought they’d put that in the metadata file, but they didn’t.

Now and then, I like to do a write-up of a data project, not just to illustrate this data iceberg to non-data people, but also to myself.

In the middle of a project, when it’s taking twice as long as expected, the imposter syndrome kicks in: “I must be stupid”, “if I was better this task would be done by now”, “I need to hide this mistake, otherwise they’ll think I’m just a fraud”.

Write-ups like these give me not just a chance to reflect and see where I could improve. They also provide a sense of perspective.

They show the unexpected challenges that I overcame. They provide a scale of the work done to get what seems like a few insights.

They show just how valuable those insights are. This was the work required to get them. Hindsight bias kicks in, and you think “of course that’s the answer!”, but you couldn’t have known that without the work.

They prove, just for a short period, that maybe I’m not a fraud. Maybe I’m not an impostor. Maybe I do know what I’m doing.

And maybe, just maybe, I can take on a bigger data challenge.

If you have a project or a role you think I’d be a good fit for, reach out to me on LinkedIn. Alternatively, you can contact me by clicking the “Contact” link in the top menu, but be aware I get a lot of spam, so your email could get lost in the noise.

(Credit to Tima Miroshnichenko & Anna Nekrashevich for two photo assets used in the featured image.)


Footnotes

  • 1
    Fictional business scenario, but was inspired by real business problems the company faces.
  • 2
    The interview had three stages after the recruiter screening: 1) introduction/behavioural interview, 2) technical interview, 3) presentation task. The technical interview had a live data solving element which I didn’t complete (logistic regression), and I floundered some of my answers in the Q&A (though I did demonstrate my logical approach to breaking down and solving problems). My advice? This is par for the course. Every interview has a different set of data challenges, and some of them will be ones you’ve never faced in a business setting. All you can do is practice live interviews, fail them, and learn from them. I’ll never mess up a logistic regression in an interview again.
  • 3
    Catherine Cote, “What Is Predictive Analytics? 5 Examples”, Harvard Business School, 26 October 2021 https://online.hbs.edu/blog/post/predictive-analytics
  • 4
    A logistic regression would be more appropriate for continuous data.
  • 5
    There are two ways you can carry out a market basket analysis. Method one: you just look at which rules associate together. E.g. you’re a grocery store and you want to see which goods tend to be bought together – so a basket of milk, flour, and eggs might have a lift of 4x if you have a street full of bakers. Method two: you freeze one rule, e.g. red wine, and ask which basket of goods is associated with that – e.g. steak and potatoes has a lift of 2x, fish and pasta 1.5x. Market basket analyses are agnostic: all that matters is that the data is true or false. So you could have one column which is returning customers, and you want to know the basket of goods associated with returning customers. Churn analysis is method two – you fix one of the rules, and see what is associated with it. Also – this is where the term “market basket analysis” comes from. It was historically used by grocery stores seeking to understand their customers by looking at a basket of goods. One day I’ll write about how the Tesco Clubcard was the killer app that launched Tesco to the top of the food chain.
  • 6
    Those of you who are really switched on might see this foreshadowing of the encoding problem to come. Hint: the data isn’t in true or false form.
  • 7
    I spent more time than I care to admit reading and searching pycaret’s knowledge base, running searches on Google, and querying Gemini and ChatGPT, to find out where on earth I could go to simply run an association rules mining analysis, like I could in the previous version demoed in that tutorial. Perhaps I’m just wrong, and I should have googled harder. But life is short. Plus I had a deadline.
  • 8
    Honestly, in all the time I’ve been using my Apple Silicon machine, I’ve never experienced any issues with the difference in CPU architecture from Intel. Except here. This time. I’m embarrassed to admit how many alternatives I tried in order to get the dependencies working before I moved on from pycaret. I’m a stubborn guy.
  • 9
    Honestly, this is just habit. Not because Power Query is the best tool (especially when you’re on a Mac. I used to have to virtualise Windows in my old job) – but because it’s intuitive to use, and everyone has Excel. You can do quick cleans and data transformations. Joining multiple tables is laborious, but you can duplicate your data transformation steps. Also, if you pick up the project a year later, you can see what you did step by step. Very handy when you’re a one man data band in a fast growing startup.
  • 10
    Did I break my arm and mean a medical cast? Or am I a wizard casting a spell? No. Casting is a data transformation term. When you transform a long (or tall) dataset to a wide dataset, it’s called casting. The opposite (wide to tall) is called melting. Most of my data transformation tasks prior to this project had been of the melting variety. For more on the wonderful world of data transformations, I recommend the paper Tidy Data by Hadley Wickham in the Journal of Statistical Software. (https://vita.had.co.nz/papers/tidy-data.pdf)
  • 11
    The alternative to qcut is pandas cut. This is great if, for instance, you want to bin a series of ages into age ranges – it just divvies up a continuous range into categories. However, it doesn’t tell you about the distribution of that range, which is what qcut is good at: it tells you if a data point is, for instance, in your upper quartile, or your top 10%. Because all I wanted was to know whether my data was “high”, “medium”, or “low” relative to the distribution, qcut was the better choice for this exercise.
  • 12
    Or System 1 vs System 2 thinking. Reading is System 1. Reading code is System 2 thinking.
  • 13
    MacBook M1 Pro as of the time of writing.
  • 14
    If you want to go down the rabbit hole of the maths of association rules mining, start here https://www.kaggle.com/code/gauravchopracg/introduction-to-association-rule-mining and jump to the “Association Rule Learning” subheading. For another plain-English explanation of support and confidence in association rule learning, go here https://www.geeksforgeeks.org/what-is-support-and-confidence-in-data-mining/. But just remember, you don’t need to know how an engine works to drive a car.
  • 15
    I’m sorry to every statistician reading this – I had to butcher this definition otherwise this article will get very technical, very fast.
  • 16
    For you jargon lovers – I mean summary & descriptive statistics.
  • 17
    Of course, correlation doesn’t equal causation. But it’s worth a shot. One can hypothesise that if customers fill in NPS surveys, any disgruntlement can be caught before they lapse.