Project Overview & Problem
Like Spotify, Sparkify is a music streaming service that lets its users listen for free (with ads) or pay a monthly subscription to listen without ads.
Being able to predict churn is critical for a subscription-based company: it helps identify the customers who are most likely to leave, so the company can act to retain them before they churn.
Retaining existing customers is usually more cost-effective than acquiring new ones.
Data science can help anticipate customers’ needs and deliver timely, targeted offers.
Sparkify, like every company, has to be economically viable, and we assume that most of its revenue comes directly from the premium offer (monthly subscription).
By analyzing the historical data provided by Sparkify, our goal is to detect and predict whether a customer is in the process of cancelling or downgrading their account.
We’ll use Spark to build a machine learning model that addresses this challenge.
The Dataset
We will work with a small fraction (128 MB) of the original data, containing 225 unique accounts and 286,500 transactions.
We use such a small sample to keep training and testing time short. The model could later be deployed on a cluster with the full 12 GB dataset, which may give less imbalanced results.
Data Exploration
The chunk of data provided is in .json format with the following schema:
# | Column | type | Description |
1 | userId | string | ID of the user |
2 | artist | string | Name of the artist |
3 | auth | string | “Logged in” or “Cancelled” |
4 | firstName | string | First name of the user |
5 | gender | string | Gender of the user, “F” or “M” |
6 | itemInSession | Long Int | Item in session |
7 | lastName | string | Last name of the user |
8 | length | Double | Length of the song related to the event |
9 | level | string | Level of the user’s subscription, “free” or “paid”. User can change the level, so events for the same user can have different levels |
10 | location | string | Location of the user at the time of the event |
11 | method | string | “GET” or “PUT” |
12 | page | string | Type of action: “NextSong”, “Login”, “Thumbs Up” etc. |
13 | registration | Long Int | Registration timestamp of the user |
14 | sessionId | Long Int | Session id |
15 | song | string | Name of the song |
16 | status | Long Int | Response status: 200, 404, 307 |
17 | ts | Long Int | Timestamp of the event |
18 | userAgent | string | User agent (browser and operating system) used for the event |
The ‘ts’ and ‘registration’ columns are timestamps stored in a raw numeric format. Once converted to readable dates, we can observe that the sample file begins on 2018-10-01 02:01:57 and ends two months later on 2018-12-03 02:11:16.
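As a minimal sketch of that conversion, assuming the raw values are Unix epoch timestamps in milliseconds (the text above does not state the unit) and using plain Python rather than Spark. The function name `to_datetime` and the sample value are illustrative, not taken from the project code:

```python
from datetime import datetime, timezone

def to_datetime(epoch_ms):
    """Convert an epoch timestamp in milliseconds (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)

# Illustrative value, not read from the dataset.
sample_ts = 1538352117000
print(to_datetime(sample_ts).isoformat())
```

The result is expressed in UTC; a display in another time zone would shift the printed hour accordingly.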
The JSON file contains 286,500 rows. The columns fall into three patterns: columns with no missing values, columns with 8,346 missing values, and columns with 58,392 missing values.
The rows behind the 8,346 missing values represent only 2.9% of the dataset, so I decided to remove them.
The artist, length and song columns have exactly the same number of null values, mainly because many page events do not involve a song.
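The missing-value check and the row removal could look roughly like the sketch below. It is plain Python over a toy list of events, not the Spark code used in the project. The helper names are hypothetical, and using `userId` as the column that decides which rows to drop is an assumption, since the text does not name the affected columns:

```python
def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""

def count_missing(rows, columns):
    """Count, per column, how many rows have a missing value."""
    return {col: sum(1 for row in rows if is_missing(row.get(col))) for col in columns}

def drop_missing(rows, column):
    """Keep only the rows where `column` is present."""
    return [row for row in rows if not is_missing(row.get(column))]

# Toy events, invented for illustration.
events = [
    {"userId": "1", "page": "NextSong", "song": "A", "artist": "X", "length": 200.0},
    {"userId": "1", "page": "Home", "song": None, "artist": None, "length": None},
    {"userId": "", "page": "Home", "song": None, "artist": None, "length": None},
]

print(count_missing(events, ["userId", "song", "artist", "length"]))
print(len(drop_missing(events, "userId")))  # "userId" is an assumed choice of column

# Share of the dataset removed, using the figures quoted above.
print(round(8346 / 286500 * 100, 1))
```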
To begin the data exploration, I checked the different values of the page column.
We define a churned user as a user who reached the cancellation confirmation page. The plot above shows that 173 users never reached the cancellation event, while 52 users did reach the cancellation confirmation page.
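The churn definition above can be sketched as a per-user label. This is a simplified, plain-Python illustration, not the project's Spark code. The page value `"Cancellation Confirmation"` is assumed from the wording of the text, and `label_churn` is a hypothetical helper name:

```python
CHURN_PAGE = "Cancellation Confirmation"  # assumed page value

def label_churn(events):
    """Return {userId: 1 or 0}: 1 if the user has at least one churn-page event."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = event["page"] == CHURN_PAGE
        # Once a user is flagged as churned, the flag is kept.
        labels[user] = 1 if hit or labels.get(user) == 1 else 0
    return labels

# Toy events, invented for illustration.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "Home"},
    {"userId": "1", "page": "Home"},
]
print(label_churn(events))
```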
From the churned users per gender plot, we can observe that the proportion of churned users is roughly similar across genders, although male users seem somewhat more likely to churn.
89 male and 84 female users did not churn, while 32 male and 20 female users did churn (about 26% of men versus 19% of women).
The level of the user determines whether the user has a “free” or “paid” membership.
From this plot we can observe that most of the churned users were paying for a membership (31 people).
Lifetime is a feature that gives the number of days a user has been on the service, whether they churned or not. It is based on the registration date and the event timestamp.
From this boxplot we can observe that users who did not churn tend to be longer-term users, with a median of 75 days versus 50 days for users who cancelled their membership.
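One way to compute such a lifetime feature is sketched below. It assumes that both `registration` and `ts` are epoch timestamps in milliseconds and that lifetime is measured up to the user's last event. Both points are assumptions, and `lifetime_days` is a hypothetical helper written in plain Python rather than Spark:

```python
from statistics import median

MS_PER_DAY = 24 * 60 * 60 * 1000  # assumes millisecond timestamps

def lifetime_days(events):
    """Return {userId: days between registration and the user's last event}."""
    registered = {}
    last_seen = {}
    for event in events:
        user = event["userId"]
        registered[user] = event["registration"]
        last_seen[user] = max(last_seen.get(user, event["ts"]), event["ts"])
    return {user: (last_seen[user] - registered[user]) / MS_PER_DAY for user in last_seen}

# Toy events, invented for illustration.
events = [
    {"userId": "1", "registration": 0, "ts": 10 * MS_PER_DAY},
    {"userId": "1", "registration": 0, "ts": 50 * MS_PER_DAY},
    {"userId": "2", "registration": 0, "ts": 75 * MS_PER_DAY},
]
lifetimes = lifetime_days(events)
print(lifetimes)
print(median(lifetimes.values()))
```

Comparing the median of these values for churned and non-churned users gives the kind of summary shown in the boxplot.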
Windows and Mac are the most common operating systems among Sparkify users, and they show about the same proportion of churned users. Linux stands out, with more than half of its users having churned.
Looking at specific versions, Windows 7 is the most common among Sparkify users, followed by macOS.
Feature Engineering
# | Feature | Description |
1 | UserId | Unique identification number of the user |
2 | churn | Whether the user cancelled their subscription |
3 | gender | Male or female |
4 | sessionCount | Number of sessions of the user |
5 | minSessionTime | Shortest session time of the user |
6 | maxSessionTime | Longest session time of the user |
7 | AvgSessionDurationMinutes | Average session time of the user |
8 | AvgSongsPerSession | Average number of songs the user listened to per session |
9 | artistCount | Number of artists the user has listened to |
10 | level | Whether the user pays for a membership or not |
11 | SongCount | Number of songs the user has listened to |
12 | days_total_subscription | Number of days the user has been registered |
13 | Region Northeast | Number of users connected in the Northeast |
14 | Region Midwest | Number of users connected in the Midwest |
15 | Region West | Number of users connected in the West |
16 | Region South | Number of users connected in the South |
17 | About | Number of About page events per user |
18 | Add Friend | Number of Add Friend events per user |
19 | Add to Playlist | Number of Add to Playlist events per user |
20 | Downgrade | Number of Downgrade events per user |
21 | Error | Number of Error page events per user |
22 | Help | Number of Help page events per user |
23 | Home | Number of Home page events per user |
24 | LogOut | Number of logouts per user |
25 | NextSong | Number of NextSong events per user |
26 | Roll Advert | Number of Roll Advert events per user |
27 | Save Settings | Number of Save Settings events per user |
28 | Settings | Number of Settings page events per user |
29 | SubmitDowngrade | Number of Submit Downgrade events per user |
30 | SubmitUpgrade | Number of Submit Upgrade events per user |
31 | Thumbs down | Number of Thumbs Down events per user |
32 | Thumbs Up | Number of Thumbs Up events per user |
33 | Upgrade | Number of Upgrade events per user |
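The per-page event counts (features 17 to 33) amount to counting events per user and per page value. A minimal plain-Python sketch of that idea follows. It is not the Spark code used in the project, and `page_counts_per_user` is a hypothetical helper:

```python
from collections import Counter, defaultdict

def page_counts_per_user(events, pages):
    """Return {userId: {page: count}}, with 0 for pages the user never triggered."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["userId"]][event["page"]] += 1
    # A Counter returns 0 for keys it has never seen.
    return {user: {page: c[page] for page in pages} for user, c in counts.items()}

# Toy events, invented for illustration.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "Roll Advert"},
]
features = page_counts_per_user(events, ["NextSong", "Thumbs Up", "Roll Advert"])
print(features)
```

Filling in zeros for pages a user never visited keeps every user on the same set of columns, which is what a model input table needs.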
All the above features are used to create our final dataframe. It is composed of 1 dependent variable (churn) and 32 independent variables, which are used to train the following supervised machine learning models:
- Logistic Regression
- SVM
- Decision Tree
- Random Forest
As seen on the churned users plot, churned users represent a small share of all users. That is why we chose the F1 score as the evaluation metric: it is the harmonic mean of precision and recall, which makes it better suited than accuracy to imbalanced classes such as churned users.
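To illustrate why accuracy can mislead here, the sketch below computes accuracy and the F1 score of the churn class for a naive model that predicts "no churn" for every one of the 225 users (173 non-churned, 52 churned). It is a plain-Python illustration of the binary F1 formula. The evaluator used in the project may average F1 differently across classes, and the function names are hypothetical:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Class sizes taken from the text: 52 churned users, 173 non-churned users.
y_true = [1] * 52 + [0] * 173
y_naive = [0] * 225  # naive model: nobody churns

print(round(accuracy(y_true, y_naive), 3))
print(f1_score(y_true, y_naive))
```

The naive model looks decent on accuracy because most users do not churn, yet its F1 score for the churn class is zero because it never identifies a single churned user.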
The above plot shows which features impact our model the most when predicting churn. We observe that:
- People who show dissatisfaction through thumbs down, logout or downgrade events are the ones who leave more easily.
- Apart from these, people from the Midwest and West appear more likely to churn.
- Roll advertising is also a feature to consider when looking at churned users.
- In contrast, people who have used Sparkify for a long time are less likely to churn or cancel their subscription.
Conclusion
We used Spark machine learning models to estimate churn for Sparkify’s users from previous log data.
With an F1 score of 74.89%, we developed a model that can detect customers at risk of churning.
Once we had the dataset, we trained four distinct types of models and fine-tuned each one to see which performed best.
At first sight the results seem acceptable, but the dataset was too small to be statistically significant and was highly imbalanced. To get better outcomes, it would be interesting to have access to a cluster and test the model on a larger dataset.
Finally, we still managed to predict churn, which is a good start, even if the results would be more accurate with more recent log data (this sample dates from 2018). Sparkify could run the model every day and use A/B testing to improve customer satisfaction, offering discounts or targeted messages to the users most likely to churn.