Music Churn Prediction
Overview of Notebooks
For this project, the team created 3 separate Jupyter Notebooks to document its work:
1) Data Preparation / Feature Extraction Notebook: This notebook gives an overview of the project, takes the raw data, performs some initial exploration, and generates features for the predictive models. It also performs a brief exploratory data analysis on the feature set. This notebook outputs a .pkl file of features for the second notebook to read, which saves considerable time when building the models.
2) Predictive Modeling Notebook: This notebook reads the .pkl file, builds machine learning models to predict user churn, calculates and calibrates churn probabilities, and generates a projected economic impact of users who leave.
3) Initial Data Sourcing and Validation Notebook (HTML file): This is a static notebook (uploaded as HTML file - not intended for executing code) that documents two other aspects of the project that don't logically fit in either of the first two notebooks:
- First, it contains the initial data extraction code used in Google BigQuery to reduce the data set from ~30GB down to ~1.6GB, enabling it to run on local machines.
- Second, it contains code that performs data integrity checks, validating that the items extracted in our smaller data set approximately match those in the full data set (e.g., same level of churn, same timeframe, etc.)
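The integrity checks boil down to comparing summary statistics between the full and reduced sets. A minimal sketch of one such check, comparing churn rates on toy stand-in data (the column name `is_churn` matches the Train table; the tolerance is an illustrative assumption):

```python
import pandas as pd

def churn_rate(labels: pd.DataFrame) -> float:
    """Fraction of users with a positive churn label."""
    return labels['is_churn'].mean()

# Toy stand-ins for the full and reduced label sets
full = pd.DataFrame({'is_churn': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
sample = pd.DataFrame({'is_churn': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

# Validate that the reduced set approximately preserves the churn rate
assert abs(churn_rate(full) - churn_rate(sample)) < 0.15
```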
Table of Contents (this notebook only)
- Project Overview
- Data Set Overview
- Initial Data Loading
- User Logs Data: Preparation and Feature Extraction
- Transaction Data: Preparation and Feature Extraction
- Joining Features and Data Manipulation
- Quick Exploratory Data Analysis
- Writing Output
1. Project Overview
This dataset comprises data collected by WSDM on a music streaming subscription service available through KKBOX.
Project Goals:
The project aims to accomplish the following goals:
- Create a model to predict customer churn from usage and transaction data
- Create an economic model for retention
- Recommend a process for keeping the churn and economic retention models updated with latest information
2. Data Set Overview
The initial data set contains 24 variables (23 input variables and 1 variable to predict), spread across 4 tables. Additional details:
- Original format: csv
- Total Size: 31.14 GB, reduced to 1.6GB for analysis on local machines.
- User Count: 1.02 million labeled users contained in the Train table (88,544 users after reduction)
- Date Range: The data on customer usage of and transactions with the service spans 26 months, from Jan. 2015 to Feb. 2017. However, one of the data fields is the date each user initially joined the service, with dates ranging from 2004 to 2017.
- Balance: Approximately 6% of users in the data churned (positive labels); the remaining 94% stayed (negative labels).
Listed below are the tables and variables or features available for study:
Table: Transactions
This table contains transaction data for each user. Each row is a payment transaction.
- Data Shape: 21.5M rows X 9 columns
- Data Size: 1.6GB
Data Fields:
- Msno: User ID
- Payment_method_id: Payment Method
- Payment_plan_days: Length of plan
- Plan_list_price: Price for the plan
- Actual_amount_paid: Amount paid
- Is_auto_renew: T/F flag determining whether membership is auto-renew or not
- Transaction Date: Date of purchase
- Membership_expire_date: Expiry date
- Is_cancel: T/F flag indicating whether the user canceled service. This field is correlated with the is_churn label, though it isn’t identical, as it also captures users who change their service plan.
Table: User Logs
This table records how and when each user used the service. Each row is a unique user-date combination.
- Data Shape: 392M rows X 9 columns
- Data Size: 29.1GB
Data Fields:
- Msno: User ID
- Date: Date of the logged activity
- Num_25: Number of songs played < 25% of song length
- Num_50: Number of songs played between 25% and 50%
- Num_75: Number of songs played between 50% and 75%
- Num_985: Number of songs played between 75% and 98.5%
- Num_100: Number of songs played between 98.5% and 100%
- Num_unq: Number of unique songs played
- Total_secs: Total seconds played
Table: Members
Demographic data on each user. Each row represents a unique user.
- Data Shape: 6.8M rows X 6 columns
- Data Size: 0.4GB
Data Fields:
- Msno: User ID
- City: City of the user
- BD: Age of the user
- Gender: Male, Female or Blank
- Registered_via: Registration method
- Registration_init_time: Initial time of registration
- Expiration_date: Expiration of membership
Table: Train
Labels of which users churned. Each row represents a unique user.
- Data Shape: 1.0M rows X 2 columns
- Data Size: 45MB
Data Fields:
- Msno: User ID
- Is_churn: T/F flag variable we are trying to predict.
3. Initial Data Loading
This analysis is performed in the cells below.
#Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Set initial parameter(s)
pd.set_option('display.max_rows', 200)
pd.options.display.max_columns = 2000
Loading the data and indexing by the primary key (msno: a string-like/object field representing the user):
#Load the data
members = pd.read_csv('members_filtered.csv')
transactions = pd.read_csv('transactions_filtered.csv')
user_logs = pd.read_csv('user_logs_filtered.csv')
labels = pd.read_csv('labels_filtered.csv')
#Set indices
members.set_index('msno', inplace = True)
labels.set_index('msno', inplace = True)
user_logs.head()
msno | date | num_25 | num_50 | num_75 | num_985 | num_100 | num_unq | total_secs | |
---|---|---|---|---|---|---|---|---|---|
0 | MVODUEUlSocm1sXa+zVGpJazPrRFiD4IzEQk0QCdg4U= | 20170217 | 37 | 2 | 2 | 3 | 30 | 66 | 9022.818 |
1 | o3Dg7baW8dXq7Jq7NzlVrWG4mZNVvqp62oWBDO/ybeE= | 20160209 | 36 | 5 | 2 | 3 | 48 | 71 | 13895.453 |
2 | 6ERcO7aqAKvrQ2CAvah79dVC7tJVZSjNti1MBfpNVW4= | 20151210 | 26 | 9 | 3 | 0 | 51 | 54 | 13919.805 |
3 | Xt9VAHNtHuST21tkcZSnGKjwv8vF8/COnsf6z28+fKk= | 20161025 | 22 | 8 | 4 | 2 | 49 | 75 | 15147.842 |
4 | zSgTJqoosTiFF7ZZi1DPTHgxLbnd99IgOEsTIDCcZHc= | 20160904 | 26 | 3 | 1 | 0 | 39 | 60 | 10558.829 |
Performing a quick inspection of the data:
print('Transactions: \n')
transactions.info()
print('User Logs: \n')
user_logs.info()
print('Members: \n')
members.info()
print('Labels: \n')
labels.info()
Transactions:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353459 entries, 0 to 1353458
Data columns (total 9 columns):
msno 1353459 non-null object
payment_method_id 1353459 non-null int64
payment_plan_days 1353459 non-null int64
plan_list_price 1353459 non-null int64
actual_amount_paid 1353459 non-null int64
is_auto_renew 1353459 non-null int64
transaction_date 1353459 non-null int64
membership_expire_date 1353459 non-null int64
is_cancel 1353459 non-null int64
dtypes: int64(8), object(1)
memory usage: 92.9+ MB
User Logs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19710631 entries, 0 to 19710630
Data columns (total 9 columns):
msno object
date int64
num_25 int64
num_50 int64
num_75 int64
num_985 int64
num_100 int64
num_unq int64
total_secs float64
dtypes: float64(1), int64(7), object(1)
memory usage: 1.3+ GB
Members:
<class 'pandas.core.frame.DataFrame'>
Index: 89473 entries, mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= to EFbHYa9/MiKYiyrl05cZ34Cky0FDeHxTYij0pXwkr2A=
Data columns (total 5 columns):
city 89473 non-null int64
bd 89473 non-null int64
gender 46137 non-null object
registered_via 89473 non-null int64
registration_init_time 89473 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.1+ MB
Labels:
<class 'pandas.core.frame.DataFrame'>
Index: 99825 entries, 3lh94wH+UPK7ENgnA5svzFMYfJJRMZHU/WjgvhRJPzc= to DIgxCOJBeanFdqLOOPMTzwwkqgREVG+g1pwfY5LWvC4=
Data columns (total 1 columns):
is_churn 99825 non-null int64
dtypes: int64(1)
memory usage: 1.5+ MB
Helper routine to format the date for visualization:
def pd_to_date(df_col):
"""Function to convert a pandas dataframe column from %Y%m%d format to datetime format.
Args:
df_col (column in a pandas dataframe): The column to be changed.
Returns:
The same column in datetime format.
"""
df_col = pd.to_datetime(df_col, format = '%Y%m%d')
return df_col
#Convert date column to date format
user_logs['date'] = pd_to_date(user_logs['date'])
The next two sections prepare the 2 major data tables/frames (User Logs & Transactions) independently and then bring them together for analysis.
4. User Logs Data: Preparation and Feature Extraction
We first create our groupby object to ultimately aggregate data by users:
#Create our groupby user object
user_logs_gb = user_logs.groupby(['msno'], sort=False)
The next cell creates three new columns:
- max_date: The latest date on which each user has logged activity
- days_before_max_date: The number of days between the max date and the date of the current record.
- listening_tenure: The number of days between the max date and min date of the current user. The hypothesis for this feature is that a user who's been using the service for a long time may be less likely to churn than one who's been using it for a short time.
#Append max date to every row in main table
user_logs['max_date'] = user_logs_gb['date'].transform('max')
user_logs['days_before_max_date'] = (user_logs['max_date'] - user_logs['date']).apply(lambda x: x.days)
#The .apply(lambda... just converts it from datetime to an integer, for easier comparisons later.
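On a toy frame (hypothetical values), `transform('max')` broadcasts each user's maximum date back to every one of that user's rows, which is what makes the row-wise `days_before_max_date` arithmetic possible. Note that `.dt.days` is an equivalent, vectorized alternative to the `.apply(lambda ...)` conversion used above:

```python
import pandas as pd

toy = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-05', '2017-02-01']),
})

# transform('max') returns a result aligned to the original rows
toy['max_date'] = toy.groupby('msno')['date'].transform('max')
toy['days_before_max_date'] = (toy['max_date'] - toy['date']).dt.days

assert list(toy['days_before_max_date']) == [4, 0, 0]
```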
#Generate each user's first date, last date, and tenure
#The user_logs_features table will also serve as the primary table that the transactions features are later joined to
user_logs_features = (user_logs_gb
.agg({'date':['max', 'min', lambda x: (max(x) - min(x)).days]}) #.days converts to int
.rename(columns={'max': 'max_date', 'min': 'min_date','<lambda>':'listening_tenure'})
)
#Add a 3rd level, used for joining data later
user_logs_features = pd.concat([user_logs_features], axis=1, keys=['date_features'])
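Renaming the `'<lambda>'` column relies on pandas exposing that literal name. In pandas 0.25+, the same result can be expressed more robustly with named aggregation; this is a sketch on toy data, not the notebook's original code:

```python
import pandas as pd

toy = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-05', '2017-02-01']),
})

# Named aggregation: output columns are named explicitly, no rename needed
features = toy.groupby('msno').agg(
    max_date=('date', 'max'),
    min_date=('date', 'min'),
    listening_tenure=('date', lambda x: (x.max() - x.min()).days),
)

assert list(features.columns) == ['max_date', 'min_date', 'listening_tenure']
```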
Let's take a look at our initial users table:
user_logs_features.head()
date_features | |||
---|---|---|---|
date | |||
max_date | min_date | listening_tenure | |
msno | |||
MVODUEUlSocm1sXa+zVGpJazPrRFiD4IzEQk0QCdg4U= | 2017-02-27 | 2015-07-11 | 597 |
o3Dg7baW8dXq7Jq7NzlVrWG4mZNVvqp62oWBDO/ybeE= | 2017-02-07 | 2015-03-10 | 700 |
6ERcO7aqAKvrQ2CAvah79dVC7tJVZSjNti1MBfpNVW4= | 2017-02-17 | 2015-01-01 | 778 |
Xt9VAHNtHuST21tkcZSnGKjwv8vF8/COnsf6z28+fKk= | 2017-02-28 | 2016-09-08 | 173 |
zSgTJqoosTiFF7ZZi1DPTHgxLbnd99IgOEsTIDCcZHc= | 2017-02-13 | 2015-01-01 | 774 |
We now create features examining patterns of usage over the past X days, where X is days_before_max_date, to see what a user has been doing "lately". We apply this rationale to all of the usage columns in the user_logs table, giving us combinations of the following elements of our data:
- The number of songs played in each completion bucket (<25%, 25-50%, 50-75%, 75-98.5%, and 98.5-100% of song length), plus the number of unique songs and total seconds played.
- Activity over the last 1, 7, 14, 31, 90, 180, and 365 days, plus all time, noting that each window is relative to the user's most recent activity.
For each of these combinations, we calculate (using groupby and aggregate) both the sum and mean of each feature. Finally, we also create a single, total count column (number of rows) per window. In total, this generates 120 features, which we then append to the user_logs_features table above.
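The arithmetic behind the 120-feature count can be checked directly: 8 time windows, each contributing a sum and mean for 7 usage columns plus one row count.

```python
windows = [1, 7, 14, 31, 90, 180, 365, 9999]
usage_columns = ['num_unq', 'total_secs', 'num_25', 'num_50',
                 'num_75', 'num_985', 'num_100']

# sum + mean per usage column, plus a single count column per window
features_per_window = len(usage_columns) * 2 + 1
assert len(windows) * features_per_window == 120
```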
#Create Features:
# Total X=(seconds, 100, 985, 75, 50, 25, unique), avg per day of X, maybe median per day of X
# Last day, last 7 days, last 30 days, last 90, 180, 365, total (note last day is relative to user)
for num_days in [1, 7, 14, 31, 90, 180, 365, 9999]:
#Create groupby object for items with x days
ul_gb_xdays = (user_logs.loc[(user_logs['days_before_max_date'] < num_days)]
.groupby(['msno'], sort=False))
#Generate sum and mean (and count, once) for all the user logs stats
past_xdays_by_user = (ul_gb_xdays
.agg({'num_unq':['sum', 'mean', 'count'],
'total_secs':['sum', 'mean'],
'num_25':['sum', 'mean'],
'num_50':['sum', 'mean'],
'num_75':['sum', 'mean'],
'num_985':['sum', 'mean'],
'num_100':['sum', 'mean'],
})
)
#Append level header
past_xdays_by_user = pd.concat([past_xdays_by_user], axis=1, keys=['within_days_' + str(num_days)])
#Join (append) to user_logs_features table
user_logs_features = user_logs_features.join(past_xdays_by_user, how='inner')
Taking a quick look at our table now:
user_logs_features.head()
(Output: a wide table of 123 columns — the three date_features columns (max_date, min_date, listening_tenure) plus the sum/mean aggregates and row counts for each of the eight within_days_X windows — shown for the first five users; truncated here for readability.)
Good, we get the expected number of columns.
5. Transaction Data: Preparation and Feature Extraction
Having completed feature extraction for user logs, we now move on to creating features for the transaction data.
We begin by grouping the data by user.
# Grouping by the member (msno)
transactions_gb = transactions.sort_values(["transaction_date"]).groupby(['msno'])
# How many groups i.e. members i.e. msno's. We're good if this is the same number as the members table
print('%d Groups/msnos' %(len(transactions_gb.groups)))
print('%d Features' %(len(transactions.columns)))
99825 Groups/msnos
9 Features
We plan to create the following features from the transactions table:
- Simple features from the latest transaction
  - Plan number of days
  - Plan total amount paid
  - Plan list price
  - Is_auto_renew
  - Is_cancel
- Synthetic features from the latest transaction
  - Plan actual amount paid per day
- Aggregate values
  - Total number of plan days
  - Total of all the amounts paid for the plan
- Comparing transactions
  - Plan day difference between the latest and previous transaction
  - Amount paid per day difference between the latest and previous transaction
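The latest-versus-previous comparison can be sketched on toy data (hypothetical values) using the same sort/tail/head pattern the notebook applies below; for a user with a single transaction, "previous" falls back to that same transaction, so the difference is zero:

```python
import pandas as pd

toy = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b'],
    'transaction_date': [20170101, 20170201, 20170115, 20170215],
    'payment_plan_days': [30, 30, 30, 90],
})

gb = toy.sort_values('transaction_date').groupby('msno')
# Last transaction per user
latest = gb.tail(1).set_index('msno')
# Second-to-last transaction per user (or the only one, if just one exists)
previous = gb.tail(2).groupby('msno').head(1).set_index('msno')

diff = latest['payment_plan_days'] - previous['payment_plan_days']
assert diff['b'] == 60 and diff['a'] == 0
```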
We begin by creating the total_plan_days and total_amount_paid:
# Features: Total_plan_days, Total_amount_paid
transactions_features = (transactions_gb
.agg({'payment_plan_days':'sum', 'actual_amount_paid':'sum' })
.rename(columns={'payment_plan_days': 'total_plan_days', 'actual_amount_paid': 'total_amount_paid',})
)
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))
transactions_features.head()
99825 Entries in the DF:
2 Features
total_plan_days | total_amount_paid | |
---|---|---|
msno | ||
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 543 | 2831 |
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= | 90 | 297 |
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= | 513 | 2682 |
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= | 270 | 891 |
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= | 457 | 2235 |
Next, we add amount_paid_per_day for a user's entire tenure:
# Plan actual amount paid/day for all the transactions by a user
# Adding the column amount_paid_per_day
transactions_features['amount_paid_per_day'] = (transactions_features['total_amount_paid']
/transactions_features['total_plan_days'])
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))
transactions_features.head()
99825 Entries in the DF:
3 Features
total_plan_days | total_amount_paid | amount_paid_per_day | |
---|---|---|---|
msno | |||
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 543 | 2831 | 5.213628 |
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= | 90 | 297 | 3.300000 |
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= | 513 | 2682 | 5.228070 |
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= | 270 | 891 | 3.300000 |
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= | 457 | 2235 | 4.890591 |
Next, we add latest_payment_method_id, latest_plan_days, latest_plan_list_price, latest_amount_paid, latest_auto_renew, latest_transaction_date, latest_expire_date, and latest_is_cancel. We accomplish this by picking from the bottom of the ordered (by date) rows in groups.
# Features: latest transaction, renaming the columns
# V1- Fixed the name for plan_list_price column (now called latest_plan_list_price)
latest_transaction = transactions_gb.tail(1).rename(columns={'payment_method_id': 'latest_payment_method_id',
'payment_plan_days': 'latest_plan_days',
'plan_list_price': 'latest_plan_list_price',
'actual_amount_paid': 'latest_amount_paid',
'is_auto_renew': 'latest_auto_renew',
'transaction_date': 'latest_transaction_date',
'membership_expire_date': 'latest_expire_date',
'is_cancel': 'latest_is_cancel' })
# Index by msno
latest_transaction.set_index('msno', inplace = True)
print('%d Entries in the DF: ' %(len(latest_transaction)))
print('%d Features' %(len(latest_transaction.columns)))
latest_transaction.head()
99825 Entries in the DF:
8 Features
latest_payment_method_id | latest_plan_days | latest_plan_list_price | latest_amount_paid | latest_auto_renew | latest_transaction_date | latest_expire_date | latest_is_cancel | |
---|---|---|---|---|---|---|---|---|
msno | ||||||||
z1Lm/BlRQraiaWJ7RaQWe0+l0Z40ACj7W+zk29FiaS4= | 38 | 30 | 149 | 149 | 0 | 20150102 | 20150503 | 0 |
IwE/pih8PuqrY/rsnoZ/4TazDliyH9S8VWNc2/d7mJg= | 38 | 30 | 149 | 149 | 0 | 20150102 | 20150702 | 0 |
ea9rY0uEPY0ImD2QVbYFb+z3zi5wniKWMUM1V8os7OY= | 32 | 410 | 1788 | 1788 | 0 | 20150104 | 20170213 | 0 |
plhzwjmNJp0HW04NidfVa35JE216RaFYpSeUCwT11zQ= | 38 | 30 | 149 | 149 | 0 | 20150120 | 20170103 | 0 |
PbSQ2KxR4gRnzjsRd8Up75qMYb70iuMwGk10/jPRljk= | 38 | 360 | 1200 | 1200 | 0 | 20150123 | 20170212 | 0 |
Next, we add latest_amount_paid_per_day:
# Plan actual amount paid/day for the latest transaction
# Adding the column latest_amount_paid_per_day
latest_transaction['latest_amount_paid_per_day'] = (latest_transaction['latest_amount_paid']
/latest_transaction['latest_plan_days'])
print('%d Entries in the DF: ' %(len(latest_transaction)))
print('%d Features' %(len(latest_transaction.columns)))
latest_transaction.head()
99825 Entries in the DF:
9 Features
latest_payment_method_id | latest_plan_days | latest_plan_list_price | latest_amount_paid | latest_auto_renew | latest_transaction_date | latest_expire_date | latest_is_cancel | latest_amount_paid_per_day | |
---|---|---|---|---|---|---|---|---|---|
msno | |||||||||
z1Lm/BlRQraiaWJ7RaQWe0+l0Z40ACj7W+zk29FiaS4= | 38 | 30 | 149 | 149 | 0 | 20150102 | 20150503 | 0 | 4.966667 |
IwE/pih8PuqrY/rsnoZ/4TazDliyH9S8VWNc2/d7mJg= | 38 | 30 | 149 | 149 | 0 | 20150102 | 20150702 | 0 | 4.966667 |
ea9rY0uEPY0ImD2QVbYFb+z3zi5wniKWMUM1V8os7OY= | 32 | 410 | 1788 | 1788 | 0 | 20150104 | 20170213 | 0 | 4.360976 |
plhzwjmNJp0HW04NidfVa35JE216RaFYpSeUCwT11zQ= | 38 | 30 | 149 | 149 | 0 | 20150120 | 20170103 | 0 | 4.966667 |
PbSQ2KxR4gRnzjsRd8Up75qMYb70iuMwGk10/jPRljk= | 38 | 360 | 1200 | 1200 | 0 | 20150123 | 20170212 | 0 | 3.333333 |
Next, we compare two different items in our transaction data:
- Plan duration difference between the last 2 transactions
- Cost difference between the last 2 transactions
# Getting the 2 latest transactions and grouping by msno again
latest_transaction2_gb = transactions_gb.tail(2).groupby(['msno'])
# Getting the second-to-last transaction
latest2_transaction = latest_transaction2_gb.head(1)
# Index by msno
latest2_transaction.set_index('msno', inplace = True)
# Amount paid per day for the 2nd latest transaction
latest2_transaction['latest2_amount_paid_per_day'] = (latest2_transaction['actual_amount_paid']
/latest2_transaction['payment_plan_days'])
# Difference in the renewal length between the latest 2 transactions
transactions_features['diff_renewal_duration'] = (latest_transaction['latest_plan_days']
- latest2_transaction['payment_plan_days'])
# Difference in plan cost between the latest 2 transactions
transactions_features['diff_plan_amount_paid_per_day'] = (latest_transaction['latest_amount_paid_per_day']
- latest2_transaction['latest2_amount_paid_per_day'])
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))
transactions_features.head()
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
99825 Entries in the DF:
5 Features
total_plan_days | total_amount_paid | amount_paid_per_day | diff_renewal_duration | diff_plan_amount_paid_per_day | |
---|---|---|---|---|---|
msno | |||||
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 543 | 2831 | 5.213628 | 0 | 0.000000 |
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= | 90 | 297 | 3.300000 | 0 | 0.000000 |
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= | 513 | 2682 | 5.228070 | 0 | 0.000000 |
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= | 270 | 891 | 3.300000 | 0 | 0.000000 |
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= | 457 | 2235 | 4.890591 | 23 | 4.966667 |
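The SettingWithCopyWarning above arises because `head()` returns a slice of the grouped frame, and assigning a new column to that slice is ambiguous to pandas. Calling `.copy()` on the slice before adding columns avoids the warning; a sketch on toy data (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({'msno': ['a', 'a', 'b'],
                    'actual_amount_paid': [149, 99, 300],
                    'payment_plan_days': [30, 30, 90]})

# .copy() makes an explicit new frame, so the column assignment is unambiguous
latest2 = toy.groupby('msno').head(1).copy()
latest2['amount_per_day'] = latest2['actual_amount_paid'] / latest2['payment_plan_days']

assert latest2['amount_per_day'].iloc[0] == 149 / 30
```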
Finally, we join all the features into a single data frame:
# Get all transaction features in a single DF
transactions_features = transactions_features.join(latest_transaction, how = 'inner')
# Test
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))
transactions_features.head()
99825 Entries in the DF:
14 Features
total_plan_days | total_amount_paid | amount_paid_per_day | diff_renewal_duration | diff_plan_amount_paid_per_day | latest_payment_method_id | latest_plan_days | latest_plan_list_price | latest_amount_paid | latest_auto_renew | latest_transaction_date | latest_expire_date | latest_is_cancel | latest_amount_paid_per_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
msno | ||||||||||||||
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 543 | 2831 | 5.213628 | 0 | 0.000000 | 39 | 30 | 149 | 149 | 1 | 20170131 | 20170319 | 0 | 4.966667 |
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= | 90 | 297 | 3.300000 | 0 | 0.000000 | 41 | 30 | 99 | 99 | 1 | 20170201 | 20170301 | 0 | 3.300000 |
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= | 513 | 2682 | 5.228070 | 0 | 0.000000 | 37 | 30 | 149 | 149 | 1 | 20170201 | 20170301 | 0 | 4.966667 |
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= | 270 | 891 | 3.300000 | 0 | 0.000000 | 41 | 30 | 99 | 99 | 1 | 20170214 | 20170314 | 0 | 3.300000 |
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= | 457 | 2235 | 4.890591 | 23 | 4.966667 | 41 | 30 | 149 | 149 | 1 | 20160225 | 20160225 | 1 | 4.966667 |
6. Joining Features and Data Manipulation
Joining Features
Having completed the per-user features from the User Logs and Transactions tables, we now join them, together with the Members and Labels (a.k.a. train) tables, into a single data frame for predictive modeling.
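An inner join on `msno` keeps only users present in both frames, so each join below can shrink the user count; a toy illustration (hypothetical values):

```python
import pandas as pd

members = pd.DataFrame({'city': [1, 5]},
                       index=pd.Index(['u1', 'u2'], name='msno'))
labels = pd.DataFrame({'is_churn': [0]},
                      index=pd.Index(['u1'], name='msno'))

# join() aligns on the index; how='inner' drops users missing from either side
joined = members.join(labels, how='inner')
assert list(joined.index) == ['u1']
```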
First, we'll join the Members and Labels together:
#Join members and labels files
df_fa = None
df_fa = members.join(labels, how='inner')
df_fa.head()
city | bd | gender | registered_via | registration_init_time | is_churn | |
---|---|---|---|---|---|---|
msno | ||||||
mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= | 1 | 0 | NaN | 13 | 20170120 | 0 |
AFcKYsrudzim8OFa+fL/c9g5gZabAbhaJnoM0qmlJfo= | 1 | 0 | NaN | 13 | 20160907 | 0 |
qk4mEZUYZq+4sQE7bzRYKc5Pvj+Xc7Wmu25DrCzltEU= | 1 | 0 | NaN | 13 | 20160902 | 0 |
G2UGNLph2J6euGmZ7WIa1+Kc+dPZBJI0HbLPu5YtrZw= | 1 | 0 | NaN | 13 | 20161028 | 0 |
EqSHZpMj5uddJvv2gXcHvuOKFOdS5NN6RalHfzEhhaI= | 1 | 0 | NaN | 13 | 20161004 | 0 |
Next, we join the User Logs features table with the combined Members and the Labels table:
df_fa = df_fa.join(user_logs_features, how='inner')
#Note, the warning is okay, and actually helps us by flattening our column headers.
df_fa.head()
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:558: UserWarning: merging between different levels can give an unintended result (1 levels on the left, 3 on the right)
warnings.warn(msg, UserWarning)
[Output truncated: `df_fa.head()` — 5 rows × 129 columns, indexed by `msno`. Columns cover demographics (`city`, `bd`, `gender`, `registered_via`, `registration_init_time`), the `is_churn` label, listening-date features (`max_date`, `min_date`, `listening_tenure`), and `sum`/`mean`/`count` aggregations of `num_unq`, `total_secs`, `num_25`, `num_50`, `num_75`, `num_985`, and `num_100` over 1-, 7-, 14-, 31-, 90-, 180-, 365-day, and all-time (`within_days_9999`) windows.]
Finally, we'll join in our Transaction features:
# Join in the transaction-level features (inner join on the msno index)
df_fa = df_fa.join(transactions_features, how='inner')
print('%d entries in the DataFrame' % len(df_fa))
print('%d features' % len(df_fa.columns))
df_fa.head()
88544 entries in the DataFrame
143 features
[Output truncated: `df_fa.head()` — 5 rows × 143 columns, indexed by `msno`: the user-log feature columns as before, plus the joined transaction features `total_plan_days`, `total_amount_paid`, `amount_paid_per_day`, `diff_renewal_duration`, `diff_plan_amount_paid_per_day`, `latest_payment_method_id`, `latest_plan_days`, `latest_plan_list_price`, `latest_amount_paid`, `latest_auto_renew`, `latest_transaction_date`, `latest_expire_date`, `latest_is_cancel`, and `latest_amount_paid_per_day`.]
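As a reminder of how `join(..., how='inner')` behaves when aligning on the index, here is a minimal sketch on toy frames (hypothetical names, not project data):

```python
import pandas as pd

# Two small frames sharing an index, standing in for the user-log
# features and the transaction features keyed by msno
left = pd.DataFrame({'a': [1, 2, 3]}, index=['u1', 'u2', 'u3'])
right = pd.DataFrame({'b': [10, 30]}, index=['u1', 'u3'])

# join() aligns on the index; how='inner' keeps only keys present in both
joined = left.join(right, how='inner')
print(joined.index.tolist())  # ['u1', 'u3']
```

Users with usage data but no transactions (or vice versa) are dropped by the inner join, which is why the row count can shrink at this step.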
Data Manipulation
Having joined all the features into a single DataFrame, we will now perform some data-manipulation tasks to prepare the table for predictive modeling.
First, we will flatten the MultiIndex column headers (produced by the windowed aggregations) into single-level names:
# Flatten the MultiIndex column headers into single-string names
df_fa.columns = df_fa.columns.map(''.join)
df_fa.head()
[Output truncated: `df_fa.head()` — the same 5 rows × 143 columns, now with flattened single-level column names such as `date_featuresdatemax_date` and `within_days_7num_unqsum`.]
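For readers unfamiliar with this flattening trick, here is a toy illustration of what `columns.map(''.join)` does to a MultiIndex (hypothetical miniature frame):

```python
import pandas as pd

# Toy frame with a 3-level column MultiIndex, mimicking the
# (window, metric, aggregation) structure from the groupby/agg step
cols = pd.MultiIndex.from_tuples([
    ('within_days_7', 'num_unq', 'sum'),
    ('within_days_7', 'num_unq', 'mean'),
])
df = pd.DataFrame([[10, 2.5]], columns=cols)

# map(''.join) concatenates the levels of each column tuple into one string
df.columns = df.columns.map(''.join)
print(list(df.columns))
# ['within_days_7num_unqsum', 'within_days_7num_unqmean']
```

Using `'_'.join` instead would produce more readable names (`within_days_7_num_unq_sum`), but the concatenated form above matches the columns referenced later in the notebook.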
Next, we will replace infinite and missing ('NA') values with -9999, a sentinel far outside the normal range of these features, so that our algorithms treat those entries as distinctly 'different'.
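On a toy series (illustrative only, not project data), the sentinel substitution works like this:

```python
import numpy as np
import pandas as pd

# A per-day payment ratio can be inf (zero plan days) or NaN (no prior plan)
s = pd.Series([4.3, np.inf, -np.inf, np.nan])

# Map infinities and missing values to a sentinel far outside the data range
s = s.replace([np.inf, -np.inf], -9999).fillna(-9999)
print(s.tolist())  # [4.3, -9999.0, -9999.0, -9999.0]
```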
# Handle bad values: replace infinities (and missing ratios) with the -9999 sentinel
df_fa['amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['latest_amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['diff_plan_amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['diff_plan_amount_paid_per_day'].fillna(-9999, inplace=True)
df_fa.isnull().any()
city False
bd False
gender True
registered_via False
registration_init_time False
is_churn False
date_featuresdatemax_date False
date_featuresdatemin_date False
date_featuresdatelistening_tenure False
within_days_1num_unqsum False
within_days_1num_unqmean False
within_days_1num_unqcount False
within_days_1total_secssum False
within_days_1total_secsmean False
within_days_1num_25sum False
within_days_1num_25mean False
within_days_1num_50sum False
within_days_1num_50mean False
within_days_1num_75sum False
within_days_1num_75mean False
within_days_1num_985sum False
within_days_1num_985mean False
within_days_1num_100sum False
within_days_1num_100mean False
within_days_7num_unqsum False
within_days_7num_unqmean False
within_days_7num_unqcount False
within_days_7total_secssum False
within_days_7total_secsmean False
within_days_7num_25sum False
within_days_7num_25mean False
within_days_7num_50sum False
within_days_7num_50mean False
within_days_7num_75sum False
within_days_7num_75mean False
within_days_7num_985sum False
within_days_7num_985mean False
within_days_7num_100sum False
within_days_7num_100mean False
within_days_14num_unqsum False
within_days_14num_unqmean False
within_days_14num_unqcount False
within_days_14total_secssum False
within_days_14total_secsmean False
within_days_14num_25sum False
within_days_14num_25mean False
within_days_14num_50sum False
within_days_14num_50mean False
within_days_14num_75sum False
within_days_14num_75mean False
within_days_14num_985sum False
within_days_14num_985mean False
within_days_14num_100sum False
within_days_14num_100mean False
within_days_31num_unqsum False
within_days_31num_unqmean False
within_days_31num_unqcount False
within_days_31total_secssum False
within_days_31total_secsmean False
within_days_31num_25sum False
within_days_31num_25mean False
within_days_31num_50sum False
within_days_31num_50mean False
within_days_31num_75sum False
within_days_31num_75mean False
within_days_31num_985sum False
within_days_31num_985mean False
within_days_31num_100sum False
within_days_31num_100mean False
within_days_90num_unqsum False
within_days_90num_unqmean False
within_days_90num_unqcount False
within_days_90total_secssum False
within_days_90total_secsmean False
within_days_90num_25sum False
within_days_90num_25mean False
within_days_90num_50sum False
within_days_90num_50mean False
within_days_90num_75sum False
within_days_90num_75mean False
within_days_90num_985sum False
within_days_90num_985mean False
within_days_90num_100sum False
within_days_90num_100mean False
within_days_180num_unqsum False
within_days_180num_unqmean False
within_days_180num_unqcount False
within_days_180total_secssum False
within_days_180total_secsmean False
within_days_180num_25sum False
within_days_180num_25mean False
within_days_180num_50sum False
within_days_180num_50mean False
within_days_180num_75sum False
within_days_180num_75mean False
within_days_180num_985sum False
within_days_180num_985mean False
within_days_180num_100sum False
within_days_180num_100mean False
within_days_365num_unqsum False
within_days_365num_unqmean False
within_days_365num_unqcount False
within_days_365total_secssum False
within_days_365total_secsmean False
within_days_365num_25sum False
within_days_365num_25mean False
within_days_365num_50sum False
within_days_365num_50mean False
within_days_365num_75sum False
within_days_365num_75mean False
within_days_365num_985sum False
within_days_365num_985mean False
within_days_365num_100sum False
within_days_365num_100mean False
within_days_9999num_unqsum False
within_days_9999num_unqmean False
within_days_9999num_unqcount False
within_days_9999total_secssum False
within_days_9999total_secsmean False
within_days_9999num_25sum False
within_days_9999num_25mean False
within_days_9999num_50sum False
within_days_9999num_50mean False
within_days_9999num_75sum False
within_days_9999num_75mean False
within_days_9999num_985sum False
within_days_9999num_985mean False
within_days_9999num_100sum False
within_days_9999num_100mean False
total_plan_days False
total_amount_paid False
amount_paid_per_day False
diff_renewal_duration False
diff_plan_amount_paid_per_day False
latest_payment_method_id False
latest_plan_days False
latest_plan_list_price False
latest_amount_paid False
latest_auto_renew False
latest_transaction_date False
latest_expire_date False
latest_is_cancel False
latest_amount_paid_per_day False
dtype: bool
The output above verifies that no columns contain null values, with the exception of gender.
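As a self-contained sketch of this sentinel pattern on a toy frame (the column name `paid_per_day` is illustrative, not part of `df_fa`):

```python
import numpy as np
import pandas as pd

# Toy frame with a divide-by-zero inf, a negative inf, and a missing value
toy = pd.DataFrame({'paid_per_day': [4.3, np.inf, -np.inf, np.nan]})

# Replace +/-inf and NaN with the sentinel -9999 so models see them as distinct
toy['paid_per_day'] = toy['paid_per_day'].replace([np.inf, -np.inf], -9999)
toy['paid_per_day'] = toy['paid_per_day'].fillna(-9999)

print(toy['paid_per_day'].tolist())  # [4.3, -9999.0, -9999.0, -9999.0]
print(toy.isnull().any().item())     # False
```

Note that this version assigns back rather than using `inplace=True`; both forms produce the same result, but assignment avoids the deprecation warnings newer pandas versions emit for in-place Series methods.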
Now let's inspect our data types:
df_fa.dtypes
city int64
bd int64
gender object
registered_via int64
registration_init_time int64
is_churn int64
date_featuresdatemax_date datetime64[ns]
date_featuresdatemin_date datetime64[ns]
date_featuresdatelistening_tenure int64
within_days_1num_unqsum int64
within_days_1num_unqmean int64
within_days_1num_unqcount int64
within_days_1total_secssum float64
within_days_1total_secsmean float64
within_days_1num_25sum int64
within_days_1num_25mean int64
within_days_1num_50sum int64
within_days_1num_50mean int64
within_days_1num_75sum int64
within_days_1num_75mean int64
within_days_1num_985sum int64
within_days_1num_985mean int64
within_days_1num_100sum int64
within_days_1num_100mean int64
within_days_7num_unqsum int64
within_days_7num_unqmean float64
within_days_7num_unqcount int64
within_days_7total_secssum float64
within_days_7total_secsmean float64
within_days_7num_25sum int64
within_days_7num_25mean float64
within_days_7num_50sum int64
within_days_7num_50mean float64
within_days_7num_75sum int64
within_days_7num_75mean float64
within_days_7num_985sum int64
within_days_7num_985mean float64
within_days_7num_100sum int64
within_days_7num_100mean float64
within_days_14num_unqsum int64
within_days_14num_unqmean float64
within_days_14num_unqcount int64
within_days_14total_secssum float64
within_days_14total_secsmean float64
within_days_14num_25sum int64
within_days_14num_25mean float64
within_days_14num_50sum int64
within_days_14num_50mean float64
within_days_14num_75sum int64
within_days_14num_75mean float64
within_days_14num_985sum int64
within_days_14num_985mean float64
within_days_14num_100sum int64
within_days_14num_100mean float64
within_days_31num_unqsum int64
within_days_31num_unqmean float64
within_days_31num_unqcount int64
within_days_31total_secssum float64
within_days_31total_secsmean float64
within_days_31num_25sum int64
within_days_31num_25mean float64
within_days_31num_50sum int64
within_days_31num_50mean float64
within_days_31num_75sum int64
within_days_31num_75mean float64
within_days_31num_985sum int64
within_days_31num_985mean float64
within_days_31num_100sum int64
within_days_31num_100mean float64
within_days_90num_unqsum int64
within_days_90num_unqmean float64
within_days_90num_unqcount int64
within_days_90total_secssum float64
within_days_90total_secsmean float64
within_days_90num_25sum int64
within_days_90num_25mean float64
within_days_90num_50sum int64
within_days_90num_50mean float64
within_days_90num_75sum int64
within_days_90num_75mean float64
within_days_90num_985sum int64
within_days_90num_985mean float64
within_days_90num_100sum int64
within_days_90num_100mean float64
within_days_180num_unqsum int64
within_days_180num_unqmean float64
within_days_180num_unqcount int64
within_days_180total_secssum float64
within_days_180total_secsmean float64
within_days_180num_25sum int64
within_days_180num_25mean float64
within_days_180num_50sum int64
within_days_180num_50mean float64
within_days_180num_75sum int64
within_days_180num_75mean float64
within_days_180num_985sum int64
within_days_180num_985mean float64
within_days_180num_100sum int64
within_days_180num_100mean float64
within_days_365num_unqsum int64
within_days_365num_unqmean float64
within_days_365num_unqcount int64
within_days_365total_secssum float64
within_days_365total_secsmean float64
within_days_365num_25sum int64
within_days_365num_25mean float64
within_days_365num_50sum int64
within_days_365num_50mean float64
within_days_365num_75sum int64
within_days_365num_75mean float64
within_days_365num_985sum int64
within_days_365num_985mean float64
within_days_365num_100sum int64
within_days_365num_100mean float64
within_days_9999num_unqsum int64
within_days_9999num_unqmean float64
within_days_9999num_unqcount int64
within_days_9999total_secssum float64
within_days_9999total_secsmean float64
within_days_9999num_25sum int64
within_days_9999num_25mean float64
within_days_9999num_50sum int64
within_days_9999num_50mean float64
within_days_9999num_75sum int64
within_days_9999num_75mean float64
within_days_9999num_985sum int64
within_days_9999num_985mean float64
within_days_9999num_100sum int64
within_days_9999num_100mean float64
total_plan_days int64
total_amount_paid int64
amount_paid_per_day float64
diff_renewal_duration int64
diff_plan_amount_paid_per_day float64
latest_payment_method_id int64
latest_plan_days int64
latest_plan_list_price int64
latest_amount_paid int64
latest_auto_renew int64
latest_transaction_date int64
latest_expire_date int64
latest_is_cancel int64
latest_amount_paid_per_day float64
dtype: object
We see we have a couple of datetime columns in the file. We'll need to address these, since most ML algorithms can't consume datetime values directly. The code below splits each datetime-formatted column into 4 separate integer columns.
def split_date_col(date_col_name):
    """Takes a column of datetime64[ns] values and converts it into 4 columns:
    1) Year integer
    2) Month integer
    3) Day integer
    4) Days since January 1, 2000, as an integer
    It then deletes the original date column.
    Args:
        date_col_name (string): The column name, as a string.
    """
    df_fa[date_col_name + '_year'] = df_fa[date_col_name].dt.year
    df_fa[date_col_name + '_month'] = df_fa[date_col_name].dt.month
    df_fa[date_col_name + '_day'] = df_fa[date_col_name].dt.day
    df_fa[date_col_name + '_absday'] = (df_fa[date_col_name]
                                        - pd.to_datetime('2000-01-01')).dt.days
    df_fa.drop(date_col_name, axis=1, inplace=True)
# Only run this cell once; a second run will fail because the date columns have already been deleted
split_date_col('date_featuresdatemax_date')
split_date_col('date_featuresdatemin_date')
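To illustrate what this decomposition produces, here is a minimal stand-alone version of the same transformation applied to a toy frame (the `max_date` column name is an assumption for illustration, not the project's data):

```python
import pandas as pd

# Toy frame with one datetime column
toy = pd.DataFrame({'max_date': pd.to_datetime(['2016-10-26', '2017-02-04'])})

# Same decomposition as split_date_col: year/month/day plus days since 2000-01-01
toy['max_date_year'] = toy['max_date'].dt.year
toy['max_date_month'] = toy['max_date'].dt.month
toy['max_date_day'] = toy['max_date'].dt.day
toy['max_date_absday'] = (toy['max_date'] - pd.Timestamp('2000-01-01')).dt.days
toy = toy.drop(columns='max_date')

print(toy.iloc[0].tolist())  # [2016, 10, 26, 6143]
```

All four derived columns are plain integers, so the resulting frame can be fed directly to the modeling algorithms.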
Now let's re-check our data types:
df_fa.dtypes
city int64
bd int64
gender object
registered_via int64
registration_init_time int64
is_churn int64
date_featuresdatelistening_tenure int64
within_days_1num_unqsum int64
within_days_1num_unqmean int64
within_days_1num_unqcount int64
within_days_1total_secssum float64
within_days_1total_secsmean float64
within_days_1num_25sum int64
within_days_1num_25mean int64
within_days_1num_50sum int64
within_days_1num_50mean int64
within_days_1num_75sum int64
within_days_1num_75mean int64
within_days_1num_985sum int64
within_days_1num_985mean int64
within_days_1num_100sum int64
within_days_1num_100mean int64
within_days_7num_unqsum int64
within_days_7num_unqmean float64
within_days_7num_unqcount int64
within_days_7total_secssum float64
within_days_7total_secsmean float64
within_days_7num_25sum int64
within_days_7num_25mean float64
within_days_7num_50sum int64
within_days_7num_50mean float64
within_days_7num_75sum int64
within_days_7num_75mean float64
within_days_7num_985sum int64
within_days_7num_985mean float64
within_days_7num_100sum int64
within_days_7num_100mean float64
within_days_14num_unqsum int64
within_days_14num_unqmean float64
within_days_14num_unqcount int64
within_days_14total_secssum float64
within_days_14total_secsmean float64
within_days_14num_25sum int64
within_days_14num_25mean float64
within_days_14num_50sum int64
within_days_14num_50mean float64
within_days_14num_75sum int64
within_days_14num_75mean float64
within_days_14num_985sum int64
within_days_14num_985mean float64
within_days_14num_100sum int64
within_days_14num_100mean float64
within_days_31num_unqsum int64
within_days_31num_unqmean float64
within_days_31num_unqcount int64
within_days_31total_secssum float64
within_days_31total_secsmean float64
within_days_31num_25sum int64
within_days_31num_25mean float64
within_days_31num_50sum int64
within_days_31num_50mean float64
within_days_31num_75sum int64
within_days_31num_75mean float64
within_days_31num_985sum int64
within_days_31num_985mean float64
within_days_31num_100sum int64
within_days_31num_100mean float64
within_days_90num_unqsum int64
within_days_90num_unqmean float64
within_days_90num_unqcount int64
within_days_90total_secssum float64
within_days_90total_secsmean float64
within_days_90num_25sum int64
within_days_90num_25mean float64
within_days_90num_50sum int64
within_days_90num_50mean float64
within_days_90num_75sum int64
within_days_90num_75mean float64
within_days_90num_985sum int64
within_days_90num_985mean float64
within_days_90num_100sum int64
within_days_90num_100mean float64
within_days_180num_unqsum int64
within_days_180num_unqmean float64
within_days_180num_unqcount int64
within_days_180total_secssum float64
within_days_180total_secsmean float64
within_days_180num_25sum int64
within_days_180num_25mean float64
within_days_180num_50sum int64
within_days_180num_50mean float64
within_days_180num_75sum int64
within_days_180num_75mean float64
within_days_180num_985sum int64
within_days_180num_985mean float64
within_days_180num_100sum int64
within_days_180num_100mean float64
within_days_365num_unqsum int64
within_days_365num_unqmean float64
within_days_365num_unqcount int64
within_days_365total_secssum float64
within_days_365total_secsmean float64
within_days_365num_25sum int64
within_days_365num_25mean float64
within_days_365num_50sum int64
within_days_365num_50mean float64
within_days_365num_75sum int64
within_days_365num_75mean float64
within_days_365num_985sum int64
within_days_365num_985mean float64
within_days_365num_100sum int64
within_days_365num_100mean float64
within_days_9999num_unqsum int64
within_days_9999num_unqmean float64
within_days_9999num_unqcount int64
within_days_9999total_secssum float64
within_days_9999total_secsmean float64
within_days_9999num_25sum int64
within_days_9999num_25mean float64
within_days_9999num_50sum int64
within_days_9999num_50mean float64
within_days_9999num_75sum int64
within_days_9999num_75mean float64
within_days_9999num_985sum int64
within_days_9999num_985mean float64
within_days_9999num_100sum int64
within_days_9999num_100mean float64
total_plan_days int64
total_amount_paid int64
amount_paid_per_day float64
diff_renewal_duration int64
diff_plan_amount_paid_per_day float64
latest_payment_method_id int64
latest_plan_days int64
latest_plan_list_price int64
latest_amount_paid int64
latest_auto_renew int64
latest_transaction_date int64
latest_expire_date int64
latest_is_cancel int64
latest_amount_paid_per_day float64
date_featuresdatemax_date_year int64
date_featuresdatemax_date_month int64
date_featuresdatemax_date_day int64
date_featuresdatemax_date_absday int64
date_featuresdatemin_date_year int64
date_featuresdatemin_date_month int64
date_featuresdatemin_date_day int64
date_featuresdatemin_date_absday int64
dtype: object
df_fa.describe(include='all')
city | bd | gender | registered_via | registration_init_time | is_churn | date_featuresdatelistening_tenure | within_days_1num_unqsum | within_days_1num_unqmean | within_days_1num_unqcount | within_days_1total_secssum | within_days_1total_secsmean | within_days_1num_25sum | within_days_1num_25mean | within_days_1num_50sum | within_days_1num_50mean | within_days_1num_75sum | within_days_1num_75mean | within_days_1num_985sum | within_days_1num_985mean | within_days_1num_100sum | within_days_1num_100mean | within_days_7num_unqsum | within_days_7num_unqmean | within_days_7num_unqcount | within_days_7total_secssum | within_days_7total_secsmean | within_days_7num_25sum | within_days_7num_25mean | within_days_7num_50sum | within_days_7num_50mean | within_days_7num_75sum | within_days_7num_75mean | within_days_7num_985sum | within_days_7num_985mean | within_days_7num_100sum | within_days_7num_100mean | within_days_14num_unqsum | within_days_14num_unqmean | within_days_14num_unqcount | within_days_14total_secssum | within_days_14total_secsmean | within_days_14num_25sum | within_days_14num_25mean | within_days_14num_50sum | within_days_14num_50mean | within_days_14num_75sum | within_days_14num_75mean | within_days_14num_985sum | within_days_14num_985mean | within_days_14num_100sum | within_days_14num_100mean | within_days_31num_unqsum | within_days_31num_unqmean | within_days_31num_unqcount | within_days_31total_secssum | within_days_31total_secsmean | within_days_31num_25sum | within_days_31num_25mean | within_days_31num_50sum | within_days_31num_50mean | within_days_31num_75sum | within_days_31num_75mean | within_days_31num_985sum | within_days_31num_985mean | within_days_31num_100sum | within_days_31num_100mean | within_days_90num_unqsum | within_days_90num_unqmean | within_days_90num_unqcount | within_days_90total_secssum | within_days_90total_secsmean | within_days_90num_25sum | within_days_90num_25mean | within_days_90num_50sum | within_days_90num_50mean | 
within_days_90num_75sum | within_days_90num_75mean | within_days_90num_985sum | within_days_90num_985mean | within_days_90num_100sum | within_days_90num_100mean | within_days_180num_unqsum | within_days_180num_unqmean | within_days_180num_unqcount | within_days_180total_secssum | within_days_180total_secsmean | within_days_180num_25sum | within_days_180num_25mean | within_days_180num_50sum | within_days_180num_50mean | within_days_180num_75sum | within_days_180num_75mean | within_days_180num_985sum | within_days_180num_985mean | within_days_180num_100sum | within_days_180num_100mean | within_days_365num_unqsum | within_days_365num_unqmean | within_days_365num_unqcount | within_days_365total_secssum | within_days_365total_secsmean | within_days_365num_25sum | within_days_365num_25mean | within_days_365num_50sum | within_days_365num_50mean | within_days_365num_75sum | within_days_365num_75mean | within_days_365num_985sum | within_days_365num_985mean | within_days_365num_100sum | within_days_365num_100mean | within_days_9999num_unqsum | within_days_9999num_unqmean | within_days_9999num_unqcount | within_days_9999total_secssum | within_days_9999total_secsmean | within_days_9999num_25sum | within_days_9999num_25mean | within_days_9999num_50sum | within_days_9999num_50mean | within_days_9999num_75sum | within_days_9999num_75mean | within_days_9999num_985sum | within_days_9999num_985mean | within_days_9999num_100sum | within_days_9999num_100mean | total_plan_days | total_amount_paid | amount_paid_per_day | diff_renewal_duration | diff_plan_amount_paid_per_day | latest_payment_method_id | latest_plan_days | latest_plan_list_price | latest_amount_paid | latest_auto_renew | latest_transaction_date | latest_expire_date | latest_is_cancel | latest_amount_paid_per_day | date_featuresdatemax_date_year | date_featuresdatemax_date_month | date_featuresdatemax_date_day | date_featuresdatemax_date_absday | date_featuresdatemin_date_year | date_featuresdatemin_date_month | 
date_featuresdatemin_date_day | date_featuresdatemin_date_absday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 88544.000000 | 88544.000000 | 45906 | 88544.000000 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.0 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 
88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 8.854400e+04 | 8.854400e+04 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 | 88544.000000 |
unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | male | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 24390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mean | 6.525038 | 14.980733 | NaN | 6.635018 | 2.013375e+07 | 0.505590 | 489.896266 | 20.698794 | 20.698794 | 1.0 | 5292.578450 | 5292.578450 | 4.751005 | 4.751005 | 1.181175 | 1.181175 | 0.712719 | 0.712719 | 0.790985 | 0.790985 | 20.047660 | 20.047660 | 94.714458 | 21.900082 | 3.630568 | 25398.043300 | 5679.751873 | 20.606806 | 5.048088 | 5.118721 | 1.264542 | 3.193474 | 0.764835 | 3.609866 | 0.839836 | 96.809236 | 21.497445 | 180.072506 | 22.383794 | 6.679696 | 4.851858e+04 | 5818.581592 | 38.873690 | 5.179982 | 9.669735 | 1.298046 | 6.069570 | 0.785710 | 6.831553 | 0.857796 | 185.109313 | 22.017982 | 386.728101 | 23.271521 | 13.902444 | 1.039538e+05 | 6040.184570 | 84.352706 | 5.450459 | 20.850831 | 1.359949 | 13.151473 | 0.825081 | 14.948783 | 0.904241 | 396.595557 | 22.841633 | 1093.917736 | 24.238664 | 37.852435 | -8.333365e+11 | -1.651270e+10 | 241.791143 | 5.797655 | 59.157650 | 1.442505 | 37.497256 | 0.874044 | 42.036321 | 0.943332 | 1121.233093 | 23.773736 | 2090.805633 | 24.740990 | 71.289811 | -1.979174e+12 | -2.291994e+10 | 464.508132 | 5.968142 | 114.264603 | 1.498918 | 72.136554 | 0.902613 | 80.350267 | 0.967418 | 2124.856648 | 24.115101 | 3896.207400 | 25.245055 | 131.113401 | -6.041691e+12 | -3.677399e+10 | 859.351520 | 6.090492 | 212.907018 | 1.541631 | 133.405527 | 0.921047 | 148.187715 | 0.979698 | 3949.711895 | 24.569486 | 6768.144403 | 25.652204 | 222.584850 | -2.994804e+14 | -7.498238e+11 | 1476.962256 | 6.214718 | 367.734708 | 1.582176 | 229.020227 | 0.940062 | 254.208992 | 0.992622 | 6864.914574 | 24.937062 | 437.796169 | 2037.669125 | 4.602800 | 2.989858 | -5.606482 | 37.700296 | 52.462742 | 227.611199 | 227.036185 | 0.639874 | 2.016869e+07 | 2.016998e+07 | 0.174659 | 4.249737 | 2016.881234 | 2.525117 | 21.763507 | 6234.178363 | 2015.427177 | 4.265992 | 11.133335 | 5744.282097 |
std | 6.551445 | 18.431336 | NaN | 2.234529 | 2.896439e+04 | 0.499972 | 271.954761 | 26.816538 | 26.816538 | 0.0 | 7839.883628 | 7839.883628 | 11.209580 | 11.209580 | 3.122491 | 3.122491 | 1.569186 | 1.569186 | 2.056548 | 2.056548 | 31.926074 | 31.926074 | 123.736267 | 22.179267 | 2.019227 | 37825.652389 | 6814.978016 | 40.174639 | 9.170749 | 9.780551 | 2.416056 | 5.525064 | 1.240117 | 8.967416 | 1.630039 | 154.190616 | 27.963725 | 232.545094 | 20.836810 | 4.052415 | 7.152130e+04 | 6539.990675 | 69.286828 | 8.255339 | 16.423555 | 2.159017 | 9.495713 | 1.138329 | 16.007510 | 1.546520 | 291.280077 | 26.898005 | 477.079939 | 19.372295 | 8.663977 | 1.469971e+05 | 6020.766839 | 135.636164 | 7.741246 | 30.899691 | 1.919181 | 18.358084 | 1.029136 | 36.104464 | 1.563621 | 600.717536 | 24.834585 | 1341.237189 | 18.684102 | 25.324540 | 2.191767e+14 | 3.964256e+12 | 382.272707 | 7.515411 | 82.445972 | 1.770876 | 50.426151 | 0.969375 | 89.884588 | 1.397562 | 1663.191691 | 24.005779 | 2583.529737 | 18.290344 | 50.173053 | 2.754958e+14 | 2.817791e+12 | 890.415732 | 7.904848 | 158.377603 | 1.721930 | 96.141760 | 0.912695 | 166.964108 | 1.336536 | 3129.427760 | 23.396972 | 4890.713126 | 17.842956 | 99.124183 | 3.640763e+14 | 2.148897e+12 | 2152.429448 | 8.587733 | 298.543419 | 1.634560 | 179.646957 | 0.882435 | 271.766276 | 1.137067 | 5839.985612 | 22.801391 | 9203.006883 | 17.417302 | 196.804976 | 3.620312e+15 | 9.459203e+12 | 3814.395114 | 8.261136 | 548.144978 | 1.622525 | 327.643138 | 0.872478 | 440.184627 | 1.023119 | 10865.335596 | 22.230638 | 241.813007 | 1234.637258 | 1.085134 | 51.117939 | 237.546015 | 4.484667 | 79.217446 | 332.886694 | 333.154660 | 0.480039 | 3.996365e+03 | 6.095780e+03 | 0.379677 | 47.556868 | 0.372700 | 2.183868 | 8.070249 | 94.115610 | 0.580881 | 3.899945 | 9.493778 | 261.169278 |
min | 1.000000 | -49.000000 | NaN | 3.000000 | 2.004033e+07 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 0.030000 | 0.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.078000 | 0.078000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 7.800000e-02 | 0.078000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.130000e-01 | 0.213000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -6.456360e+16 | -1.132695e+15 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -7.378698e+16 | -6.895979e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -7.378698e+16 | -3.617009e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -3.135946e+17 | -8.384884e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -420.000000 | -9999.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.015010e+07 | 1.970010e+07 | 0.000000 | -9999.000000 | 2015.000000 | 1.000000 | 1.000000 | 5479.000000 | 2015.000000 | 1.000000 | 1.000000 | 5479.000000 |
[Output: the 25%, 50%, 75%, and max rows of the df_fa.describe() summary table across all feature columns; omitted here for readability.]
Next, we convert the gender variable (a string) to dummy encoding:
#Convert gender variable:
dummy = pd.get_dummies(df_fa['gender'])
df_fa = pd.concat([df_fa, dummy], axis=1)
df_fa.drop('gender', axis=1, inplace=True)
"""Note, we're not concerned about collinearity having both a female and a male category,
as there are several cases where both values are 0, presumably because the user did not
supply the information. Thus, the two columns, male and female, capture the 3 cases:
male, female, and 'not supplied'. """
;
''
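This three-case behavior can be checked on a toy Series (a minimal sketch, not part of the notebook's pipeline): `pd.get_dummies` ignores NaN by default, so a user with missing gender gets 0 in both dummy columns.

```python
import numpy as np
import pandas as pd

# A toy gender column with one missing value
gender = pd.Series(['male', 'female', np.nan])
dummies = pd.get_dummies(gender)

print(dummies)
# The row with the missing value is 0 in both the 'female' and the 'male'
# column, so no information is lost by keeping both columns.
```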
A couple more quick inspections:
df_fa.head()
[Output: df_fa.head() — the first 5 rows (indexed by msno) across all feature columns, including the new female and male dummy columns; omitted here for readability.]
The team added a few more features to improve the model by taking the pairwise differences (in days) between the latest transaction date, the latest expiry date, and the latest user-log date in the cell below:
#First transform these into datetime, then into 4 components
df_fa['latest_transaction_date'] = pd_to_date(df_fa['latest_transaction_date'])
df_fa['latest_expire_date'] = pd_to_date(df_fa['latest_expire_date'])
split_date_col('latest_transaction_date')
split_date_col('latest_expire_date')
#Now perform the subtraction of all 3 combinations
df_fa['latest_trans_vs_expire'] = df_fa['latest_transaction_date_absday'] - df_fa['latest_expire_date_absday']
df_fa['latest_trans_vs_log'] = df_fa['latest_transaction_date_absday'] - df_fa['date_featuresdatemax_date_absday']
df_fa['latest_log_vs_expire'] = df_fa['date_featuresdatemax_date_absday'] - df_fa['latest_expire_date_absday']
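The `pd_to_date` and `split_date_col` helpers used above are defined earlier in the notebook. As a sanity check, the same day-level difference can be computed directly with pandas datetimes; this sketch uses the first row's values from the `df_fa.head()` output, and everything else in it is illustrative rather than the notebook's own code:

```python
import pandas as pd

# Values mirror the first row of df_fa.head() (transaction 2017-02-20,
# expiry 2017-03-19); the frame itself is made up for illustration.
df = pd.DataFrame({'latest_transaction_date': [20170220],
                   'latest_expire_date': [20170319]})

trans = pd.to_datetime(df['latest_transaction_date'], format='%Y%m%d')
expire = pd.to_datetime(df['latest_expire_date'], format='%Y%m%d')

# Negative means the latest transaction happened before the expiry date
df['latest_trans_vs_expire'] = (trans - expire).dt.days
print(df['latest_trans_vs_expire'].iloc[0])  # -27
```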
df_fa.head()
[Output: df_fa.head() — the same 5 rows, now including the split latest_transaction_date / latest_expire_date components and the three new difference features (latest_trans_vs_expire is -27 for all five rows shown); omitted here for readability.]
df_fa.shape
(88544, 159)
7. Quick Exploratory Data Analysis
Next, we perform some quick EDA on our data.
Combined Dataset
colors = ['red', 'blue']
plt.figure()
for color, i, name in zip(colors, [0, 1], ['no_churn', 'churn']):
    plt.scatter(df_fa[df_fa['is_churn'] == i]['date_featuresdatelistening_tenure'],
                df_fa[df_fa['is_churn'] == i]['within_days_7num_unqmean'],
                color=color, alpha=0.2, label=name)
plt.legend(loc='best')
plt.xlabel('Listening Tenure')
plt.ylabel('Mean Number of Unique listening Periods in the last 7 days')
Text(0,0.5,'Mean Number of Unique listening Periods in the last 7 days')
Looking at the plot above, we can see that overall, people who churn have low numbers of unique plays, meaning they aren't using the music service. We can also see a spike in the number of unique plays for users with a long tenure (>700 days). Intuitively this makes sense, as users who are committed to the service (i.e., have used it for a long time) may have developed lifestyle patterns where they use the service while driving, working, etc.
avg_price_no_churn = round(df_fa[df_fa['is_churn'] == 0]['amount_paid_per_day'].mean(), 2)
avg_price_is_churn = round(df_fa[df_fa['is_churn'] == 1]['amount_paid_per_day'].mean(), 2)
print('Avg cost/day for no churn: %.2f' %avg_price_no_churn)
print('Avg cost/day for churn: %.2f' %avg_price_is_churn)
Avg cost/day for no churn: 4.54
Avg cost/day for churn: 4.67
We can see that users who churn tend to spend slightly more per day on the music subscription service. Our analysis will end with an economic analysis that uses the model's predictions to estimate how much incentive should be offered to specific users, in the hope of retaining customers who would otherwise churn.
corr = df_fa.iloc[:, 8:99:15].corr()
mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias was removed in newer NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask)
<matplotlib.axes._subplots.AxesSubplot at 0x2261eb32550>
The correlation plot shows correlation among the numbers of unique songs played within the last X days; when the two window lengths are close together, the correlation tends to be higher. This makes sense and matches what we expected.
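This overlap effect is mechanical: nearby windows share most of their days of data. A quick simulation with made-up Poisson listening counts (not KKBOX data) illustrates it:

```python
import numpy as np

rng = np.random.default_rng(0)
daily = rng.poisson(20, size=(1000, 365))  # simulated daily unique-play counts

# Cumulative window sums, analogous to the within_days_X features
w7 = daily[:, :7].sum(axis=1)
w14 = daily[:, :14].sum(axis=1)
w365 = daily.sum(axis=1)

# The 7-day window is half of the 14-day window but only ~2% of the
# 365-day window, so it correlates much more strongly with the nearby one.
print(np.corrcoef(w7, w14)[0, 1])   # ~0.7
print(np.corrcoef(w7, w365)[0, 1])  # ~0.14
```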
mean_col = df_fa.iloc[:, 7:98:15].mean()
mean_col
within_days_1num_unqmean 20.698794
within_days_7num_unqmean 21.900082
within_days_14num_unqmean 22.383794
within_days_31num_unqmean 23.271521
within_days_90num_unqmean 24.238664
within_days_180num_unqmean 24.740990
within_days_365num_unqmean 25.245055
dtype: float64
colors = ['red', 'blue']
for color, i, name in zip(colors, [0, 1], ['no_churn', 'churn']):
    plt.plot_date(pd.to_datetime(df_fa[df_fa['is_churn'] == i]['registration_init_time'], format='%Y%m%d'),
                  df_fa[df_fa['is_churn'] == i]['within_days_7total_secsmean'] / (60 * 60),
                  color=color, alpha=0.2, label=name)
plt.legend(loc = 'best')
plt.xlabel('Registration Date')
plt.ylim([0,20])
plt.ylabel('Avg Hours Listened during last 7 days')
Text(0,0.5,'Avg Hours Listened during last 7 days')
This plot shows registration date vs. average listening time over the past 7 days (total_secs converted to hours). The intent is to change this to transaction date, since that will more accurately reflect when a user ends the service. We can see a trend of more churn occurring in the last 4 years of the dataset, although this could simply reflect a larger user base (i.e., the same proportion of churn).
df_fa['registration_time'] = pd.to_datetime(df_fa['registration_init_time'], format = '%Y%m%d').map(lambda x: x.year)
reg_count = []
thirty_day_churn = []
for year in range(2005, 2018):
    reg_count.append(sum(df_fa['registration_time'] == year))
    thirty_day_churn.append(
        len(df_fa[(df_fa['registration_time'] == year)
                  & (df_fa['date_featuresdatelistening_tenure'] < 30)
                  & (df_fa['is_churn'] == 1)])
        / sum(df_fa['registration_time'] == year))
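The loop above can also be written as a vectorized groupby; this is a sketch on a made-up toy frame whose column names mirror the notebook's:

```python
import pandas as pd

# Toy frame with the columns the loop above uses (values are made up)
df = pd.DataFrame({
    'registration_time':                 [2015, 2015, 2016, 2016, 2016],
    'is_churn':                          [1,    0,    1,    1,    0],
    'date_featuresdatelistening_tenure': [10,   400,  5,    20,   90],
})

# Churned users with less than 30 days of listening tenure
early_churn = (df['is_churn'] == 1) & (df['date_featuresdatelistening_tenure'] < 30)

# Proportion of 30-day churners among each year's registrations
prop = early_churn.groupby(df['registration_time']).mean()
print(prop)
# 2015    0.500000
# 2016    0.666667
```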
plt.bar(range(2005, 2018), reg_count)
plt.xlabel('Year of Registration')
plt.title('Registration Count Per Year')
plt.show()
plt.bar(range(2005, 2018), thirty_day_churn)
plt.xlabel('Year')
plt.title('Proportion of 30 day Churn to Total Registration')
plt.show()
This statistic shows a recently high level of churn among users on a trial subscription or a very short-duration subscription. From a business perspective this is concerning, because it means a very high level of churn is occurring, and the problem needs to be addressed to bring churn levels back down. The sharp spike may date from when the company started offering 30-day subscriptions, or from some other factor that has increased churn among customers with subscriptions shorter than 30 days.
plt.hist(df_fa['date_featuresdatemax_date_month'])
plt.title('Histogram of the last listen date')
plt.show()
The dataset (from the dataset description) was collected to include churn statistics for the months of February and March. We can see here a large peak for the month of February, which intuitively makes sense because users are utilizing the service often and most recently listened in the month in which the data was collected.
print('Is Latest Cancel and Is Churn Correlation: %0.4f' %np.corrcoef(df_fa['latest_is_cancel'], df_fa['is_churn'])[0,1])
Is Latest Cancel and Is Churn Correlation: 0.4328
Intuitively we had expected a very high correlation between these two variables. However, while there is decent correlation between them, they are not perfectly aligned. This is because a user can cancel their service without churning (e.g., by changing plan type).
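A cross-tabulation makes the cancel-without-churn cases visible directly; a sketch on small synthetic 0/1 arrays (the values are invented for illustration, with column names following df_fa):

```python
import numpy as np
import pandas as pd

# Synthetic binary flags: some users cancel without churning (e.g. plan change)
latest_is_cancel = np.array([1, 1, 1, 0, 0, 0, 1, 0])
is_churn         = np.array([1, 1, 0, 0, 1, 0, 0, 0])

r = np.corrcoef(latest_is_cancel, is_churn)[0, 1]
print('correlation: %.4f' % r)   # positive but well below 1

# Rows: is_cancel, columns: is_churn; off-diagonal cells are the mismatches
ct = pd.crosstab(pd.Series(latest_is_cancel, name='is_cancel'),
                 pd.Series(is_churn, name='is_churn'))
print(ct)
```

The (is_cancel=1, is_churn=0) cell counts exactly the plan-change cancellations described above, which is why the correlation stays well short of 1.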
User Logs EDA
user_logs['num_unq'].groupby(user_logs['date'].dt.weekday_name).count().plot(kind = 'bar', title = 'Unique Play Count Grouped by Weekday')
The figure above looks at listening behavior grouped by day of the week. We can see that listening is fairly consistent across all days. The plot counts listens but doesn't capture how long a user listens; the total seconds feature included in the dataset can answer that.
#This cell takes a long time to plot, print statistics to screen instead
user_logs.groupby(user_logs['date'].dt.weekday_name)['total_secs'].sum()
date
Friday -4.334985e+18
Monday -3.892263e+18
Saturday -3.431094e+18
Sunday -3.855369e+18
Thursday -3.919933e+18
Tuesday -3.615562e+18
Wednesday -3.486435e+18
Name: total_secs, dtype: float64
There is some odd behavior here, where all of the sums are large negative values. We need to analyze the column itself in order to determine why this odd behavior is observed.
user_logs['total_secs'].describe()
count 1.971063e+07
mean -1.346260e+12
std 1.116172e+14
min -9.223372e+15
25% 1.966237e+03
50% 4.703210e+03
75% 1.028291e+04
max 9.223372e+15
Name: total_secs, dtype: float64
The mean value for this column is in fact negative, which intuitively doesn't make sense, as it is not possible to listen to a song for a negative time period. Further analysis of how the software calculates total_secs will be important to understand why negative values are being written to the database.
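One defensive option while the root cause is investigated is to mask or clip the impossible values before aggregating; a minimal sketch on hypothetical total_secs values (assuming the negatives are simply corrupted rows):

```python
import pandas as pd

# Hypothetical total_secs values, including impossible negatives
total_secs = pd.Series([3600.0, -9.2e15, 1800.0, -5.0, 7200.0])

# Option 1: treat negatives as missing so sums/means ignore them
cleaned = total_secs.where(total_secs >= 0)   # negatives become NaN
print(cleaned.sum())    # 12600.0

# Option 2: clip at zero if a "no listening" interpretation is acceptable
clipped = total_secs.clip(lower=0)
print(clipped.mean())   # 2520.0
```

Masking (Option 1) is usually safer for means, since clipping to zero still drags averages down for users with many corrupted rows.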
To further explore the total seconds column, below we analyze the sum of total seconds by user.
#This cell takes a long time to plot, print statistics to screen instead
user_logs['total_secs'].groupby(user_logs['msno'], sort = False).sum().describe()
count 8.855000e+04
mean -2.996684e+14
std 3.620704e+15
min -3.135946e+17
25% 2.145492e+05
50% 8.453256e+05
75% 2.223251e+06
max 9.223372e+15
Name: total_secs, dtype: float64
We continue to see negative values, even when summing the total listening time for a specific user. This behavior is strange and not well understood. Further investigation will be necessary to determine why negative listening times are computed.
corr = user_logs.iloc[:, 2:7].corr()
mask = np.zeros_like(corr, dtype=bool) #builtin bool; np.bool is deprecated
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask)
This correlation plot analyzes the correlation between the num columns in the dataset. From the definitions at the start of this notebook, these columns count songs played up to a certain proportion of the total song length (25%, 50%, etc.). Here we can see relatively low correlation between the columns.
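The upper-triangle mask used in the cell above hides the redundant half of the symmetric correlation matrix; a standalone sketch on a random toy frame (column names are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the num columns of user_logs
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list('abcd'))

corr = df.corr()
mask = np.zeros_like(corr, dtype=bool)     # builtin bool (np.bool is deprecated)
mask[np.triu_indices_from(mask)] = True    # True cells are hidden by sns.heatmap
print(mask)
```

Since corr[i, j] == corr[j, i], hiding the upper triangle (including the all-ones diagonal) leaves every distinct pairwise correlation visible exactly once.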
The higher listening buckets exclude songs that fall below the bucket cutoff (i.e., num_50 counts a song only if it is played 25-50% of the way through, not anything under 50%). We found this behavior interesting and wanted to analyze it further by plotting a bar chart of these columns.
sum_num = []
cols = []
for col in user_logs.iloc[:,2:7].columns:
    sum_num.append(user_logs[col].sum())
    cols.append(col)
plt.bar(np.arange(len(sum_num)), sum_num)
plt.xticks(np.arange(len(sum_num)), cols)
plt.title('Sum of Proportion Song Listened To')
We can see that num_100 (listening to the entire song) is by far the most common outcome, with num_25 (less than 25% listened) second. Intuitively this makes sense: a user either likes a song and listens to the whole thing, or dislikes it and quickly switches to a new song.
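The column-wise loop above collapses to a single vectorized sum; a sketch on a toy user_logs frame (row values are invented, column names follow the dataset):

```python
import pandas as pd

# Toy user_logs with the five play-proportion bucket columns
user_logs = pd.DataFrame({
    'num_25': [3, 1], 'num_50': [1, 0], 'num_75': [0, 1],
    'num_985': [0, 0], 'num_100': [10, 7],
})

cols = ['num_25', 'num_50', 'num_75', 'num_985', 'num_100']
sum_num = user_logs[cols].sum()   # one labeled Series, ready for .plot(kind='bar')
print(sum_num.to_dict())
```

Because the result is a labeled Series, the `plt.xticks` bookkeeping in the loop version becomes unnecessary.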
Members EDA
members['gender'].groupby(members['gender']).count().plot(kind = 'bar', title = 'Members grouped by gender')
print('Proportion of gender column = N/A: %0.4f' %(members['gender'].isnull().sum()/len(members)))
Proportion of gender column = N/A: 0.4843
Broken down by gender, we can see that a larger proportion of users in the KKBox dataset are male. Additionally, over 48% of this column is N/A (meaning the user didn't supply their gender). This has been accounted for in the combined dataset by separating gender into two binary columns, 'male' and 'female'.
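The binary male/female encoding described above can be produced with pd.get_dummies, which by default maps missing gender to zeros in both columns; a minimal sketch on invented rows:

```python
import numpy as np
import pandas as pd

# Toy members frame; one user did not supply a gender
members = pd.DataFrame({'gender': ['male', 'female', np.nan, 'male']})

# With the default dummy_na=False, NaN rows get male=0 and female=0
dummies = pd.get_dummies(members['gender'])
print(dummies)
```

Keeping N/A as all-zeros (rather than dropping the rows) preserves the 48% of users without a stated gender for modeling.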
members['city'].groupby(members['city']).count().plot(kind = 'bar', title = 'Count by city')
We can see that city = 1 has a significantly higher proportion of users than any other city. The mapping between numeric codes and actual city names isn't provided; from an ML perspective this is fine, but for business purposes the city names would be preferable.
members['bd'].groupby(members['bd']).count().plot(kind = 'bar', title = 'Count by birthdate')
The birthdate column has a large peak of entries at the 0 value (presumably not provided by the user). Below we filter out the zero value to look at the distribution of birthdays.
members['bd'].groupby(members['bd']).count().plot(kind = 'bar', title = 'Count by birthdate Filtered')
plt.xlim([3,100])
plt.ylim([0,3000])
plt.show()
The distribution of provided birthdays appears slightly skewed. Because the range of entries was hard to determine from the first plot, below we use the describe function on the birthdate column. We see that there are negative values (min = -49), which is odd and needs to be further understood. It would also be helpful to know how this integer is defined (presumably days since a specific date).
members['bd'].describe()
count 89473.000000
mean 14.915069
std 18.416087
min -49.000000
25% 0.000000
50% 17.000000
75% 27.000000
max 1051.000000
Name: bd, dtype: float64
The members table provides information about each member subscribed to the KKBox service. Through our EDA we noted a few columns dominated by a single, disproportionately common value, which could make analysis with a machine learning algorithm difficult, as one dominant value can drown out the information in the rest of the data. By including a large number of features, our model will hopefully overcome this difficulty and accurately predict churn.
Moving forward, it is preferable to better understand the data sources and computations, and to clean up the data collection as much as possible.
Transactions EDA
The transactions table contains information related to the KKBox subscription, including the plan list price, the actual price the user paid, payment methods, and dates. The payment method column contains integers that map to specific payment types; we have not been granted access to the mapping, but from an ML perspective we can still build models. When deriving business insight, however, access to this mapping will be important.
transactions['payment_method_id'].groupby(transactions['payment_method_id']).count().plot(kind = 'bar', title = 'Count by payment type')
From the plot above we can see that payment method 41 is by far the most popular.
transactions['plan_list_price'].groupby(transactions['plan_list_price']).count().plot(kind = 'bar', title = 'Count by plan list price')
Above we analyze the popularity of specific plans on KKBox. From their website it appears that there are free, monthly, and yearly tiers (as of today's date, 4/21/2018); however, we do not know the specific plan offerings during the period of data collection for this analysis.
From the plot above we can see there is an overwhelmingly popular plan price at approximately 149.
transactions['prop_collected'] = round(transactions['actual_amount_paid'] / transactions['plan_list_price'], 3)
transactions['prop_collected'].groupby(transactions['prop_collected']).count().plot(kind = 'bar', title = 'Count by Proportion of Bill Collected')
Two columns in the table capture the plan cost and the amount collected. We wanted to see the distribution of memberships that do not pay the full plan price. From the plot above, most plans are fully paid; there is an infinity value, most likely an error (presumably a plan list price of 0); and there is a peak at 80%.
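The infinity comes from rows where plan_list_price is 0, and can be masked before grouping or plotting; a sketch on invented transactions rows:

```python
import numpy as np
import pandas as pd

# Toy transactions including a zero list price, which produces the infinity
transactions = pd.DataFrame({
    'plan_list_price':    [149, 149, 0, 100],
    'actual_amount_paid': [149, 119, 50, 80],
})

prop = round(transactions['actual_amount_paid'] / transactions['plan_list_price'], 3)
# Mask inf (x/0) before grouping or plotting; 0/0 would already yield NaN
prop_clean = prop.replace([np.inf, -np.inf], np.nan)
print(prop_clean.tolist())  # [1.0, 0.799, nan, 0.8]
```

With the infinities converted to NaN, the groupby-count plot above would simply drop those rows instead of producing a spurious inf bar.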
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'], format = '%Y%m%d') #fixed: pd_to_date is undefined; dates are stored as YYYYMMDD integers
transactions['is_cancel'].groupby(transactions['transaction_date'].dt.month).sum().plot(kind = 'bar', title = 'Is Cancel by Month')
Our team next wanted to look at the is_cancel category, which has some correlation with churn. By month, cancellations peaked in January and especially February. It would be interesting to see how this correlates with churn (since is_cancel isn't perfectly correlated with it). If there is seasonality to churn, it would matter both for prediction and for business insight: users whose memberships expire in this timeframe could be offered incentives to renew.
print('Is Latest Cancel and Is Auto Renew: %0.4f' %np.corrcoef(transactions['is_cancel'], transactions['is_auto_renew'])[0,1])
Is Latest Cancel and Is Auto Renew: 0.0806
Our team had thought the cancel and auto-renew categories would be negatively correlated (a user on an auto-renew membership may pay less attention and forget to cancel or change their membership). From the correlation coefficient we can see that there is almost no relationship between the two features.
8. Writing Output
Having extracted our features and performed some data manipulation, we now write the feature set to a .pkl file, allowing the second notebook to use this output without rerunning all the code above.
#Write all features to pkl file
df_fa.to_pickle('df_fa.pkl')
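A quick round-trip check confirms that the pickle preserves the frame exactly, including dtypes and index; a sketch using a tiny stand-in frame and a temporary file:

```python
import os
import tempfile

import pandas as pd

# Tiny frame standing in for df_fa
df = pd.DataFrame({'is_churn': [0, 1], 'total_secs': [1234.5, 678.9]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'df_fa.pkl')
    df.to_pickle(path)                 # same call as the cell above
    restored = pd.read_pickle(path)    # what notebook 2 does on load

print(restored.equals(df))  # True
```

This dtype preservation is the reason pickle was chosen over CSV here: a CSV round trip would reparse every column and lose the exact types.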
Music Churn: Predictive Modeling Notebook
Python Notebook 2 of 3
W207, Final Project
Spring, 2018
Team: Cameron Kennedy, Gaurav Khanna, Aaron Olson
Overview of Notebooks
For this project, the team created 3 separate Jupyter Notebooks to document its work. See notebook #1, (Data Preparation / Feature Extraction) for a brief description of each notebook.
Table of Contents (this notebook only)
- Setup and Loading Libraries
- Data Preparation
- Predictive Modeling!
- Calculating Probabilities
- Economic Impact
- Final Insights and Takeaways
1. Setup and Loading Libraries
#Import Required Libraries
#Data manipulation and visualization
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
%matplotlib inline
#Models et al
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV #sklearn.grid_search is deprecated since 0.18
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import xgboost
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import (brier_score_loss, precision_score, recall_score, f1_score, log_loss)
#from sklearn.preprocessing import CategoricalEncoder #Not yet released!
#Metrics
from sklearn.metrics import (roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score,
precision_score, confusion_matrix, classification_report)
Now we'll load the data and print the first few rows:
# Load the data
df_fa = pd.read_pickle('df_fa.pkl') #Pickle format preserves file as python object
#Set initial parameter(s)
pd.set_option('display.max_rows', 200)
pd.options.display.max_columns = 2000
#Ensure it's what we expect:
print(df_fa.shape)
df_fa.head()
(88544, 160)
(Output: first five rows of the 160-column feature table, indexed by msno. Columns span member demographics (city, bd, registered_via, male/female), listening aggregates over windows of 1, 7, 14, 31, 90, 180, 365, and all days (sums/means/counts of num_unq, total_secs, and num_25 through num_100), transaction features (total and latest plan days, prices, amounts paid, auto-renew and cancel flags), and derived date features. Wide table omitted here.)
df_fa.describe(include='all')
(Output: describe() summary across all 160 columns. Every column has a count of 88,544; is_churn has mean ≈ 0.5056; and several total_secs aggregates show large negative means, reflecting the corrupted negative values noted in the EDA. Wide table omitted here.)
std | 6.551445 | 18.431336 | 2.234529 | 2.896439e+04 | 0.499972 | 271.954761 | 26.816538 | 26.816538 | 0.0 | 7839.883628 | 7839.883628 | 11.209580 | 11.209580 | 3.122491 | 3.122491 | 1.569186 | 1.569186 | 2.056548 | 2.056548 | 31.926074 | 31.926074 | 123.736267 | 22.179267 | 2.019227 | 37825.652389 | 6814.978016 | 40.174639 | 9.170749 | 9.780551 | 2.416056 | 5.525064 | 1.240117 | 8.967416 | 1.630039 | 154.190616 | 27.963725 | 232.545094 | 20.836810 | 4.052415 | 7.152130e+04 | 6539.990675 | 69.286828 | 8.255339 | 16.423555 | 2.159017 | 9.495713 | 1.138329 | 16.007510 | 1.546520 | 291.280077 | 26.898005 | 477.079939 | 19.372295 | 8.663977 | 1.469971e+05 | 6020.766839 | 135.636164 | 7.741246 | 30.899691 | 1.919181 | 18.358084 | 1.029136 | 36.104464 | 1.563621 | 600.717536 | 24.834585 | 1341.237189 | 18.684102 | 25.324540 | 2.191767e+14 | 3.964256e+12 | 382.272707 | 7.515411 | 82.445972 | 1.770876 | 50.426151 | 0.969375 | 89.884588 | 1.397562 | 1663.191691 | 24.005779 | 2583.529737 | 18.290344 | 50.173053 | 2.754958e+14 | 2.817791e+12 | 890.415732 | 7.904848 | 158.377603 | 1.721930 | 96.141760 | 0.912695 | 166.964108 | 1.336536 | 3129.427760 | 23.396972 | 4890.713126 | 17.842956 | 99.124183 | 3.640763e+14 | 2.148897e+12 | 2152.429448 | 8.587733 | 298.543419 | 1.634560 | 179.646957 | 0.882435 | 271.766276 | 1.137067 | 5839.985612 | 22.801391 | 9203.006883 | 17.417302 | 196.804976 | 3.620312e+15 | 9.459203e+12 | 3814.395114 | 8.261136 | 548.144978 | 1.622525 | 327.643138 | 0.872478 | 440.184627 | 1.023119 | 10865.335596 | 22.230638 | 241.813007 | 1234.637258 | 1.085134 | 51.117939 | 237.546015 | 4.484667 | 79.217446 | 332.886694 | 333.154660 | 0.480039 | 0.379677 | 47.556868 | 0.372700 | 2.183868 | 8.070249 | 94.115610 | 0.580881 | 3.899945 | 9.493778 | 261.169278 | 0.428896 | 0.446746 | 0.421588 | 2.609233 | 8.988002 | 96.625628 | 0.610677 | 0.950015 | 8.729970 | 221.500069 | 232.235329 | 123.588333 | 231.906681 | 2.899407 |
min | 1.000000 | -49.000000 | 3.000000 | 2.004033e+07 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 0.030000 | 0.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.078000 | 0.078000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 7.800000e-02 | 0.078000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.130000e-01 | 0.213000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -6.456360e+16 | -1.132695e+15 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -7.378698e+16 | -6.895979e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -7.378698e+16 | -3.617009e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | -3.135946e+17 | -8.384884e+14 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -420.000000 | -9999.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9999.000000 | 2015.000000 | 1.000000 | 1.000000 | 5479.000000 | 2015.000000 | 1.000000 | 1.000000 | 5479.000000 | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 5480.000000 | 1970.000000 | 1.000000 | 1.000000 | -10957.000000 | -771.000000 | -783.000000 | -820.000000 | 2004.000000 |
25% | 1.000000 | 0.000000 | 4.000000 | 2.012053e+07 | 0.000000 | 248.000000 | 4.000000 | 4.000000 | 1.0 | 821.058500 | 821.058500 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 17.000000 | 7.571429 | 2.000000 | 3718.143000 | 1705.418000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 12.000000 | 5.666667 | 34.000000 | 9.200000 | 3.000000 | 7.838667e+03 | 2116.791375 | 5.000000 | 1.250000 | 1.000000 | 0.307692 | 1.000000 | 0.166667 | 1.000000 | 0.166667 | 27.000000 | 7.200000 | 84.000000 | 11.153846 | 6.000000 | 1.964592e+04 | 2611.727415 | 14.000000 | 1.714286 | 3.000000 | 0.450000 | 2.000000 | 0.294118 | 2.000000 | 0.300000 | 69.000000 | 9.000000 | 233.000000 | 12.710028 | 16.000000 | 5.537270e+04 | 2.998566e+03 | 42.000000 | 2.113636 | 11.000000 | 0.565217 | 7.000000 | 0.382353 | 7.000000 | 0.387097 | 196.000000 | 10.500000 | 427.000000 | 13.490147 | 27.000000 | 1.021806e+05 | 3.180790e+03 | 81.000000 | 2.329249 | 22.000000 | 0.625000 | 14.000000 | 0.426950 | 14.000000 | 0.427574 | 360.000000 | 11.166667 | 737.750000 | 14.224490 | 43.000000 | 1.753236e+05 | 3.351469e+03 | 145.000000 | 2.510204 | 39.000000 | 0.677632 | 25.000000 | 0.454545 | 25.000000 | 0.454545 | 623.000000 | 11.806058 | 1002.000000 | 14.798119 | 56.000000 | 2.145400e+05 | 3.373914e+03 | 205.000000 | 2.651424 | 56.000000 | 0.722222 | 35.000000 | 0.481250 | 35.000000 | 0.478528 | 853.000000 | 12.291667 | 240.000000 | 990.000000 | 3.936907 | 0.000000 | 0.000000 | 36.000000 | 30.000000 | 99.000000 | 99.000000 | 0.000000 | 0.000000 | 3.300000 | 2017.000000 | 2.000000 | 17.000000 | 6247.000000 | 2015.000000 | 1.000000 | 2.000000 | 5481.000000 | 0.000000 | 0.000000 | 2017.000000 | 1.000000 | 9.000000 | 6226.000000 | 2017.000000 | 2.000000 | 8.000000 | 6254.000000 | -31.000000 | -28.000000 | -25.000000 | 2012.000000 |
50% | 4.000000 | 17.000000 | 7.000000 | 2.014073e+07 | 1.000000 | 509.000000 | 11.000000 | 11.000000 | 1.0 | 2595.075000 | 2595.075000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 9.000000 | 53.000000 | 16.000000 | 3.000000 | 12904.532500 | 3780.617429 | 8.000000 | 2.600000 | 2.000000 | 0.666667 | 1.000000 | 0.500000 | 2.000000 | 0.500000 | 47.000000 | 13.750000 | 105.000000 | 17.100000 | 6.000000 | 2.553986e+04 | 4091.579700 | 17.000000 | 3.000000 | 4.000000 | 0.800000 | 3.000000 | 0.500000 | 3.000000 | 0.538462 | 93.000000 | 15.000000 | 238.000000 | 18.777778 | 13.000000 | 5.835679e+04 | 4496.575193 | 42.000000 | 3.500000 | 11.000000 | 0.928571 | 7.000000 | 0.612903 | 8.000000 | 0.636364 | 214.000000 | 16.500000 | 667.000000 | 19.993750 | 36.000000 | 1.652595e+05 | 4.812825e+03 | 123.000000 | 3.935484 | 33.000000 | 1.000000 | 22.000000 | 0.671875 | 22.000000 | 0.692308 | 605.000000 | 17.683439 | 1263.000000 | 20.583333 | 65.000000 | 3.145153e+05 | 4.970085e+03 | 235.000000 | 4.128933 | 63.000000 | 1.070945 | 41.000000 | 0.706897 | 43.000000 | 0.727273 | 1149.000000 | 18.196254 | 2317.000000 | 21.266667 | 115.000000 | 5.750376e+05 | 5.154735e+03 | 431.000000 | 4.326370 | 117.000000 | 1.130952 | 76.000000 | 0.733333 | 78.000000 | 0.750000 | 2102.000000 | 18.867847 | 3584.000000 | 21.936890 | 169.000000 | 8.452459e+05 | 5.233477e+03 | 681.000000 | 4.500000 | 186.000000 | 1.181818 | 119.000000 | 0.757725 | 121.000000 | 0.769231 | 3299.000000 | 19.508859 | 440.000000 | 1788.000000 | 4.898667 | 0.000000 | 0.000000 | 39.000000 | 30.000000 | 149.000000 | 149.000000 | 1.000000 | 0.000000 | 4.966667 | 2017.000000 | 2.000000 | 26.000000 | 6265.000000 | 2015.000000 | 2.000000 | 8.000000 | 5710.000000 | 0.000000 | 0.000000 | 2017.000000 | 2.000000 | 17.000000 | 6245.000000 | 2017.000000 | 3.000000 | 16.000000 | 6271.000000 | -29.000000 | -14.000000 | -11.000000 | 2014.000000 |
75% | 13.000000 | 27.000000 | 9.000000 | 2.016012e+07 | 1.000000 | 776.000000 | 27.000000 | 27.000000 | 1.0 | 6509.136500 | 6509.136500 | 5.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 24.000000 | 24.000000 | 124.000000 | 28.666667 | 5.000000 | 31483.468500 | 7136.028982 | 24.000000 | 6.000000 | 6.000000 | 1.500000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 118.000000 | 26.833333 | 236.000000 | 29.000000 | 10.000000 | 5.991438e+04 | 7206.990196 | 46.000000 | 6.285714 | 12.000000 | 1.571429 | 8.000000 | 1.000000 | 8.000000 | 1.000000 | 225.000000 | 27.090909 | 511.000000 | 29.545455 | 20.000000 | 1.296859e+05 | 7375.110225 | 103.000000 | 6.720556 | 26.000000 | 1.666667 | 17.000000 | 1.052632 | 18.000000 | 1.125000 | 484.000000 | 27.826087 | 1464.000000 | 30.166667 | 56.000000 | 3.737988e+05 | 7.603972e+03 | 295.000000 | 7.100000 | 76.000000 | 1.769231 | 49.000000 | 1.098120 | 52.000000 | 1.166667 | 1394.000000 | 28.634146 | 2802.000000 | 30.577022 | 108.000000 | 7.176428e+05 | 7.747661e+03 | 567.000000 | 7.290698 | 147.000000 | 1.833333 | 95.000000 | 1.128512 | 100.000000 | 1.187500 | 2663.000000 | 29.039801 | 5258.000000 | 31.154067 | 203.000000 | 1.347311e+06 | 7.914708e+03 | 1058.000000 | 7.466156 | 275.000000 | 1.888889 | 175.000000 | 1.149123 | 185.000000 | 1.200000 | 5003.000000 | 29.706403 | 9046.250000 | 31.791510 | 346.000000 | 2.223149e+06 | 7.989009e+03 | 1798.000000 | 7.621627 | 468.000000 | 1.942504 | 297.000000 | 1.167134 | 313.000000 | 1.217949 | 8601.250000 | 30.343668 | 615.000000 | 3193.000000 | 5.164424 | 0.000000 | 0.000000 | 41.000000 | 30.000000 | 149.000000 | 149.000000 | 1.000000 | 0.000000 | 4.966667 | 2017.000000 | 2.000000 | 28.000000 | 6268.000000 | 2016.000000 | 8.000000 | 19.000000 | 5959.000000 | 0.000000 | 1.000000 | 2017.000000 | 2.000000 | 25.000000 | 6258.000000 | 2017.000000 | 3.000000 | 23.000000 | 6285.000000 | -28.000000 | 0.000000 | -1.000000 | 2016.000000 |
max | 22.000000 | 1051.000000 | 13.000000 | 2.017023e+07 | 1.000000 | 789.000000 | 661.000000 | 661.000000 | 1.0 | 437603.468000 | 437603.468000 | 570.000000 | 570.000000 | 195.000000 | 195.000000 | 71.000000 | 71.000000 | 224.000000 | 224.000000 | 1859.000000 | 1859.000000 | 2992.000000 | 748.000000 | 7.000000 | 604487.476000 | 437603.468000 | 3957.000000 | 989.250000 | 354.000000 | 132.500000 | 369.000000 | 65.500000 | 1207.000000 | 172.428571 | 3211.000000 | 1859.000000 | 5290.000000 | 661.000000 | 14.000000 | 1.209334e+06 | 437603.468000 | 5381.000000 | 676.000000 | 613.000000 | 132.500000 | 385.000000 | 65.500000 | 1877.000000 | 156.416667 | 5765.000000 | 1859.000000 | 10086.000000 | 560.333333 | 31.000000 | 2.665440e+06 | 187136.586000 | 10384.000000 | 676.000000 | 1131.000000 | 132.500000 | 585.000000 | 65.500000 | 3930.000000 | 151.826087 | 13493.000000 | 810.000000 | 26703.000000 | 560.333333 | 90.000000 | 7.546893e+06 | 2.181567e+05 | 35029.000000 | 676.000000 | 3017.000000 | 88.333333 | 1336.000000 | 65.500000 | 11151.000000 | 159.300000 | 35188.000000 | 871.250000 | 52598.000000 | 560.333333 | 180.000000 | 1.519852e+07 | 1.782986e+05 | 166936.000000 | 932.603352 | 8311.000000 | 104.333333 | 3287.000000 | 65.500000 | 19803.000000 | 133.446429 | 64190.000000 | 710.333333 | 103640.000000 | 560.333333 | 365.000000 | 2.993833e+07 | 1.549087e+05 | 519451.000000 | 1427.063187 | 19246.000000 | 60.500000 | 8016.000000 | 65.500000 | 26622.000000 | 86.155340 | 140785.000000 | 647.666667 | 234810.000000 | 560.333333 | 790.000000 | 9.223372e+15 | 1.358376e+13 | 911417.000000 | 1298.314815 | 37859.000000 | 60.500000 | 16436.000000 | 65.500000 | 27315.000000 | 58.358824 | 387552.000000 | 647.666667 | 1690.000000 | 5908.000000 | 53.189189 | 450.000000 | 6.000000 | 41.000000 | 450.000000 | 2000.000000 | 2000.000000 | 1.000000 | 1.000000 | 6.000000 | 2017.000000 | 12.000000 | 31.000000 | 6268.000000 | 2017.000000 | 12.000000 | 31.000000 | 6268.000000 | 1.000000 | 
1.000000 | 2017.000000 | 12.000000 | 31.000000 | 6268.000000 | 2017.000000 | 12.000000 | 31.000000 | 6299.000000 | 17207.000000 | 789.000000 | 17225.000000 | 2017.000000 |
2. Data Preparation
Splitting Train, Dev, and Test
First, we need to split the data into our train, dev, and test sets, which we'll do at rates of 60%, 25%, and 15% respectively.
#Split data into a) train, dev, & test, b) data & labels
np.random.seed(5) #Set so that % churn is somewhat consistent
#Train, Dev, Test splits: 60/25/15
train, devtest = train_test_split(df_fa, test_size=0.4)
dev, test = train_test_split(devtest, test_size=15/40)
#Calculate churn percentages
churn_rate_all = df_fa['is_churn'].sum() / df_fa['is_churn'].count()
churn_rate_train = train['is_churn'].sum() / train['is_churn'].count()
churn_rate_dev = dev['is_churn'].sum() / dev['is_churn'].count()
churn_rate_test = test['is_churn'].sum() / test['is_churn'].count()
#Print churn percentages
print('Check churn percentages:')
print(' All data, % churn: {:.1%}'.format(churn_rate_all))
print('Train data, % churn: {:.1%}'.format(churn_rate_train))
print(' Dev data, % churn: {:.1%}'.format(churn_rate_dev))
print(' Test data, % churn: {:.1%}'.format(churn_rate_test))
Check churn percentages:
All data, % churn: 50.6%
Train data, % churn: 50.7%
Dev data, % churn: 50.4%
Test data, % churn: 50.2%
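Relying on the global NumPy seed keeps the churn rates only approximately equal across splits. As an alternative sketch (using hypothetical synthetic data, not the notebook's `df_fa`), `train_test_split` also accepts a `stratify` argument that enforces the class balance exactly:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame standing in for df_fa
df = pd.DataFrame({'is_churn': np.tile([0, 1], 500), 'x': range(1000)})

# 60/25/15 split, stratified so every split has an identical churn rate
train, devtest = train_test_split(df, test_size=0.4, random_state=5,
                                  stratify=df['is_churn'])
dev, test = train_test_split(devtest, test_size=15/40, random_state=5,
                             stratify=devtest['is_churn'])

print(len(train), len(dev), len(test))  # 600 250 150
print(train['is_churn'].mean(), dev['is_churn'].mean(), test['is_churn'].mean())
```

With stratification, no seed-dependent drift between the splits is possible, so the churn-percentage check below becomes a formality.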
The training data is fine at ~50% churn, since the extra positive examples help the models learn. The dev and test sets, however, should reflect the real-world churn rate (~6%), so we downsample their churned users.
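The downsampling below keeps only a fraction of churned users. A minimal sketch of the idea, on a hypothetical balanced frame and with an explicit target rate derived in closed form rather than the notebook's empirically tuned `churn_rate_actual = 0.11`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame (5,000 churners, 5,000 non-churners)
df = pd.DataFrame({'is_churn': np.repeat([0, 1], 5000)})

target = 0.06  # desired real-world churn rate
churn_rate = df['is_churn'].mean()
# Keep a fraction k of churners such that k*churn / (non_churn + k*churn) == target
keep_frac = target / (1 - target) * (1 - churn_rate) / churn_rate

# train_test_split's "test" side serves as the retained subsample
_, churn_sub = train_test_split(df[df.is_churn == 1], test_size=keep_frac,
                                random_state=0)
rebalanced = pd.concat([df[df.is_churn == 0], churn_sub])
print(round(rebalanced['is_churn'].mean(), 3))  # ~0.06
```

For a roughly 50/50 frame this gives a keep fraction near 0.064, which is why the notebook's 0.11-based factor (≈ 0.062) lands close to a 6% churn rate in practice.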
#Reduce dev set to 6% churn
#Keep a subsample of rows where is_churn == 1; append to all rows where is_churn == 0
churn_rate_actual = 0.11 #Empirically, this yields ~6% churn after downsampling
dev_churn_split_factor = (churn_rate_dev * churn_rate_actual) / (1 - churn_rate_actual)
dummy, dev_sub = train_test_split(dev[dev.is_churn==1], test_size=dev_churn_split_factor)
# dev = pd.concat([dev[dev.is_churn==0], dev_sub], ignore_index=True)
# We don't pass ignore_index=True, since we want msno to remain the index
dev = pd.concat([dev[dev.is_churn==0], dev_sub])
# Inspect the result
dev.head()
[Output of `dev.head()`: the first five rows of the downsampled dev set, indexed by `msno`, with the full set of demographic, listening, and transaction feature columns.]
#Reduce test set to 6% churn
test_churn_split_factor = (churn_rate_test * churn_rate_actual) / (1 - churn_rate_actual)
dummy, test_sub = train_test_split(test[test.is_churn==1], test_size=test_churn_split_factor)
# test = pd.concat([test[test.is_churn==0], test_sub], ignore_index=True)
test = pd.concat([test[test.is_churn==0], test_sub])
# Inspect the result
test.head()
[Output of `test.head()`: the first five rows of the downsampled test set, indexed by `msno`, with the same feature columns as the dev set above.]
BLQQH7Gf4iOW4DhlyTBC1YwJgYXEgYkq/0L8VCO/w74= | 15 | 28 | 7 | 20130111 | 0 | 227 | 21 | 21 | 1 | 7849.636 | 7849.636 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 31 | 31 | 95 | 23.750000 | 4 | 28433.210 | 7108.302500 | 0 | 0.000000 | 1 | 0.250000 | 2 | 0.500000 | 4 | 1.000000 | 106 | 26.500000 | 158 | 15.800000 | 10 | 48280.951 | 4828.095100 | 16 | 1.600000 | 7 | 0.700000 | 2 | 0.200000 | 14 | 1.400000 | 173 | 17.300000 | 458 | 17.615385 | 26 | 164517.024 | 6327.577846 | 44 | 1.692308 | 33 | 1.269231 | 13 | 0.500000 | 43 | 1.653846 | 590 | 22.692308 | 1819 | 27.149254 | 67 | 513447.609 | 7663.397149 | 226 | 3.373134 | 139 | 2.074627 | 86 | 1.283582 | 97 | 1.447761 | 1808 | 26.985075 | 4124 | 32.992000 | 125 | 1148673.053 | 9189.384424 | 435 | 3.480000 | 219 | 1.752000 | 147 | 1.176000 | 194 | 1.552000 | 4106 | 32.848000 | 6030 | 37.453416 | 161 | 1588085.229 | 9863.883410 | 780 | 4.844720 | 540 | 3.354037 | 303 | 1.881988 | 273 | 1.695652 | 5519 | 34.279503 | 6030 | 37.453416 | 161 | 1588085.229 | 9863.883410 | 780 | 4.844720 | 540 | 3.354037 | 303 | 1.881988 | 273 | 1.695652 | 5519 | 34.279503 | 277 | 1043 | 3.765343 | 0 | 0.0 | 40 | 30 | 149 | 149 | 1 | 1 | 4.966667 | 2017 | 2 | 28 | 6268 | 2016 | 7 | 16 | 6041 | 0 | 1 | 2017 | 2 | 21 | 6261 | 2017 | 2 | 20 | 6260 | 1 | -7 | 8 | 2013 |
zm8t3xu/h5PxWJZw6A88Dp1lrzIdEPmqQkKVGhsVzpU= | 13 | 47 | 7 | 20161203 | 0 | 43 | 8 | 8 | 1 | 463.075 | 463.075 | 6 | 6 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 8 | 8.000000 | 1 | 463.075 | 463.075000 | 6 | 6.000000 | 0 | 0.000000 | 1 | 1.000000 | 0 | 0.000000 | 1 | 1.000000 | 47 | 7.833333 | 6 | 4970.152 | 828.358667 | 38 | 6.333333 | 4 | 0.666667 | 1 | 0.166667 | 3 | 0.500000 | 15 | 2.500000 | 214 | 11.888889 | 18 | 38819.765 | 2156.653611 | 105 | 5.833333 | 18 | 1.000000 | 7 | 0.388889 | 11 | 0.611111 | 148 | 8.222222 | 464 | 17.185185 | 27 | 105892.311 | 3921.937444 | 286 | 10.592593 | 53 | 1.962963 | 15 | 0.555556 | 23 | 0.851852 | 411 | 15.222222 | 464 | 17.185185 | 27 | 105892.311 | 3921.937444 | 286 | 10.592593 | 53 | 1.962963 | 15 | 0.555556 | 23 | 0.851852 | 411 | 15.222222 | 464 | 17.185185 | 27 | 105892.311 | 3921.937444 | 286 | 10.592593 | 53 | 1.962963 | 15 | 0.555556 | 23 | 0.851852 | 411 | 15.222222 | 464 | 17.185185 | 27 | 105892.311 | 3921.937444 | 286 | 10.592593 | 53 | 1.962963 | 15 | 0.555556 | 23 | 0.851852 | 411 | 15.222222 | 120 | 396 | 3.300000 | 0 | 0.0 | 41 | 30 | 99 | 99 | 1 | 0 | 3.300000 | 2017 | 2 | 27 | 6267 | 2017 | 1 | 15 | 6224 | 1 | 0 | 2017 | 2 | 12 | 6252 | 2017 | 3 | 12 | 6280 | -28 | -15 | -13 | 2016 |
oGtvKgIb+1vvcTTPdZWFyeyoUchFtc+9D+KOfR+DIdg= | 1 | 0 | 7 | 20160106 | 0 | 213 | 11 | 11 | 1 | 502.463 | 502.463 | 8 | 8 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 7.000000 | 2 | 982.412 | 491.206000 | 9 | 4.500000 | 3 | 1.500000 | 0 | 0.000000 | 1 | 0.500000 | 1 | 0.500000 | 25 | 8.333333 | 3 | 2976.693 | 992.231000 | 11 | 3.666667 | 4 | 1.333333 | 0 | 0.000000 | 2 | 0.666667 | 8 | 2.666667 | 26 | 6.500000 | 4 | 2981.738 | 745.434500 | 12 | 3.000000 | 4 | 1.000000 | 0 | 0.000000 | 2 | 0.500000 | 8 | 2.000000 | 196 | 16.333333 | 12 | 47610.576 | 3967.548000 | 15 | 1.250000 | 6 | 0.500000 | 4 | 0.333333 | 4 | 0.333333 | 183 | 15.250000 | 1672 | 23.222222 | 72 | 370688.635 | 5148.453264 | 256 | 3.555556 | 91 | 1.263889 | 44 | 0.611111 | 32 | 0.444444 | 1405 | 19.513889 | 2073 | 20.126214 | 103 | 453956.260 | 4407.342330 | 384 | 3.728155 | 126 | 1.223301 | 74 | 0.718447 | 49 | 0.475728 | 1685 | 16.359223 | 2073 | 20.126214 | 103 | 453956.260 | 4407.342330 | 384 | 3.728155 | 126 | 1.223301 | 74 | 0.718447 | 49 | 0.475728 | 1685 | 16.359223 | 390 | 1287 | 3.300000 | 0 | 0.0 | 41 | 30 | 99 | 99 | 1 | 0 | 3.300000 | 2016 | 8 | 6 | 6062 | 2016 | 1 | 6 | 5849 | 0 | 0 | 2017 | 2 | 5 | 6245 | 2017 | 3 | 5 | 6273 | -28 | 183 | -211 | 2016 |
shHx7K5hJ3W50FoA4BTEQfSyVcuqCidkjCtY21FdTLs= | 22 | 29 | 3 | 20150916 | 0 | 531 | 42 | 42 | 1 | 8183.695 | 8183.695 | 10 | 10 | 1 | 1 | 3 | 3 | 5 | 5 | 24 | 24 | 232 | 33.142857 | 7 | 52512.478 | 7501.782571 | 51 | 7.285714 | 6 | 0.857143 | 8 | 1.142857 | 15 | 2.142857 | 207 | 29.571429 | 505 | 42.083333 | 12 | 121942.448 | 10161.870667 | 64 | 5.333333 | 12 | 1.000000 | 14 | 1.166667 | 19 | 1.583333 | 510 | 42.500000 | 806 | 36.636364 | 22 | 181926.222 | 8269.373727 | 182 | 8.272727 | 22 | 1.000000 | 23 | 1.045455 | 30 | 1.363636 | 724 | 32.909091 | 1418 | 21.164179 | 67 | 406302.004 | 6064.209015 | 318 | 4.746269 | 56 | 0.835821 | 39 | 0.582090 | 60 | 0.895522 | 1548 | 23.104478 | 2369 | 16.451389 | 144 | 734855.804 | 5103.165306 | 468 | 3.250000 | 94 | 0.652778 | 63 | 0.437500 | 80 | 0.555556 | 2760 | 19.166667 | 5487 | 19.052083 | 288 | 1727199.616 | 5997.220889 | 1096 | 3.805556 | 199 | 0.690972 | 157 | 0.545139 | 188 | 0.652778 | 6452 | 22.402778 | 5930 | 17.390029 | 341 | 1887033.172 | 5533.821619 | 1299 | 3.809384 | 228 | 0.668622 | 190 | 0.557185 | 221 | 0.648094 | 6986 | 20.486804 | 540 | 2682 | 4.966667 | 0 | 0.0 | 40 | 30 | 149 | 149 | 1 | 0 | 4.966667 | 2017 | 2 | 28 | 6268 | 2015 | 9 | 16 | 5737 | 0 | 1 | 2017 | 2 | 27 | 6267 | 2017 | 3 | 26 | 6294 | -27 | -1 | -26 | 2015 |
# Split each partition into features and labels
train_labels = train['is_churn']
train_data = train.drop('is_churn', axis=1)
dev_labels = dev['is_churn']
dev_data = dev.drop('is_churn', axis=1)
test_labels = test['is_churn']
test_data = test.drop('is_churn', axis=1)
# Validation
print('\nCheck data sizes:')
print('Train data / labels: ', train_data.shape, train_labels.shape)
print('  Dev data / labels: ', dev_data.shape, dev_labels.shape)
print(' Test data / labels: ', test_data.shape, test_labels.shape)
# Baseline (if we predict all 0's, i.e. no one churns, this is the accuracy we get)
print('\nBaseline Accuracy (dev): {:.2%}'.format(1 - (dev['is_churn'].sum() / dev['is_churn'].count())))
print('Baseline Accuracy (test): {:.2%}'.format(1 - (test['is_churn'].sum() / test['is_churn'].count())))
Check data sizes:
Train data / labels:  (53126, 159) (53126,)
  Dev data / labels:  (11681, 159) (11681,)
 Test data / labels:  (7024, 159) (7024,)
Baseline Accuracy (dev): 94.05%
Baseline Accuracy (test): 94.09%
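The baseline here is simply the majority-class rate: always predicting "no churn" is correct for every user who stays. A minimal sketch with synthetic labels (not the project data):

```python
import pandas as pd

# Synthetic labels with a ~6% churn rate, mirroring the ratio reported above.
labels = pd.Series([1] * 6 + [0] * 94)

# Majority-class ('all zeros') accuracy: the fraction of non-churned users.
baseline = 1 - labels.mean()
print(f'Baseline accuracy: {baseline:.2%}')  # Baseline accuracy: 94.00%
```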
Of note, the overall data has a churn rate of roughly 6% (~6% of users churn, ~94% stay). However, because we want the model to train well on both churned and non-churned users, this data set was assembled with a roughly 50/50 split between churn and non-churn users. We keep this 50/50 balance only in our 'train' data set; for the 'dev' and 'test' sets we reduce the ratio back to the natural 6/94 by removing most of the churn cases after the data is split into train, dev, and test. In our initial models, before the training data was rebalanced to 50/50, our best recall score on the dev data was 78%. After the change, the recall of our best models improved dramatically, to 96%.
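The rebalancing described above can be sketched as keeping every churned user and downsampling the non-churned users to match. This is an illustrative reconstruction, not the notebook's exact code; the function and column names are assumptions:

```python
import pandas as pd

# Hypothetical sketch: keep all churned users, downsample non-churned users
# to the same count, yielding a 50/50 training set.
def balance_churn(df, label_col='is_churn', seed=0):
    churned = df[df[label_col] == 1]
    stayed = df[df[label_col] == 0].sample(n=len(churned), random_state=seed)
    # Shuffle so churn / non-churn rows are interleaved
    return pd.concat([churned, stayed]).sample(frac=1, random_state=seed)

# Synthetic demo with a 6/94 churn ratio
demo = pd.DataFrame({'is_churn': [1] * 6 + [0] * 94, 'x': range(100)})
balanced = balance_churn(demo)
print(balanced['is_churn'].mean())  # 0.5
```

Applying this only to the training partition (after the train/dev/test split) avoids leaking the artificial class balance into evaluation.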
Having split our data, we perform some quick inspections:
dev_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11681 entries, x5+AtzOZKAtnAJBsCIdAyiRl+1p9nIvAYchIkS4zaS4= to IDkK5VQYefRBzy2GAJgs2ChDorWoKcIBPrnBQGOimbA=
Columns: 159 entries, city to registration_time
dtypes: float64(61), int64(96), uint8(2)
memory usage: 14.1+ MB
dev_data.isnull().sum(axis=0)
city 0
bd 0
registered_via 0
registration_init_time 0
date_featuresdatelistening_tenure 0
within_days_1num_unqsum 0
within_days_1num_unqmean 0
within_days_1num_unqcount 0
within_days_1total_secssum 0
within_days_1total_secsmean 0
within_days_1num_25sum 0
within_days_1num_25mean 0
within_days_1num_50sum 0
within_days_1num_50mean 0
within_days_1num_75sum 0
within_days_1num_75mean 0
within_days_1num_985sum 0
within_days_1num_985mean 0
within_days_1num_100sum 0
within_days_1num_100mean 0
within_days_7num_unqsum 0
within_days_7num_unqmean 0
within_days_7num_unqcount 0
within_days_7total_secssum 0
within_days_7total_secsmean 0
within_days_7num_25sum 0
within_days_7num_25mean 0
within_days_7num_50sum 0
within_days_7num_50mean 0
within_days_7num_75sum 0
within_days_7num_75mean 0
within_days_7num_985sum 0
within_days_7num_985mean 0
within_days_7num_100sum 0
within_days_7num_100mean 0
within_days_14num_unqsum 0
within_days_14num_unqmean 0
within_days_14num_unqcount 0
within_days_14total_secssum 0
within_days_14total_secsmean 0
within_days_14num_25sum 0
within_days_14num_25mean 0
within_days_14num_50sum 0
within_days_14num_50mean 0
within_days_14num_75sum 0
within_days_14num_75mean 0
within_days_14num_985sum 0
within_days_14num_985mean 0
within_days_14num_100sum 0
within_days_14num_100mean 0
within_days_31num_unqsum 0
within_days_31num_unqmean 0
within_days_31num_unqcount 0
within_days_31total_secssum 0
within_days_31total_secsmean 0
within_days_31num_25sum 0
within_days_31num_25mean 0
within_days_31num_50sum 0
within_days_31num_50mean 0
within_days_31num_75sum 0
within_days_31num_75mean 0
within_days_31num_985sum 0
within_days_31num_985mean 0
within_days_31num_100sum 0
within_days_31num_100mean 0
within_days_90num_unqsum 0
within_days_90num_unqmean 0
within_days_90num_unqcount 0
within_days_90total_secssum 0
within_days_90total_secsmean 0
within_days_90num_25sum 0
within_days_90num_25mean 0
within_days_90num_50sum 0
within_days_90num_50mean 0
within_days_90num_75sum 0
within_days_90num_75mean 0
within_days_90num_985sum 0
within_days_90num_985mean 0
within_days_90num_100sum 0
within_days_90num_100mean 0
within_days_180num_unqsum 0
within_days_180num_unqmean 0
within_days_180num_unqcount 0
within_days_180total_secssum 0
within_days_180total_secsmean 0
within_days_180num_25sum 0
within_days_180num_25mean 0
within_days_180num_50sum 0
within_days_180num_50mean 0
within_days_180num_75sum 0
within_days_180num_75mean 0
within_days_180num_985sum 0
within_days_180num_985mean 0
within_days_180num_100sum 0
within_days_180num_100mean 0
within_days_365num_unqsum 0
within_days_365num_unqmean 0
within_days_365num_unqcount 0
within_days_365total_secssum 0
within_days_365total_secsmean 0
within_days_365num_25sum 0
within_days_365num_25mean 0
within_days_365num_50sum 0
within_days_365num_50mean 0
within_days_365num_75sum 0
within_days_365num_75mean 0
within_days_365num_985sum 0
within_days_365num_985mean 0
within_days_365num_100sum 0
within_days_365num_100mean 0
within_days_9999num_unqsum 0
within_days_9999num_unqmean 0
within_days_9999num_unqcount 0
within_days_9999total_secssum 0
within_days_9999total_secsmean 0
within_days_9999num_25sum 0
within_days_9999num_25mean 0
within_days_9999num_50sum 0
within_days_9999num_50mean 0
within_days_9999num_75sum 0
within_days_9999num_75mean 0
within_days_9999num_985sum 0
within_days_9999num_985mean 0
within_days_9999num_100sum 0
within_days_9999num_100mean 0
total_plan_days 0
total_amount_paid 0
amount_paid_per_day 0
diff_renewal_duration 0
diff_plan_amount_paid_per_day 0
latest_payment_method_id 0
latest_plan_days 0
latest_plan_list_price 0
latest_amount_paid 0
latest_auto_renew 0
latest_is_cancel 0
latest_amount_paid_per_day 0
date_featuresdatemax_date_year 0
date_featuresdatemax_date_month 0
date_featuresdatemax_date_day 0
date_featuresdatemax_date_absday 0
date_featuresdatemin_date_year 0
date_featuresdatemin_date_month 0
date_featuresdatemin_date_day 0
date_featuresdatemin_date_absday 0
female 0
male 0
latest_transaction_date_year 0
latest_transaction_date_month 0
latest_transaction_date_day 0
latest_transaction_date_absday 0
latest_expire_date_year 0
latest_expire_date_month 0
latest_expire_date_day 0
latest_expire_date_absday 0
latest_trans_vs_expire 0
latest_trans_vs_log 0
latest_log_vs_expire 0
registration_time 0
dtype: int64
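The zero null counts above can also be enforced programmatically rather than scanned visually. A minimal sketch of such a guard (the helper is ours, not part of the original notebook):

```python
import pandas as pd
import numpy as np

# Hypothetical guard: raise if any feature column contains missing values.
def assert_no_nulls(df):
    null_counts = df.isnull().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        raise ValueError('Columns with missing values:\n' + offenders.to_string())

clean = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 0.7, 0.9]})
assert_no_nulls(clean)  # passes silently: no NaNs present

dirty = pd.DataFrame({'a': [1, np.nan, 3]})
try:
    assert_no_nulls(dirty)
except ValueError as e:
    print('caught:', e)
```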
dev_data.describe(include='all')
(Output: `describe()` summary statistics -- count, mean, std, min, quartiles, max -- for all 159 feature columns over the 11,681 dev rows. Every column has a full count of 11,681, consistent with the zero null counts above. Of note, several `total_secs` aggregates have extreme negative minima on the order of -9.2e15, flagging suspect values in the raw listening logs. The full 159-column table is too wide to render here.)
| 31.000000 | 6268.000000 | 2017.000000 | 12.000000 | 31.000000 | 6299.000000 | 46.000000 | 782.000000 | 353.000000 | 2017.000000 |
3. Predictive Modeling
With our data in good shape, we move on to building predictive models.
We begin by building a couple of functions to help automate the evaluation of our models:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    Documented here as it is in the source.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized Confusion Matrix")
    else:
        print('Confusion Matrix')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size=20)
    plt.yticks(tick_marks, classes, size=20)
    fmt = '.1%' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), size=20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label', size=20)
    plt.xlabel('Predicted label', size=20)
def summarize_results(classifier, data=dev_data, labels=dev_labels):
    """Function to automate the display of modeling results.
    Args:
        classifier (a sklearn classifier): The classifier to evaluate.
    Kwargs:
        data (dataframe): The data on which to predict labels. Defaults to dev_data.
        labels (dataframe): The correct labels. Defaults to dev_labels.
    Returns:
        None, but prints and plots summary metrics.
    """
    # Print results
    print('Accuracy: {:.2%}'.format(classifier.score(data, labels)))
    print(classification_report(labels, classifier.predict(data)))
    # Plot results
    class_names = [0, 1]
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(labels, classifier.predict(data))
    np.set_printoptions(precision=2)
    # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names,
                          title='Confusion Matrix')
    # Plot normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                          title='Normalized Confusion Matrix')
    plt.show()
Model Evaluation
We place an emphasis on recall as our primary metric, more so than accuracy. Our thinking here is that accuracy has a 'baseline' of 94% (predicting all 0's, i.e., no users churn), which makes our current best accuracy of ~98% much less impressive than it sounds. Moreover, we're okay with some false positives but would prefer to minimize false negatives. In other words, we'd rather predict that a few customers are likely to churn when they would actually stay (false positives) than predict that customers will stay who actually churn (false negatives). This presumes that the long-term cost of keeping customers (for example, the cost of offering discounts) is less than the long-term loss associated with losing customers. Admittedly, more domain knowledge would be required to validate this assumption, but we consider that validation beyond the scope of the project.
In summary, though we calculate several evaluation metrics below, recall is our primary scoring metric, so long as we have a reasonably low False Positive rate.
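To make the baseline point concrete, here is a small illustration (hypothetical all-zeros classifier, using the class balance of our dev split) of how a model that predicts no churn at all still achieves ~94% accuracy while catching zero churners:

```python
# Hypothetical all-zeros classifier on a dev-sized split
# (10,986 users who stay, 695 who churn).
tn, fp = 10986, 0   # every staying user predicted as "no churn"
fn, tp = 695, 0     # every churner also predicted as "no churn"

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall = tp / (tp + fn)   # fraction of churners actually caught

print(f"accuracy: {accuracy:.1%}, recall: {recall:.1%}")
```

Accuracy comes out around 94% with a recall of exactly zero, which is why recall is the more informative metric here.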
Poorly Performing Classifiers
We initially tried a few different models:
- Gaussian Naive Bayes
- K-Nearest Neighbors
- Support Vector Machines
None of these produced promising results, as the output of the cell below shows.
Note that, though not shown here, the team explored several tuning options with these classifiers, but none of them performed as well as the classifiers further down.
### NB Attempt ###
clf_NB_Gauss = GaussianNB()
clf_NB_Gauss.fit(train_data, train_labels)
print('NAIVE BAYES CLASSIFIER')
summarize_results(clf_NB_Gauss)

### KNN Attempt ###
print('KNN CLASSIFIER')
clf_neigh = KNeighborsClassifier(n_neighbors=10, n_jobs=8)  # Accuracy plateaus around n=10, all 0's
clf_neigh.fit(train_data, train_labels)
summarize_results(clf_neigh)

### SVM Attempt ###
print('SVM CLASSIFIER')
clf_SVM = svm.SVC(kernel='rbf', C=1, max_iter=640, probability=True)  # max_iter=635 gives 6% accuracy ... need new approach / tuning
clf_SVM.fit(train_data, train_labels)
summarize_results(clf_SVM)
NAIVE BAYES CLASSIFIER
Accuracy: 93.97%
precision recall f1-score support
0 0.94 1.00 0.97 10986
1 0.09 0.00 0.00 695
avg / total 0.89 0.94 0.91 11681
Confusion Matrix
[[10976 10]
[ 694 1]]
Normalized Confusion Matrix
[[9.99e-01 9.10e-04]
[9.99e-01 1.44e-03]]
KNN CLASSIFIER
Accuracy: 67.01%
precision recall f1-score support
0 0.95 0.68 0.80 10986
1 0.08 0.45 0.14 695
avg / total 0.90 0.67 0.76 11681
Confusion Matrix
[[7512 3474]
[ 379 316]]
Normalized Confusion Matrix
[[0.68 0.32]
[0.55 0.45]]
SVM CLASSIFIER
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\svm\base.py:218: ConvergenceWarning: Solver terminated early (max_iter=640). Consider pre-processing your data with StandardScaler or MinMaxScaler.
% self.max_iter, ConvergenceWarning)
Accuracy: 6.06%
precision recall f1-score support
0 1.00 0.00 0.00 10986
1 0.06 1.00 0.11 695
avg / total 0.94 0.06 0.01 11681
Confusion Matrix
[[ 13 10973]
[ 0 695]]
Normalized Confusion Matrix
[[0. 1.]
[0. 1.]]
Random Forest Classifier
Having had little success with the classifiers above, we next tried a random forest, which performed very well:
### Random Forest Attempt ###
clf_RF = RandomForestClassifier(n_jobs=8, n_estimators=23)
clf_RF.fit(train_data, train_labels)
summarize_results(clf_RF)
print(clf_RF.get_params())
Accuracy: 97.40%
precision recall f1-score support
0 1.00 0.97 0.99 10986
1 0.70 0.98 0.82 695
avg / total 0.98 0.97 0.98 11681
Confusion Matrix
[[10696 290]
[ 14 681]]
Normalized Confusion Matrix
[[0.97 0.03]
[0.02 0.98]]
{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 23, 'n_jobs': 8, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Random Forest with GridSearchCV
We next tried Random Forest with GridSearchCV to further tune our parameters:
# RF Classifier with Grid Search
tuned_parameters = [{'n_estimators': [150],
                     'max_features': [20],
                     'min_samples_leaf': [2],
                     }]
clf_GS_RF = GridSearchCV(RandomForestClassifier(n_jobs=8),
                         tuned_parameters,
                         #cv=4,
                         scoring='recall')
clf_GS_RF.fit(train_data, train_labels)
pprint(clf_GS_RF.grid_scores_)
pprint(clf_GS_RF.best_estimator_)
pprint(clf_GS_RF.best_params_)
summarize_results(clf_GS_RF)
[mean: 0.97473, std: 0.00126, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150}]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features=20, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=8,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
{'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150}
Accuracy: 98.13%
precision recall f1-score support
0 1.00 0.97 0.99 10986
1 0.70 0.98 0.82 695
avg / total 0.98 0.97 0.98 11681
Confusion Matrix
[[10699 287]
[ 13 682]]
Normalized Confusion Matrix
[[0.97 0.03]
[0.02 0.98]]
The best parameters turned out to be max_features = 20, min_samples_leaf = 2, and n_estimators = 150, which produced a recall of 0.96890.
Of these parameters, n_estimators seemed to have the most effect on model performance, but even that effect was fairly small; none of the parameters made much difference (our total range of recall scores was 0.0027, from 0.9662 to 0.9689).
To keep the run time down, we removed the trials that ran multiple tuning parameters. However, the results of those trials are listed below:
Output of 'print(clf_GS_RF.grid_scores_)':
Tuning max_features, min_samples_leaf, and n_estimators
- mean: 0.96619, std: 0.00077, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 40},
- mean: 0.96663, std: 0.00076, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 50},
- mean: 0.96782, std: 0.00066, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 100},
- mean: 0.96660, std: 0.00143, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 40},
- mean: 0.96786, std: 0.00028, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 50},
- mean: 0.96868, std: 0.00055, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 100},
- mean: 0.96816, std: 0.00048, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 40},
- mean: 0.96838, std: 0.00060, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 50},
- mean: 0.96853, std: 0.00101, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 100},
- mean: 0.96704, std: 0.00057, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 40},
- mean: 0.96786, std: 0.00032, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 50},
- mean: 0.96734, std: 0.00089, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 100},
- mean: 0.96704, std: 0.00016, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 40},
- mean: 0.96753, std: 0.00068, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 50},
- mean: 0.96819, std: 0.00080, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 100},
- mean: 0.96838, std: 0.00024, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 40},
- mean: 0.96819, std: 0.00064, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 50},
- mean: 0.96860, std: 0.00048, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 100},
- mean: 0.96868, std: 0.00046, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 40},
- mean: 0.96860, std: 0.00009, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 50},
- mean: 0.96827, std: 0.00027, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 100},
- mean: 0.96834, std: 0.00032, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 40},
- mean: 0.96860, std: 0.00069, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 50},
- mean: 0.96864, std: 0.00059, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 100}]
Further tuning n_estimators
- mean: 0.96842, std: 0.00073, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 100},
- mean: 0.96890, std: 0.00082, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150},
- mean: 0.96838, std: 0.00078, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 200}]
XGBoost Classifier
Having seen strong performance from Random Forest models, we next tried an XGBoost classifier:
#Basic XGB Classifier
clf_XGB = xgboost.XGBClassifier(n_jobs=8)
clf_XGB.fit(train_data, train_labels)
summarize_results(clf_XGB)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Accuracy: 97.11%
precision recall f1-score support
0 1.00 0.97 0.98 10986
1 0.68 0.97 0.80 695
avg / total 0.98 0.97 0.97 11681
Confusion Matrix
[[10667 319]
[ 18 677]]
Normalized Confusion Matrix
[[0.97 0.03]
[0.03 0.97]]
The results of the XGBoost classifier were also very promising. With no tuning, they weren't quite as good as the Random Forest, but they were very close.
XGBoost with GridSearchCV
We next tried XGBoost with GridSearchCV to further tune our parameters:
# XGB Classifier with Grid Search
tuned_parameters = [{'reg_lambda': [0.01],
                     #'learning_rate': [0.01, 0.1, 1],
                     #'max_depth': [3, 5, 7, 9],
                     'max_depth': [5],         # Landed on 5
                     #'min_child_weight': [1, 3, 5],
                     'min_child_weight': [1],  # Landed on 1
                     #'gamma': [i/10.0 for i in range(0, 5)],
                     'gamma': [0.01],          # Landed on 0.01
                     #'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100], # NEXT TRY THIS, BUT NOT WITH COMBO ABOVE
                     'reg_alpha': [0.1],
                     }]
clf_GS_XGB = GridSearchCV(xgboost.XGBClassifier(n_jobs=8),
                          tuned_parameters,
                          #cv=4,
                          scoring='recall')
clf_GS_XGB.fit(train_data, train_labels)
summarize_results(clf_GS_XGB)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Accuracy: 97.84%
precision recall f1-score support
0 1.00 0.98 0.99 10986
1 0.71 0.98 0.83 695
avg / total 0.98 0.98 0.98 11681
Confusion Matrix
[[10713 273]
[ 15 680]]
Normalized Confusion Matrix
[[0.98 0.02]
[0.02 0.98]]
With the code and results shown above, we obtained results that slightly exceeded the Random Forest model.
To keep the run time down, we commented out the parameter options that ran multiple tuning values. However, the results of those trials are as follows:
Output of 'print(clf_GS_XGB.grid_scores_)':
Tuning max depth and min child weight
- mean: 0.96971, std: 0.00072, params: {'max_depth': 3, 'min_child_weight': 1, 'reg_lambda': 0.01}
- mean: 0.96990, std: 0.00064, params: {'max_depth': 3, 'min_child_weight': 3, 'reg_lambda': 0.01}
- mean: 0.96990, std: 0.00061, params: {'max_depth': 3, 'min_child_weight': 5, 'reg_lambda': 0.01}
- mean: 0.97012, std: 0.00023, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_lambda': 0.01}
- mean: 0.97042, std: 0.00014, params: {'max_depth': 5, 'min_child_weight': 3, 'reg_lambda': 0.01}
- mean: 0.96983, std: 0.00055, params: {'max_depth': 5, 'min_child_weight': 5, 'reg_lambda': 0.01}
- mean: 0.96994, std: 0.00048, params: {'max_depth': 7, 'min_child_weight': 1, 'reg_lambda': 0.01}
- mean: 0.97038, std: 0.00125, params: {'max_depth': 7, 'min_child_weight': 3, 'reg_lambda': 0.01}
- mean: 0.97031, std: 0.00026, params: {'max_depth': 7, 'min_child_weight': 5, 'reg_lambda': 0.01}
- mean: 0.97038, std: 0.00096, params: {'max_depth': 9, 'min_child_weight': 1, 'reg_lambda': 0.01}
- mean: 0.97057, std: 0.00115, params: {'max_depth': 9, 'min_child_weight': 3, 'reg_lambda': 0.01}
- mean: 0.97038, std: 0.00024, params: {'max_depth': 9, 'min_child_weight': 5, 'reg_lambda': 0.01}
Tuning reg_alpha (with optimal values from above)
- mean: 0.97005, std: 0.00018, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 1e-05, 'reg_lambda': 0.01}
- mean: 0.97012, std: 0.00052, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.01, 'reg_lambda': 0.01}
- mean: 0.97016, std: 0.00016, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
- mean: 0.97009, std: 0.00045, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 1, 'reg_lambda': 0.01}
- mean: 0.96738, std: 0.00057, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 100, 'reg_lambda': 0.01}
Tuning gamma (with optimal values from above)
- mean: 0.97016, std: 0.00016, params: {'gamma': 0.0, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
- mean: 0.97020, std: 0.00029, params: {'gamma': 0.1, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
- mean: 0.97001, std: 0.00028, params: {'gamma': 0.2, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
- mean: 0.97016, std: 0.00083, params: {'gamma': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
- mean: 0.97012, std: 0.00056, params: {'gamma': 0.4, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
print(clf_GS_XGB.score(dev_data, dev_labels))
0.9784172661870504
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Final Run on Test Data
The cell below runs our best model on the not-yet-touched test data:
summarize_results(clf_GS_XGB, test_data, test_labels)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
Accuracy: 98.31%
precision recall f1-score support
0 1.00 0.98 0.99 6609
1 0.71 0.98 0.83 415
avg / total 0.98 0.98 0.98 7024
Confusion Matrix
[[6444 165]
[ 7 408]]
Normalized Confusion Matrix
[[0.98 0.02]
[0.02 0.98]]
The test results confirm the same strong findings we saw in the dev data.
Modeling Results
Here are the key points summarizing our predictive modeling findings:
- XGBoost worked the best, slightly outperforming Random Forest. No other model we tried came close to their results.
- Scores:
- We achieved a recall of 97.8% on the dev data and 98.3% on the test data (which was not used to tune any of the models). This corresponds to correctly predicting 680 / 408 of the users who churned and missing only 15 / 7 in the dev / test data, respectively.
- Our false positive rate was 2.5% in both the dev and test data, incorrectly flagging 273 / 165 users who did not actually churn in the dev / test data, respectively. The economic modeling below will give more insight into whether these levels are acceptable, but they seem quite good for now.
- In our initial models, we had not yet performed the 50/50 split of the training data, and our best recall score (on the dev data) was 78%. After making this change, the recall score of our best models improved dramatically, to 96%.
- Additional feature engineering also proved useful. Notably, adding date-interaction features (expiry date, last transaction, and last usage, along with the differences among these dates) reduced false positives in our dev data from 415 to 273, a big improvement.
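The headline figures above can be recomputed directly from the dev and test confusion matrices shown earlier, as a quick sanity check:

```python
# Cell counts taken from the confusion matrices reported above.
dev = {'tn': 10713, 'fp': 273, 'fn': 15, 'tp': 680}
test = {'tn': 6444, 'fp': 165, 'fn': 7, 'tp': 408}

for name, m in [('dev', dev), ('test', test)]:
    recall = m['tp'] / (m['tp'] + m['fn'])   # churners caught
    fpr = m['fp'] / (m['fp'] + m['tn'])      # stayers incorrectly flagged
    print(f"{name}: recall = {recall:.1%}, false positive rate = {fpr:.1%}")
```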
4. Calculating Probabilities
Our model has shown very promising results in terms of both recall and accuracy, meaning we can accurately predict which customers will churn. From a business perspective, however, we would also like to go further and look at the probability of churn, in order to determine how much should be spent to prevent it.
When looking at probability, we want to ensure that it is accurately calibrated (i.e., when the model predicts a 60% probability of churn, 60% of those customers did in fact churn). We can check this by creating a calibration plot.
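Alongside the calibration plots, we quantify calibration with the Brier score: the mean squared difference between predicted probabilities and actual 0/1 outcomes, where lower is better. A toy illustration with invented numbers:

```python
# Toy example (numbers invented, not from our model): five predicted churn
# probabilities and the actual outcomes.
probs = [0.9, 0.8, 0.2, 0.1, 0.6]
labels = [1, 1, 0, 0, 1]

# Brier score: mean squared error between probability and outcome.
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
print(f"Brier score: {brier:.3f}")
```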
When we performed the train/dev/test split, we kept a 50% churn proportion in the train dataset, but a 6% (native) churn proportion in both the dev and test splits. Because we trained on the train dataset, and the proportions differ so much between the train and dev datasets, our calibration is very inaccurate:
def plot_calibration(models, testing_data, testing_labels, title):
    """This function plots a calibration plot using sklearn packages in order to visualize
    how well models predict probabilities. It is based on the work presented
    in the sklearn documentation (http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html).
    Also prints the Brier score for reference.
    Args:
        models: A list of tuples that contain the pre-fit model (with predict_proba as a method)
            as well as a string of the name of the model
        testing_data: Data used for testing the pre-fit model. **Data should not have previously
            been used for training
        testing_labels: Labels for the testing data
        title: String to be used as title for plot
    Returns:
        N/A: Prints plot to screen
    """
    plt.figure(figsize=(9, 9))
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))
    ax1.plot([0, 1], [0, 1], 'k:', label='Perfect Calibration')
    for clf, name in models:
        # Get probabilities for the specific model using the test dataset
        prob_pos = clf.predict_proba(testing_data)[:, 1]
        # Use the sklearn calibration_curve implementation to extract data to plot in the calibration curve
        frac_pos, mean_pred = calibration_curve(testing_labels, prob_pos, n_bins=10)
        ax1.plot(mean_pred, frac_pos, "s-", label='%s' % (name,))
        ax2.hist(prob_pos, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
        # Print the Brier score - used for quantifying calibration success
        print("%s Brier Score: %1.3f" % (name, brier_score_loss(testing_labels, prob_pos)))
    ax1.set_ylabel("Fraction of positives")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax1.set_title(title)
    ax2.set_xlabel("Mean predicted value")
    ax2.set_ylabel("Count")
    ax2.legend(loc="upper center", ncol=2)
    plt.tight_layout()
#Baseline score using models that were previously computed to optimize for recall and accuracy
model = [(clf_NB_Gauss, 'Naive Bayes'),
(clf_neigh, 'Nearest Neighbors'),
(clf_RF, 'Random Forest'),
(clf_SVM, 'Support Vector Machine'),
(clf_XGB, 'XG Boost'),
(clf_GS_XGB, 'XG Boost Optimized')]
title = 'Calibration Plot for Previously Computed Models'
plot_calibration(model, dev_data, dev_labels, title)
Naive Bayes Brier Score: 0.060
Nearest Neighbors Brier Score: 0.243
Random Forest Brier Score: 0.027
Support Vector Machine Brier Score: 0.256
XG Boost Brier Score: 0.025
XG Boost Optimized Brier Score: 0.022
Because of the very different proportions of churn between the train and dev datasets, the calibration curves for all models perform very poorly. We therefore need to fit a calibration model trained on the dev set, which hasn't been used for training and has the correct churn proportion. After training on the dev set, we can assess calibration on the test set, which until now has not been utilized. Due to the limited size of our dataset we train the calibration on the dev set; however, an additional subset of data that hasn't previously been used would be preferable and is recommended for future development.
We can see the effect of the different churn proportions between the train and test datasets in the histogram underneath the calibration curve. We know that 94% of the dev set is labeled 0 (and so should receive a small probability), yet there are peaks around the 50% mark, especially for the SVM model. This distorts the probability computation and leaves the models poorly calibrated.
We start with the built-in sklearn calibration wrapper, which implements Platt's scaling (fitting a logistic regression model) and isotonic regression calibration procedures.
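As a rough sketch of what Platt's scaling does under the hood (shown here on synthetic stand-in scores, not our actual model output): a logistic regression is fit with the uncalibrated score as its only feature, and its predicted probabilities become the calibrated output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: raw "scores" for 500 users with ~30% churn.
rng = np.random.RandomState(0)
labels = rng.binomial(1, 0.3, size=500)
raw_scores = np.clip(labels * 0.7 + rng.normal(0.2, 0.15, size=500), 0, 1)

# Platt's scaling: logistic regression on the raw score alone.
platt = LogisticRegression(solver='lbfgs')
platt.fit(raw_scores.reshape(-1, 1), labels)
calibrated = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

print(calibrated.min(), calibrated.max())
```

sklearn's CalibratedClassifierCV wraps this idea (and the isotonic alternative) around a base classifier, which is what we use below.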
# Fit isotonic and sigmoid calibration to the XG Boost model
clf_isotonic = CalibratedClassifierCV(clf_GS_XGB, cv=2, method='isotonic')
clf_isotonic.fit(dev_data, dev_labels)
clf_sigmoid = CalibratedClassifierCV(clf_GS_XGB, cv=2, method='sigmoid')
clf_sigmoid.fit(dev_data, dev_labels)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
CalibratedClassifierCV(base_estimator=GridSearchCV(cv=None, error_score='raise',
estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jo....01], 'reg_alpha': [0.1]}],
pre_dispatch='2*n_jobs', refit=True, scoring='recall', verbose=0),
cv=2, method='sigmoid')
The sklearn calibration functions utilize the dataset in raw form (rather than being fed predicted probabilities from the prior model). Below, when we implement our own approach, we instead train the calibration models on the predicted probabilities of the prior model (clf_GS_XGB).
model = [(clf_GS_XGB, 'XG Boost'),
(clf_isotonic, 'XGB Isotonic Calibration'),
(clf_sigmoid, 'XGB Sigmoid Calibration')]
title = 'Calibration Plot for Isotonic and Sigmoid Calibration Functions'
plot_calibration(model, test_data, test_labels, title)
XG Boost Brier Score: 0.021
XGB Isotonic Calibration Brier Score: 0.013
XGB Sigmoid Calibration Brier Score: 0.015
We can see that both calibration procedures significantly improved on the uncalibrated XG Boost model (we calibrate on XG Boost because it provided the highest accuracy and recall). The isotonic calibration worked better than the sigmoid.
From the histogram plot we can see that most predictions have a low probability, which makes sense based on the skew in the dev and test datasets (only 6% churn).
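For reference on how the scores above are computed, the Brier score is the mean squared difference between predicted probabilities and observed outcomes (lower is better). A minimal sketch with sklearn; the labels and probabilities below are made-up values, not our model's output:

```python
# Minimal sketch: computing a Brier score for a set of predicted
# churn probabilities (hypothetical values, not our actual model output).
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 0, 1, 0, 1])                 # actual churn labels
y_prob = np.array([0.1, 0.05, 0.2, 0.9, 0.15, 0.6])   # predicted P(churn)

# Mean squared difference between predicted probability and outcome;
# 0 would be a perfectly calibrated, perfectly confident model.
score = brier_score_loss(y_true, y_prob)
print(round(score, 4))
```

A model that always predicted the base churn rate would score noticeably worse, which is why the small differences between 0.011 and 0.021 above are meaningful.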
We then wanted to test other methods of calibration to determine if our own implementation would result in improved calibration.
#Initialize models used for calibration
clf_LR_class = LogisticRegression()
clf_RF_class = RandomForestClassifier()
clf_NB_class = GaussianNB()
clf_SVM_class = svm.SVC(kernel='rbf', probability = True)
#Get probability data from the former model (XG Boost) in order to fit the calibration model
probs_dev = clf_GS_XGB.predict_proba(dev_data)[:, 1]
probs_dev = probs_dev.reshape(-1, 1)
probs_test = clf_GS_XGB.predict_proba(test_data)[:,1]
probs_test = probs_test.reshape(-1,1)
model = [(clf_LR_class, 'Logistic Regression Calibration'),
(clf_RF_class, 'Random Forest Calibration'),
(clf_NB_class, 'Naive Bayes Calibration'),
(clf_SVM_class, 'SVM Calibration')]
#Fit the models used for calibration with the probability output from XG Boost
for clf, name in model:
clf.fit(probs_dev, dev_labels)
title = 'Calibration Plot for Implementation of Calibrated Models using other ML Models'
plot_calibration(model, probs_test, test_labels, title)
Logistic Regression Calibration Brier Score: 0.013
Random Forest Calibration Brier Score: 0.017
Naive Bayes Calibration Brier Score: 0.020
SVM Calibration Brier Score: 0.011
Here we can see that our implementation, using the default hyperparameters for the sklearn models, improved on the baseline (clf_GS_XGB). Additionally, while all of the curves appear visually less calibrated than the earlier isotonic and sigmoid calibrations, the Brier score for the SVM implementation was actually lower (better) than for the isotonic implementation. The dataset here was limited (5,000 samples), and visually the isotonic curve does follow the 45-degree line more closely than SVM. In several places the SVM curve sits very close to the 45-degree line, which may pull down its Brier score on this limited dataset, but its overall performance would likely be worse.
Because the intent of the calibration is to feed an economic model that analyzes churn over a range of probabilities, we will use the isotonic calibration model, as it stays close to the 45-degree line throughout the range of probabilities where we would recommend action to prevent churn.
Future work should involve larger datasets to validate the use of the isotonic calibration or support an alternative method.
We can now use our calibrated probabilities to feed our economic model and give the business insight into the problem of churn.
5. Economic Impact
Economic model to plug into the business plan
To this point we have the following information:
* Users who could churn (from the model)
* Probability of churn (from the calibrated model)
* Spending metrics (from the data)
The next step is to guide the business on worthwhile spending to retain the customers at risk. We'll develop a model for this (marketing) spend and then apply it to our data. The marketing spend could fund loyalty programs, incentives, or other tiered discounting programs; the form the spend takes is out of scope for this report.
For the economic analysis, data has been provided in NTD (New Taiwan Dollars). This will be the currency used throughout this analysis.
Coming up with the economic model for retention
We'll start with some metrics for the customer and the business that are visible from the data or our feature list:
* Optimum lifetime value of a customer is assumed to be the revenue from the highest-paying customer (for the purposes of this report). We calculate 2 metrics for this:
  * Optimum lifetime value - OLTV - max paid/day in our sample
  * Optimum lifetime value (3 years) - OLTV3y
* Lifetime value of a customer is the actual revenue from that customer. We again use 2 metrics:
  * Lifetime value - LTV - actual paid/day by the customer
  * Lifetime value (3 years) - LTV3y
* Average lifetime value of a customer is the average paid by our sample customers. The 2 metrics for this are:
  * Average lifetime value - ALTV - average paid/day
  * Average lifetime value (3 years) - ALTV3y
The average 3-year lifetime value of a customer helps us make an assumption about what can be spent to acquire the customer in the first place (CAC). We assume this value is 10% of the average lifetime value: **CAC = .1 * ALTV3y**
Now we know the value (revenue) of our customers and what it cost to acquire them. Next we work out how much to spend to keep them. CAC is relevant because it establishes a ceiling for our retention/reacquisition cost (RAC) spend. We assert that the reacquisition cost for a customer cannot exceed 75% of the original acquisition cost.
We would not want to spend the reacquisition budget equally on all customers. We'd want to optimize this spend based on the following:
* Value of the customer. LTV/OLTV (lifetime value of a customer divided by the lifetime value of our optimum customer) is a good representation of customer value. It is simplistic, as it leaves out intangibles like the social and support-cost impact of some customers, but the model can easily be extended for those factors.
* Risk of flight. The probability of churn is a good representation of this.
Combining all of the above, we arrive at the following model for reacquisition spending (RAC):
** RAC = .75 * CAC * POC * (LTV/OLTV)**
We'll calculate this value individually for each customer.
An enhancement to our recommendation would be finding RAC clusters to suggest tiers of spending; we have not attempted that in this report.
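To make the formula concrete, here is the calculation for a single hypothetical customer. The figures below are illustrative stand-ins (roughly in line with the sample statistics computed later), not values taken from our dataset:

```python
# Sketch of the RAC formula for one hypothetical customer.
# All four inputs are illustrative assumptions, not dataset values.
cac = 497.51      # customer acquisition cost (NTD), ~10% of ALTV3y
poc = 0.40        # calibrated probability of churn for this customer
ltv = 4.83        # this customer's amount paid per day (NTD)
oltv = 13.45      # best customer's amount paid per day (NTD)

# RAC = .75 * CAC * POC * (LTV / OLTV)
rac = 0.75 * cac * poc * (ltv / oltv)
print(round(rac, 2))
```

Note how the spend scales down both for lower-value customers (small LTV/OLTV) and for customers unlikely to leave (small POC).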
Probabilities from our model
# Get the probabilities for economic model
# We'll use the calibrated model
# Prediction probabilities
predictions_prob = clf_isotonic.predict_proba(dev_data)
# Predictions
predictions = clf_isotonic.predict(dev_data)
# Test
print ('''
Dev data shape: {}
Predictions shape: {}
Sample of predictions: {}
Prediction probabilities shape: {}
Sample of prediction probabilities:
{}
'''.format(dev_data.shape, predictions.shape, predictions[:5], predictions_prob.shape ,predictions_prob[:5])
)
Dev data shape: (11681, 159)
Predictions shape: (11681,)
Sample of predictions: [0 0 0 0 0]
Prediction probabilities shape: (11681, 2)
Sample of prediction probabilities:
[[9.91e-01 8.57e-03]
[9.98e-01 1.70e-03]
[9.99e-01 1.18e-03]
[1.00e+00 0.00e+00]
[1.00e+00 2.04e-04]]
# Starting the marketing data frame
marketing_data = dev_data.copy(deep = True)
# Test
print('''
Marketing_data shape: {}
'''.format(marketing_data.shape)
)
Marketing_data shape: (11681, 159)
# Adding predictions and probability (From the model)
marketing_data['predictions'] = predictions
marketing_data['probability_churn'] = predictions_prob[:,1]
Lifetime value metrics for the customers
# Optimum life time value (per day ) can be represented by the max of "amount paid per day" among our sample
oltv = marketing_data['amount_paid_per_day'].max()
oltv3y = oltv * 365 * 3
print('''
Optimum lifetime value of the customer per day (NTD/day): NT${:,.2f}
Optimum lifetime value of the customer (3 Years) (NTD): NT${:,.2f}
'''.format(oltv, oltv3y)
)
Optimum lifetime value of the customer per day (NTD/day): NT$13.45
Optimum lifetime value of the customer (3 Years) (NTD): NT$14,727.75
# Average lifetime value (ALTV) can be represented by spend/day
# Over 3 years it helps us calculate the cost to acquire the customer (CAC)
# The business plan allows CAC to be 10% of revenue from the customer
altv = marketing_data['amount_paid_per_day'].mean()
altv3y = marketing_data['amount_paid_per_day'].mean() * 365 * 3
cac = .1 * altv3y
print('''
Average lifetime value of the customer per day (NTD/day): NT${:,.4f}
Average lifetime value of the customer: (3 Years) (NTD): NT${:,.4f}
Customer acquisition cost (NTD): NT${:,.4f}
'''.format(altv, altv3y, cac)
)
Average lifetime value of the customer per day (NTD/day): NT$4.5435
Average lifetime value of the customer: (3 Years) (NTD): NT$4,975.1425
Customer acquisition cost (NTD): NT$497.5143
Adding the reacquisition cost suggestion to our model
RAC = .75 * CAC * POC * (LTV/OLTV)
# Reacquisition cost (marketing spend to prevent churn):
# RAC = .75 * CAC * POC * (LTV/OLTV)
print('''
Shape of predictions: {}
Shape of predictions_prob: {}
Shape of marketing_data['amount_paid_per_day']: {}
'''.format(predictions.shape, predictions_prob[:,1].shape, marketing_data['amount_paid_per_day'].shape)
)
marketing_data['proposed_spend'] = ( .75 * cac
* predictions_prob[:,1]
* predictions
* (marketing_data['amount_paid_per_day']/oltv)
)
Shape of predictions: (11681,)
Shape of predictions_prob: (11681,)
Shape of marketing_data['amount_paid_per_day']: (11681,)
# Test
print(marketing_data.shape)
marketing_data.head()
(11681, 162)
[Output of `marketing_data.head()`: the first five rows of the 162-column marketing data frame, indexed by `msno`. It contains the original 159 features (city, bd, registered_via, registration and date features, the `within_days_*` listening aggregates, and the transaction features) plus the three added columns: `predictions`, `probability_churn`, and `proposed_spend`.]
Validation
We work out the following to make sure that the numbers for the reacquisition marketing spend make sense in the context of our business:
* Total marketing spend
* Marketing spend per customer (all customers)
* Marketing spend per customer with probability of churn (indicated as churn by the model)
* Comparison with the earnings from the sample
# Validation
# What is the total proposed marketing spend?
tpms = marketing_data['proposed_spend'].sum()
# Averaging (per user)
tpmsa = tpms/len(predictions)
# Averaging (for users marked for churn)
ch = np.count_nonzero(predictions)
tpmsac = tpms/ch
print('''
Total proposed marketing spend (NTD): NT${:,.0f}
Total number of users in the sample: {:d}
Average Reacquisition/retention marketing spend/user (NTD): NT${:,.4f}
Total number of users projected to churn (any probability): {:d}
Average Reacquisition/retention marketing spend for users marked to churn (NTD): NT${:,.4f}
''' .format(tpms, len(predictions), tpmsa, ch, tpmsac)
)
Total proposed marketing spend (NTD): NT$77,025
Total number of users in the sample: 11681
Average Reacquisition/retention marketing spend/user (NTD): NT$6.5940
Total number of users projected to churn (any probability): 705
Average Reacquisition/retention marketing spend for users marked to churn (NTD): NT$109.2551
# Validation
# How does the marketing spend compare to the projected earnings from this sample
# Ratio of marketing(per user) to topline for this sample
rms = (tpms/len(predictions))/altv3y
print('''
Projected 3 year earnings from users in dev data (per user) (NTD): NT${:,.4f}
Ratio of Reacquisition/Retention marketing to topline from users in dev data: {:.4f}
'''.format(altv3y, rms)
)
Projected 3 year earnings from users in dev data (per user) (NTD): NT$4,975.1425
Ratio of Reacquisition/Retention marketing to topline from users in dev data: 0.0013
Visual for marketing spend vs. amount spent/day by all the customers
# Marketing spend - amount spent/day by the users
# plt.scatter(marketing_data['amount_paid_per_day'], marketing_data['proposed_spend'], s=4,
#c=cm.hot(marketing_data['proposed_spend']))
plt.xlabel('Amount paid per day by a customer (NTD)')
plt.ylabel('Proposed retention marketing spend per customer (NTD)')
plt.title('Retention marketing spend - Amount paid per day by a customer')
plt.scatter(marketing_data['amount_paid_per_day'], marketing_data['proposed_spend'], s=3)
plt.show()
The plot above looks at proposed retention spend by customer against the amount paid per day by customer. Some scatter exists in the plot showing that not all members at a specific NTD/day price level are equal in terms of risk of churn. The proposed retention spend is calculated over the three year (assumed lifetime) of the customer (so at a level of 100 NTD, the proposed incentive per day would be ~0.09 NTD).
Visual for marketing spend in the context of the total number of customers
# Visual of spend in context of the number of users
plt.title('Proposed retention marketing spend spread among all customers')
plt.xlabel('Proposed retention marketing spend per customer')
plt.ylabel('Number of customers')
plt.hist(marketing_data['proposed_spend'] )
plt.show()
From the plot above, we can see that for most users the proposed spend is 0: the model predicts they won't churn, so no incentives should be offered. This makes economic sense, as our dataset contains only 6% churn, and we cannot afford to offer incentives to members who are unlikely to churn.
We do see some proposed spend between 50 and 150 NTD, which is consistent with the earlier scatter plot; overwhelmingly, however, the proposed spend on a customer is 0 NTD (New Taiwan Dollars).
Visual for marketing spend per user
#Scatter Plot to Show Proposed Spend per User
sorted_marketing = marketing_data.sort_values(by='proposed_spend')
sorted_marketing = sorted_marketing[sorted_marketing['proposed_spend'] >0.0]['proposed_spend']
sorted_marketing = sorted_marketing.reset_index()
plt.scatter(sorted_marketing.index, sorted_marketing['proposed_spend'])
plt.xlabel('Member - re-indexed to integer for plotting purposes')
plt.ylabel('Maximum proposed spend on each user (NTD)')
plt.title('Sorted proposed spend on user over 3 year lifetime')
Text(0.5,1,'Sorted proposed spend on user over 3 year lifetime')
The plot above looks at the sorted proposed spend per user; it has been sorted to surface any insights from the spend levels. Near the 85 NTD (New Taiwan Dollar) mark there is a slight plateau, which could indicate a good level of incentive to offer users.
Beyond the slight plateau, the maximum proposed spend increases approximately linearly over the range of 75 to 130 NTD. This suggests that users whose proposed spend is >130 NTD may not be worth pursuing, due to the high cost of incentives.
The proposed spend is the maximum spend allocated over the customer lifetime (assumed 3 years) and would be distributed through discounts and incentives. It may be possible to prevent churn by offering less than this maximum spend amount, which is an opportunity for a future ML model.
# Recap of key metrics
print('''
------> Key metrics from the model:
Total number of users in the sample: {}
Total number of users projected to churn (any probability): {}
Total proposed marketing spend: NT${:,.0f}
Total revenue (3 year) from all users in the sample: NT${:,.0f}
Total revenue (3 year) from users at risk in the sample: NT${:,.0f}
Average Reacquisition/retention marketing spend/user: {}
Average Reacquisition/retention marketing spend for users marked to churn: {}
Ratio of Reacquisition marketing spend to revenue from all users: {:.4f}
Ratio of Reacquisition marketing spend to revenue from at risk users: {:.4f}
''' .format(len(predictions), ch, int(tpms), int(altv3y*len(predictions)),
int(altv3y*ch), int(tpmsa), int(tpmsac),
tpms/(int(altv3y*len(predictions))),
tpms/int(altv3y*ch) )
)
------> Key metrics from the model:
Total number of users in the sample: 11681
Total number of users projected to churn (any probability): 705
Total proposed marketing spend: NT$77,024
Total revenue (3 year) from all users in the sample: NT$58,114,639
Total revenue (3 year) from users at risk in the sample: NT$3,507,475
Average Reacquisition/retention marketing spend/user: 6
Average Reacquisition/retention marketing spend for users marked to churn: 109
Ratio of Reacquisition marketing spend to revenue from all users: 0.0013
Ratio of Reacquisition marketing spend to revenue from at risk users: 0.0220
Economic Impact Summary
We were able to create a very usable model for the retention/reacquisition effort in the business. In summary, of our NT\$58M 3-year revenue from our customer base, we estimate we will lose NT\$3.5M to churn. Accounting for the probability that users will actually churn, the team recommends spending NT\$77K on trying to retain these users. In the marketing data, the NT\$77K is broken down by user; however, we would recommend a discount program with a few tiers, as opposed to custom offers for every user.
The marketing data frame keeps most original customer parameters and adds the following for use in business planning:
- predictions - will the customer churn (1 = will churn)
- probability_churn - the probability of the customer leaving (higher is worse)
- proposed_spend - we can spend up to this value to keep this customer
Our calculations (mostly for validation in this report) show that the retention marketing spend is in line with the revenue opportunity for the business. Key metrics are recalculated above.
One caveat to our assertion is the effectiveness of the reacquisition marketing spend; feedback on that could influence the value of our spend as we put the model into production.
Another thing to note is that our model proposes a spend for every customer at risk. In reality, we may not be able to save customers above an 80% probability of churn. Should we invest in these users?
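One way to explore that question is to cap the retention budget at a churn-probability threshold. A sketch, with a tiny synthetic frame standing in for the `probability_churn` and `proposed_spend` columns of `marketing_data`:

```python
# Sketch: excluding customers above an 80% churn probability from the
# retention budget. The values are synthetic stand-ins for the
# marketing_data columns, not actual model output.
import pandas as pd

df = pd.DataFrame({
    'probability_churn': [0.05, 0.30, 0.60, 0.85, 0.95],
    'proposed_spend':    [0.0,  40.0, 80.0, 110.0, 125.0],
})

# Only budget for customers we plausibly can still save
savable = df[df['probability_churn'] <= 0.80]
print(savable['proposed_spend'].sum())  # budget excluding likely-lost users
```

In this toy frame the cap removes the two most expensive offers; on real data the threshold itself would be something to tune against retention-campaign feedback.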
6. Final Insights and Takeaways
Here are our insights and commentary from our analysis:
Predictive Modeling Summary and Future Considerations
As discussed in the modeling section, the final model had a recall performance of 98.3% (on the test data), correctly predicting almost every user who actually churned (408 of 415 users), and also correctly predicting almost every user who would actually stay (6,444 of 6,609 users).
Given the strong performance of the current model, any future enhancements would need to balance the cost of making those enhancements against the potential benefits of retaining a few more users. That said, the team thinks the best place to find additional gains in the model is by engineering additional features from the users, potentially from reviewing user listening habits such as music genre.
Economic Impact
We created and validated a model to guide retention-program spending for customers at risk of churn: RAC = .75 * CAC * POC * (LTV/OLTV), where RAC is the reacquisition cost (retention spending per customer), CAC is the cost of customer acquisition, POC is the probability of churn, LTV is the lifetime value of a customer, and OLTV is the optimum lifetime value (the revenue from our best customer).
The model works with a spend ceiling (the customer acquisition cost) that's about 10% of the lifetime value of an average customer. We stay below this limit by scaling the amount by the value of the customer and her probability of churn.
A sample run with our development data shows a total marketing spend of about NT$77K to protect ~NT$3.5M of revenue. The model shows how this amount can be divvied up among the customers (in the marketing data frame).
A production version of this model can guide the business on both the candidates and amount for retention spending among the customer base.
Productization and Future Development
Implementation as a pipeline, performance testing and feedback loop
Our assessment is that at least the following would have to be done to put the model into production:
* Implementation as a pipeline: The model is quite efficient and can be implemented offline or as a near-realtime implementation.
  * For an offline implementation, data can be picked up from backup databases and loaded into a system like HDFS. Predictions can then be computed on Hadoop or with an application like Spark.
  * A near-realtime pipeline can be built by duplicating usage and transaction events into a Kafka ---> Spark ---> HDFS pipeline for writing, and then batch processing as above.
* Performance testing: Further performance testing is recommended for either an offline or a realtime pipeline. Our testing used a small subset on a laptop and may not reflect the needs of the enterprise.
* Feedback loop: A feedback loop and experiments on the following could increase the efficacy of the model:
  * Predictions: Feedback on predictions can be implemented by keeping a small control group on which we do not use our retention methods.
  * Retention spending: Feedback on our retention methods can come directly from the group on which we use them; in the simplest case we just have to find out whether the customer stayed with the service.
  * Models and calibration: Retesting and recalibration would be needed about once a quarter.
The goal of the above would be to maintain or improve the model's accuracy and recall while rolling it out to growing customer cohorts. From this analysis, we hope to provide insight into customers who have the potential to churn, and to recommend incentives for those customers to prevent churn. We believe the future work listed above can make a significant impact on business operations, generating more revenue and profit for the company.
An additional topic that could be explored in the future is understanding at what incentive level a user won't churn. Here we present the maximum incentive spend (over a 3-year span) per customer; from a business sense, however, we would like to offer the minimum incentive that prevents churn. Such a model can build on what has already been developed here, and will require additional data capturing the incentives previously offered and the outcomes of those offers.
Initial Data Filter and QC
import google.datalab.bigquery as bq
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Pasted here is the code that was initially run. As mentioned in the report, our dataset is large (~31GB of combined tables) and couldn't all be processed on a single computer. For the sake of this analysis, we filtered the data down to a manageable size (~1.5GB). Additionally, we saw that our dataset contained only ~6% churn; to help improve our model, we wanted a filtered dataset with a 50/50 split between churn and non-churn.
The methodology to get this data into a usable format was the following:
- Upload the data into Google Cloud Storage
- Use Google BigQuery to run SQL statements against the datasets
- Export the results as CSV files to be managed locally
For our dataset we wanted approximately 100k members (the total dataset is ~993k members, so roughly 10% of our data). With a 50/50 split, we therefore wanted 50k members who churned and 50k who did not. To filter the tables, we first ran two queries against the labels table, one per class:
SELECT * FROM [w207_kkbox_bq_data.labels] WHERE (RAND(5) < 50000/(SELECT COUNT(*) FROM [w207_kkbox_bq_data.labels] WHERE is_churn = 0) AND is_churn = 0)
SELECT * FROM [w207_kkbox_bq_data.labels] WHERE (RAND(5) < 50000/(SELECT COUNT(*) FROM [w207_kkbox_bq_data.labels] WHERE is_churn = 1) AND is_churn = 1)
These two result tables were written to Cloud Storage and provided the baseline set of members to keep when querying the other tables. The labels table contains the member ID as well as the is_churn (dependent) variable. We then used these two tables to query the other datasets, joining on member ID. (We kept the two tables separate and combined the is_churn = 0 and is_churn = 1 results into a single table at the end; in hindsight, it would have been more efficient to combine the labels tables first.)
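The class-balanced sampling done above in BigQuery can be sketched locally with pandas. This is an illustrative equivalent, assuming a labels DataFrame with `msno` and `is_churn` columns, not the code that was actually run.

```python
import pandas as pd

def balanced_sample(labels, n=50_000, seed=5):
    """Draw up to n rows from each is_churn class for a 50/50 split."""
    churn = labels[labels["is_churn"] == 1]
    no_churn = labels[labels["is_churn"] == 0]
    return pd.concat([
        churn.sample(n=min(n, len(churn)), random_state=seed),
        no_churn.sample(n=min(n, len(no_churn)), random_state=seed),
    ]).reset_index(drop=True)

# Tiny demo frame standing in for the labels table.
labels = pd.DataFrame({"msno": list(range(10)),
                       "is_churn": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]})
sample = balanced_sample(labels, n=3)  # 3 churners + 3 non-churners
```

Note that, unlike the `RAND(seed) < rate` approach, `sample(n=...)` returns exactly n rows per class rather than approximately n.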
Members Table:
SELECT members.* FROM [w207_kkbox_bq_data.members] as members INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON members.msno = lab.msno
SELECT members.* FROM [w207_kkbox_bq_data.members] as members INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON members.msno = lab.msno
Transactions Table:
SELECT transactions.* FROM [w207_kkbox_bq_data.transactions] as transactions INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON transactions.msno = lab.msno
SELECT transactions.* FROM [w207_kkbox_bq_data.transactions] as transactions INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON transactions.msno = lab.msno
User_Logs Table:
SELECT user_logs.* FROM [w207_kkbox_bq_data.user_logs] as user_logs INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON user_logs.msno = lab.msno
SELECT user_logs.* FROM [w207_kkbox_bq_data.user_logs] as user_logs INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON user_logs.msno = lab.msno
With this methodology there were two tables (is_churn = 0 and is_churn = 1) for each table in the original dataset. We then appended each pair back into a single table and exported to CSV (labels_filtered.csv, members_filtered.csv, transactions_filtered.csv, user_logs_filtered.csv). From these local files we then began development toward predicting churn.
While we moved ahead with this filtered dataset, we also wanted to perform a quick EDA comparing the full dataset against the filtered one, to compare and contrast some of the features. We understand that we artificially inflated the is_churn = 1 proportion for purposes of model training, so some differences between the filtered and original datasets are expected.
%%bq query --name is_churn_large
SELECT SUM(is_churn) / COUNT(is_churn)
FROM `w207_kkbox_bq_data.labels` AS labels
is_churn_large.execute().result()
f0_ |
---|
0.06392287077349786 |
(rows: 1, time: 0.3s, cached, job: job_WFvTJUNTiVU0lg8eZM7HiKzbFhYR)
%%bq query --name is_churn_small
SELECT SUM(is_churn) / COUNT(is_churn)
FROM `w207_kkbox_bq_data.labels_filtered_100k` AS labels
is_churn_small.execute().result()
f0_ |
---|
0.5005559729526672 |
(rows: 1, time: 0.1s, cached, job: job_6qccT3EvuDYAbLt8e7bMXy_rQuJC)
Here we look at the proportion of is_churn data. We can see that we manipulated the churn percentage in the filtered dataset. This was intentional, to create an approximately even split between is_churn = 1 and is_churn = 0, which benefits model training by increasing the number of churn examples (our original dataset contains only ~6% churn).
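The same sanity check can be run locally on the exported labels file. A minimal sketch, shown here on a tiny demo frame rather than labels_filtered.csv itself:

```python
import pandas as pd

def churn_rate(labels: pd.DataFrame) -> float:
    """Fraction of members with is_churn == 1; equivalent to
    SUM(is_churn) / COUNT(is_churn) in the query above."""
    return labels["is_churn"].mean()

# Demo frame; on labels_filtered.csv this should come out near 0.50.
demo = pd.DataFrame({"is_churn": [1, 0, 1, 1]})
rate = churn_rate(demo)  # 3 of 4 rows churned -> 0.75
```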
%%bq query --name cities
SELECT city, COUNT(msno) AS population
FROM `w207_kkbox_bq_data.members`
GROUP BY city
cities.execute().result()
city | population |
---|---|
1 | 4804326 |
5 | 385069 |
9 | 47639 |
6 | 135200 |
4 | 246848 |
13 | 320978 |
22 | 210407 |
14 | 89940 |
8 | 45975 |
15 | 190213 |
17 | 27772 |
10 | 32482 |
11 | 47489 |
18 | 38039 |
12 | 66843 |
16 | 5092 |
21 | 30837 |
7 | 11610 |
3 | 27282 |
19 | 1199 |
20 | 4233 |
(rows: 21, time: 0.1s, cached, job: job_VRruJOzjmzBAePlRp8ei86w2l3uB)
%chart columns --data cities --fields city,population
%%bq query --name cities
SELECT members_city AS city, COUNT(members_msno) AS population
FROM `w207_kkbox_bq_data.members_filtered_100k`
GROUP BY city
cities.execute().result()
city | population |
---|---|
1 | 40639 |
10 | 763 |
4 | 5670 |
5 | 8469 |
13 | 11012 |
14 | 2273 |
15 | 5032 |
8 | 937 |
12 | 1395 |
22 | 4825 |
7 | 290 |
6 | 3113 |
21 | 657 |
17 | 581 |
11 | 1021 |
16 | 118 |
9 | 1107 |
18 | 900 |
3 | 584 |
20 | 76 |
19 | 11 |
(rows: 21, time: 0.2s, cached, job: job_gV83pX9lLCc3jSdGZhB1sEsQFWW0)
%chart columns --data cities --fields city,population
We can see that one city (city = 1) uses the KKBOX service far more than any other city across the geographical area where KKBOX is used. Comparing the two datasets, city 1 holds a higher proportion of the total members in the original dataset than in the filtered one.
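The proportion comparison above can be made directly by normalizing each table's city counts. A sketch using small stand-in frames for the full and filtered members tables:

```python
import pandas as pd

def city_share(members: pd.DataFrame) -> pd.Series:
    """Proportion of members per city, largest first."""
    return members["city"].value_counts(normalize=True)

# Hypothetical stand-ins for the full and filtered members tables.
full = pd.DataFrame({"city": [1, 1, 1, 5, 13]})
filtered = pd.DataFrame({"city": [1, 1, 5, 13, 13]})

# Side-by-side shares; NaN marks a city absent from one dataset.
comparison = pd.concat({"full": city_share(full),
                        "filtered": city_share(filtered)}, axis=1)
```

On the real tables, this would show city 1 at ~72% of members in the full dataset versus ~45% in the filtered one, matching the charts.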
%%bq query --name city_renew
SELECT members.city AS city, CAST(SUM(transactions.is_auto_renew) / COUNT(members.city) AS FLOAT64) AS renew_by_city
FROM `w207_kkbox_bq_data.members` AS members
INNER JOIN `w207_kkbox_bq_data.transactions` AS transactions
ON members.msno = transactions.msno
GROUP BY members.city
ORDER BY renew_by_city DESC
%chart columns --data city_renew --fields city,renew_by_city
%%bq query --name city_renew
SELECT members.members_city AS city, CAST(SUM(transactions.transactions_is_auto_renew) / COUNT(members.members_city) AS FLOAT64) AS renew_by_city
FROM `w207_kkbox_bq_data.members_filtered_100k` AS members
INNER JOIN `w207_kkbox_bq_data.transactions_filtered_100k` AS transactions
ON members.members_msno = transactions.transactions_msno
GROUP BY city
ORDER BY renew_by_city DESC
%chart columns --data city_renew --fields city,renew_by_city
In the plot above we analyze the proportion of auto-renew customers by city. Earlier we saw that city 1 had the highest overall user count; here we see it also has the highest proportion of auto-renew customers. Overall, most customers appear to be on an auto-renew plan, which is good from a business perspective.
The two bar charts have a similar structure, suggesting the datasets are consistent for this feature.
%%bq query --name gender_describe
SELECT COUNT(gender) AS gen FROM `w207_kkbox_bq_data.members` AS members GROUP BY gender
gender_describe.execute().result()
gen |
---|
1144613 |
1195355 |
0 |
(rows: 3, time: 0.2s, cached, job: job_AfNz0MGXT8OgGZx6EY3RMrmDKeOA)
%%bq query --name gender_describe
SELECT COUNT(members_gender) AS gen FROM `w207_kkbox_bq_data.members_filtered_100k` AS members GROUP BY members_gender
gender_describe.execute().result()
gen |
---|
24507 |
21630 |
0 |
(rows: 3, time: 0.1s, cached, job: job__0rEg7lP0hion9yO77VylbSMR6ij)
From the printout above, we can see that the gender distribution is similar between the full and filtered datasets, with males slightly outnumbering females. (The 0 row comes from the NULL group: COUNT(gender) counts only non-null values, so it is zero for rows where gender is missing.)
Below we will now printout the description of the tables for comparison.
%%bq query --name describe_members
SELECT COUNT(DISTINCT(city)) AS city_count, COUNT(DISTINCT(registered_via)) AS registration_type, AVG(bd) AS average_bd
FROM `w207_kkbox_bq_data.members`
describe_members.execute().result()
city_count | registration_type | average_bd |
---|---|---|
21 | 18 | 9.795794295951625 |
(rows: 1, time: 0.2s, cached, job: job_s_s-BUim86KEeQsaPWK8aI7ePUsl)
%%bq query --name describe_members
SELECT COUNT(DISTINCT(members_city)) AS city_count, COUNT(DISTINCT(members_registered_via)) AS registration_type, AVG(members_bd) AS average_bd
FROM `w207_kkbox_bq_data.members_filtered_100k`
describe_members.execute().result()
city_count | registration_type | average_bd |
---|---|---|
21 | 5 | 14.915069350530343 |
(rows: 1, time: 0.2s, cached, job: job_RCIg6slHgd8nt-De7OuUfeub23gU)
We can see here that the filtered dataset doesn't capture all of the registration types (only 5 of the 18 total). Presumably a few registration types are very popular and many are rarely used, which is why our data subset contains only a portion of them. The average bd value is higher in our filtered dataset (presumably indicating younger users, if bd is computed as days after a particular date).
%%bq query --name describe_transactions
SELECT COUNT(DISTINCT(payment_method_id)) as payment_method, COUNT(DISTINCT(plan_list_price)) as num_plans, SUM(is_auto_renew) / COUNT(is_auto_renew) as prop_auto_renew,
AVG(actual_amount_paid) AS plan_revenue, SUM(is_cancel) / COUNT(is_cancel) AS prop_cancel
FROM `w207_kkbox_bq_data.transactions`
describe_transactions.execute().result()
payment_method | num_plans | prop_auto_renew | plan_revenue | prop_cancel |
---|---|---|---|---|
40 | 51 | 0.8519661406812573 | 141.98732048354586 | 0.03976522648819046 |
(rows: 1, time: 0.1s, cached, job: job_w7nJR7nlHvuW0mk-jUsSP5EW_pu6)
%%bq query --name describe_transactions
SELECT COUNT(DISTINCT(transactions_payment_method_id)) as payment_method, COUNT(DISTINCT(transactions_plan_list_price)) as num_plans, SUM(transactions_is_auto_renew) / COUNT(transactions_is_auto_renew) as prop_auto_renew,
AVG(transactions_actual_amount_paid) AS plan_revenue, SUM(transactions_is_cancel) / COUNT(transactions_is_cancel) AS prop_cancel
FROM `w207_kkbox_bq_data.transactions_filtered_100k`
describe_transactions.execute().result()
payment_method | num_plans | prop_auto_renew | plan_revenue | prop_cancel |
---|---|---|---|---|
37 | 42 | 0.8315028382832431 | 145.68530483745585 | 0.031064849396989492 |
(rows: 1, time: 0.1s, cached, job: job_HNnfHsxnSm7qJdbkItU1IJPU4-ej)
Comparing the filtered and full datasets, both the number of payment methods and the number of plans decrease slightly in the filtered dataset, but not by much. One key point is that moving to the full dataset will require refitting on that larger dataset, because it contains values the smaller model has never seen. The proportion of auto-renew members is very similar across both datasets, as is the plan revenue (average amount paid). Slightly fewer members cancel in the filtered dataset than in the full dataset, though both proportions are low (and, as noted before, is_cancel is lower than is_churn, which is an interesting observation). Given that the filtered dataset contains 50% churn, such a low cancel rate, even lower than the full dataset's, is surprising.
%%bq query --name describe_user_logs
SELECT SUM(total_secs) AS listening_time, SUM(num_unq) AS number_unique, AVG(date) AS average_date, SUM(num_100) AS sum_full_songs, SUM(num_25) AS sum_25per_songs
FROM `w207_kkbox_bq_data.user_logs`
describe_user_logs.execute().result()
listening_time | number_unique | average_date | sum_full_songs | sum_25per_songs |
---|---|---|---|---|
-5.665342138557264e+20 | 11798546903 | 20157392.77279009 | 12045813613 | 2553501878 |
(rows: 1, time: 0.1s, cached, job: job_SJWC3sRNxUqMaP_eGiz7LB-SRhZk)
%%bq query --name describe_user_logs
SELECT SUM(user_logs_total_secs) AS listening_time, SUM(user_logs_num_unq) AS number_unique, AVG(user_logs_date) AS average_date, SUM(user_logs_num_100) AS sum_full_songs, SUM(user_logs_num_25) AS sum_25per_songs
FROM `w207_kkbox_bq_data.user_logs_filtered_100k`
describe_user_logs.execute().result()
listening_time | number_unique | average_date | sum_full_songs | sum_25per_songs |
---|---|---|---|---|
-3.1202667405739975e+19 | 711055859 | 20158095.16186604 | 719873498 | 155565964 |
(rows: 1, time: 0.1s, cached, job: job_6Zclmg8gA3QsOm_qb04JlROqwHV9)
Comparing the two datasets, we can see that both have a negative total listening time, which intuitively doesn't make sense. It will be important to better understand how this data is collected in order to properly quality-check the column and its values. The number of unique songs is higher in the full dataset, which makes sense as there is more data. The date is in the form YYYYMMDD, so the average date for the user logs falls in 2015, but averaging a date stored as an integer produces an incorrect date (Datalab doesn't handle dates the same way as BigQuery, so further analysis was not carried out in this notebook). The sums of full songs and 25% songs are also higher in the full dataset, which again makes sense. It is interesting that the ratio of full_songs to 25per_songs is approximately the same (4.71 in the full vs 4.63 in the filtered).
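The date-averaging pitfall above can be shown with a short pandas sketch: averaging YYYYMMDD integers directly yields a meaningless number at month boundaries, while parsing them first gives a true calendar mean.

```python
import pandas as pd

# Two adjacent log dates straddling a month boundary, as YYYYMMDD ints.
raw = pd.Series([20150131, 20150201])
dates = pd.to_datetime(raw, format="%Y%m%d")

int_avg = raw.mean()    # 20150166.0 -- not a valid calendar date
cal_avg = dates.mean()  # midnight mean: 2015-01-31 12:00:00
```

The same conversion would be the right first step before computing any date statistics on the user_logs table.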