Music Churn Prediction

by Cameron Kennedy, Gaurav Khanna, Aaron Olson - Sat 08 September 2018
Tags: #big query #python #scikit-learn #boosted trees #xgboost

Overview of Notebooks

For this project, the team created 3 separate Jupyter Notebooks to document its work:

1) Data Preparation / Feature Extraction Notebook: This notebook gives an overview of the project, and then takes the raw data, performs some initial exploration, and generates features for the predictive models. It also performs a brief exploratory data analysis on the feature set. Finally, this notebook outputs a .pkl file of features for the second notebook to read, which saves considerable time when building the models.

2) Predictive Modeling Notebook: This notebook reads the .pkl file, builds machine learning models to predict user churn, calculates and calibrates churn probabilities, and generates a projected economic impact of users who leave.

3) Initial Data Sourcing and Validation Notebook (HTML file): This is a static notebook (uploaded as an HTML file - not intended for executing code) that documents two other aspects of the project that don't logically fit in either of the first two notebooks:

  • First, it contains the initial data extraction code used in Google BigQuery to reduce the data set from ~30GB down to ~1.6GB so that it can run on local machines (a sketch of this pattern appears after this list).

  • Second, it contains code that performs data integrity checks, validating that the items extracted in our smaller data set approximately match those in the full data set (e.g., the same level of churn, the same timeframe, etc.).
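For reference, the BigQuery reduction followed roughly this pattern (a minimal sketch, not the actual query: the dataset/table name, sampling rule, and client setup here are hypothetical):

#Sketch of the BigQuery-based data reduction (hypothetical table name and sampling rule)
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT t.*
FROM `kkbox.user_logs` AS t
WHERE MOD(ABS(FARM_FINGERPRINT(t.msno)), 100) < 9  -- keep ~9% of users, selected by hashed ID
"""
user_logs_sample = client.query(query).to_dataframe()

Sampling on a hash of the user ID keeps every retained user's full history intact, which is what allows churn rates and timeframes in the sample to approximately mirror the full data set.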

Table of Contents (this notebook only)

  1. Project Overview
  2. Data Set Overview
  3. Initial Data Loading
  4. User Logs Data: Preparation and Feature Extraction
  5. Transaction Data: Preparation and Feature Extraction
  6. Joining Features and Data Manipulation
  7. Quick Exploratory Data Analysis
  8. Writing Output

1. Project Overview

This dataset comprises data collected by WSDM on a music streaming subscription service offered through KKBOX.

Project Goals:

The project aims to accomplish the following goals:

  • Create a model to predict customer churn from usage and transaction data
  • Create an economic model for retention
  • Recommend a process for keeping the churn and economic retention models updated with latest information

2. Data Set Overview

The initial data set contains 24 variables (23 input variables and 1 variable to predict), spread across 4 tables. Additional details:

  • Original format: csv
  • Total Size: 31.14 GB, reduced to 1.6 GB for analysis on local machines
  • User Count: 1.02 million labeled users contained in the Train table (88,544 users after reduction)
  • Date Range: Customer usage and transaction data span 26 months, from Jan. 2015 to Feb. 2017. However, one of the data fields is the date each user initially joined the service, with dates ranging from 2004 to 2017.
  • Balance: Approximately 6% of users in the data churned (positive labels); the remaining 94% stayed (negative labels). This is confirmed in the sketch below.
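A quick way to confirm this balance (a small sketch, assuming the labels table read in Section 3):

#Check the class balance of the churn labels
labels['is_churn'].value_counts(normalize=True)
#Expect roughly 0.94 for class 0 (stayed) and 0.06 for class 1 (churned)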

Listed below are the tables and variables or features available for study:

Table: Transactions

This table contains transaction data for each user. Each row is a payment transaction.

  • Data Shape: 21.5M rows X 9 columns
  • Data Size: 1.6GB

Data Fields:

  • Msno: User ID
  • Payment_method_id: Payment Method
  • Payment_plan_days: Length of plan
  • Plan_list_price: Price for the plan
  • Actual_amount_paid: Amount paid
  • Is_auto_renew: T/F flag indicating whether the membership auto-renews
  • Transaction_date: Date of purchase
  • Membership_expire_date: Expiry date
  • Is_cancel: T/F flag indicating whether or not the user canceled service. This field is correlated with the is_churn label, though it isn't identical, as it also captures users who change service.

Table: User Logs

This table records who used the service, how, and when. Each row is a unique user-date combination.

  • Data Shape: 392M rows X 9 columns
  • Data Size: 29.1GB

Data Fields:

  • Msno: User ID
  • Date: Date of the logged activity
  • Num_25: Number of songs played < 25% of song length
  • Num_50: Number of songs played between 25% and 50%
  • Num_75: Number of songs played between 50% and 75%
  • Num_985: Number of songs played between 75% and 98.5%
  • Num_100: Number of songs played between 98.5% and 100%
  • Num_unq: Number of unique songs played
  • Total_secs: Total seconds played

Table: Members

Demographic data on each user. Each row represents a unique user.

  • Data Shape: 6.8M rows X 6 columns
  • Data Size: 0.4GB

Data Fields:

  • Msno: User ID
  • City: City of the user
  • BD: Age of the user
  • Gender: Male, Female or Blank
  • Registered_via: Registration method
  • Registration_init_time: Initial time of registration
  • Expiration_date: Expiration of membership

Table: Train

Labels of which users churned. Each row represents a unique user.

  • Data Shape: 1.0M rows X 2 columns
  • Data Size: 45MB

Data Fields:

  • Msno: User ID
  • Is_churn: T/F flag variable we are trying to predict.

3. Initial Data Loading

This loading and initial inspection are performed in the cells below.

#Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Set initial parameter(s)
pd.set_option('display.max_rows', 200)
pd.options.display.max_columns = 2000

We load the data, indexing by the primary key (msno, a string-like/object field that identifies each user):

#Load the data
members = pd.read_csv('members_filtered.csv')
transactions = pd.read_csv('transactions_filtered.csv')
user_logs = pd.read_csv('user_logs_filtered.csv')
labels = pd.read_csv('labels_filtered.csv')

#Set indices
members.set_index('msno', inplace = True)
labels.set_index('msno', inplace = True)

user_logs.head()
msno date num_25 num_50 num_75 num_985 num_100 num_unq total_secs
0 MVODUEUlSocm1sXa+zVGpJazPrRFiD4IzEQk0QCdg4U= 20170217 37 2 2 3 30 66 9022.818
1 o3Dg7baW8dXq7Jq7NzlVrWG4mZNVvqp62oWBDO/ybeE= 20160209 36 5 2 3 48 71 13895.453
2 6ERcO7aqAKvrQ2CAvah79dVC7tJVZSjNti1MBfpNVW4= 20151210 26 9 3 0 51 54 13919.805
3 Xt9VAHNtHuST21tkcZSnGKjwv8vF8/COnsf6z28+fKk= 20161025 22 8 4 2 49 75 15147.842
4 zSgTJqoosTiFF7ZZi1DPTHgxLbnd99IgOEsTIDCcZHc= 20160904 26 3 1 0 39 60 10558.829

Performing a quick inspection of the data:

print('Transactions: \n')
transactions.info()

print('User Logs: \n')
user_logs.info()

print('Members: \n')
members.info()

print('Labels:')
labels.info()
Transactions:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353459 entries, 0 to 1353458
Data columns (total 9 columns):
msno                      1353459 non-null object
payment_method_id         1353459 non-null int64
payment_plan_days         1353459 non-null int64
plan_list_price           1353459 non-null int64
actual_amount_paid        1353459 non-null int64
is_auto_renew             1353459 non-null int64
transaction_date          1353459 non-null int64
membership_expire_date    1353459 non-null int64
is_cancel                 1353459 non-null int64
dtypes: int64(8), object(1)
memory usage: 92.9+ MB
User Logs:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19710631 entries, 0 to 19710630
Data columns (total 9 columns):
msno          object
date          int64
num_25        int64
num_50        int64
num_75        int64
num_985       int64
num_100       int64
num_unq       int64
total_secs    float64
dtypes: float64(1), int64(7), object(1)
memory usage: 1.3+ GB
Members:

<class 'pandas.core.frame.DataFrame'>
Index: 89473 entries, mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= to EFbHYa9/MiKYiyrl05cZ34Cky0FDeHxTYij0pXwkr2A=
Data columns (total 5 columns):
city                      89473 non-null int64
bd                        89473 non-null int64
gender                    46137 non-null object
registered_via            89473 non-null int64
registration_init_time    89473 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.1+ MB
Labels:
<class 'pandas.core.frame.DataFrame'>
Index: 99825 entries, 3lh94wH+UPK7ENgnA5svzFMYfJJRMZHU/WjgvhRJPzc= to DIgxCOJBeanFdqLOOPMTzwwkqgREVG+g1pwfY5LWvC4=
Data columns (total 1 columns):
is_churn    99825 non-null int64
dtypes: int64(1)
memory usage: 1.5+ MB

Helper routine to format the date for visualization:

def pd_to_date(df_col):
    """Function to convert a pandas dataframe column from %Y%m%d format to datetime format.

    Args:
        df_col (column in a pandas dataframe):  The column to be changed.

    Returns:
        The same column in datetime format.

    """
    df_col = pd.to_datetime(df_col, format = '%Y%m%d')
    return df_col
#Convert date column to date format
user_logs['date'] = pd_to_date(user_logs['date'])

The next two sections prepare the 2 major data tables/frames (User Logs & Transactions) independently and then bring them together for analysis.

4. User Logs Data: Preparation and Feature Extraction

We first create our groupby object to ultimately aggregate data by users:

#Create our groupby user object 
user_logs_gb = user_logs.groupby(['msno'], sort=False)

The next cell creates three new columns:

  • max_date: The latest date on which each user has logged activity
  • days_before_max_date: The number of days between the max date and the date of the current record.
  • listening_tenure: The number of days between the max date and min date of the current user. The hypothesis for this feature is that a user who's been using the service for a long time may be less likely to churn than one who's been using the service for a short time.
#Append max date to every row in main table
user_logs['max_date'] = user_logs_gb['date'].transform('max')
user_logs['days_before_max_date'] = (user_logs['max_date'] - user_logs['date']).apply(lambda x: x.days)
    #The .apply(lambda...  just converts it from datetime to an integer, for easier comparisons later.

#Generate user's first date, last date, and tenure
#Also, the user_logs_features table will be the primary table to return from the transactions table
user_logs_features = (user_logs_gb
    .agg({'date':['max', 'min', lambda x: (max(x) - min(x)).days]})  #.days converts to int
    .rename(columns={'max': 'max_date', 'min': 'min_date','<lambda>':'listening_tenure'})
                      )
#Add a 3rd level, used for joining data later
user_logs_features = pd.concat([user_logs_features], axis=1, keys=['date_features'])

Let's take a look at our initial users table:

user_logs_features.head()
date_features
date
max_date min_date listening_tenure
msno
MVODUEUlSocm1sXa+zVGpJazPrRFiD4IzEQk0QCdg4U= 2017-02-27 2015-07-11 597
o3Dg7baW8dXq7Jq7NzlVrWG4mZNVvqp62oWBDO/ybeE= 2017-02-07 2015-03-10 700
6ERcO7aqAKvrQ2CAvah79dVC7tJVZSjNti1MBfpNVW4= 2017-02-17 2015-01-01 778
Xt9VAHNtHuST21tkcZSnGKjwv8vF8/COnsf6z28+fKk= 2017-02-28 2016-09-08 173
zSgTJqoosTiFF7ZZi1DPTHgxLbnd99IgOEsTIDCcZHc= 2017-02-13 2015-01-01 774

We now create features to look at patterns of usage over the past X days, where X is days_before_max_date, to see what a user has been doing "lately". We apply this rationale to all of the usage columns in the user_logs table, giving us combinations of the following elements of our data:

  • Number of songs played < Y% of song length, where Y is 25, 50, 75, 98.5, and 100, plus the number of unique songs and total seconds played.
  • Activity over the last 1, 7, 14, 31, 90, 180, and 365 days, plus all days, noting that each date range is relative to the user's most recent activity.

For each of these combinations, we calculate (using groupby and aggregate) both the sum and mean of each feature. Finally, we also create a single, total count column (number of rows) for each window. In total, this generates 120 features (8 windows x 15 aggregates each), which we then append to the user_logs_features table above.

#Create Features:
    # Total X=(seconds, 100, 985, 75, 50, 25, unique), avg per day of X, maybe median per day of X
    # Last day, last 7 days, last 30 days, last 90, 180, 365, total (note last day is relative to user)

for num_days in [1, 7, 14, 31, 90, 180, 365, 9999]:
    #Create groupby object for items with x days
    ul_gb_xdays = (user_logs.loc[(user_logs['days_before_max_date'] < num_days)]
                   .groupby(['msno'], sort=False))

    #Generate sum and mean (and count, once) for all the user logs stats
    past_xdays_by_user = (ul_gb_xdays
        .agg({'num_unq':['sum', 'mean', 'count'],
              'total_secs':['sum', 'mean'],
              'num_25':['sum', 'mean'],
              'num_50':['sum', 'mean'],
              'num_75':['sum', 'mean'],
              'num_985':['sum', 'mean'],
              'num_100':['sum', 'mean'],
             })
                      )
    #Append level header
    past_xdays_by_user = pd.concat([past_xdays_by_user], axis=1, keys=['within_days_' + str(num_days)])

    #Join (append) to user_logs_features table
    user_logs_features = user_logs_features.join(past_xdays_by_user, how='inner')

Taking a quick look at our table now:

user_logs_features.head()
date_features within_days_1 within_days_7 within_days_14 within_days_31 within_days_90 within_days_180 within_days_365 within_days_9999
date num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100 num_unq total_secs num_25 num_50 num_75 num_985 num_100
max_date min_date listening_tenure sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean sum mean count sum mean sum mean sum mean sum mean sum mean sum mean
msno
MVODUEUlSocm1sXa+zVGpJazPrRFiD4IzEQk0QCdg4U= 2017-02-27 2015-07-11 597 17 17 1 29802.123 29802.123 17 17 1 1 0 0 1 1 115 115 161 26.833333 6 72502.206 12083.70100 99 16.500000 13 2.166667 11 1.833333 9 1.50 275 45.833333 478 39.833333 12 154157.897 12846.491417 207 17.250000 24 2.000000 16 1.333333 19 1.583333 595 49.583333 1040 41.600000 25 299571.033 11982.841320 448 17.920000 53 2.120000 27 1.080000 43 1.720000 1153 46.120000 2723 43.919355 62 610185.373 9841.699565 1227 19.790323 222 3.580645 103 1.661290 132 2.129032 2244 36.193548 5805 47.975207 121 1195064.265 9876.564174 2792 23.074380 543 4.487603 261 2.157025 252 2.082645 4267 35.264463 10396 45.004329 231 1989225.131 8611.364203 4768 20.640693 1243 5.380952 513 2.220779 479 2.073593 6908 29.904762 17549 46.181579 380 3134336.415 8248.253724 10198 26.836842 2395 6.302632 972 2.557895 806 2.121053 10495 27.618421
o3Dg7baW8dXq7Jq7NzlVrWG4mZNVvqp62oWBDO/ybeE= 2017-02-07 2015-03-10 700 1 1 1 274.176 274.176 0 0 0 0 0 0 0 0 1 1 161 26.833333 6 40311.015 6718.50250 52 8.666667 7 1.166667 7 1.166667 15 2.50 137 22.833333 253 25.300000 10 65631.235 6563.123500 84 8.400000 16 1.600000 16 1.600000 25 2.500000 218 21.800000 811 38.619048 21 201286.586 9585.075524 236 11.238095 43 2.047619 38 1.809524 53 2.523810 699 33.285714 1428 33.209302 43 408254.891 9494.299791 460 10.697674 78 1.813953 69 1.604651 101 2.348837 1430 33.255814 1648 32.313725 51 487842.759 9565.544294 495 9.705882 92 1.803922 76 1.490196 105 2.058824 1735 34.019608 3255 29.062500 112 968253.587 8645.121312 1114 9.946429 206 1.839286 164 1.464286 230 2.053571 3430 30.625000 7475 25.775862 290 2826851.820 9747.764897 3059 10.548276 692 2.386207 535 1.844828 567 1.955172 10075 34.741379
6ERcO7aqAKvrQ2CAvah79dVC7tJVZSjNti1MBfpNVW4= 2017-02-17 2015-01-01 778 13 13 1 10363.972 10363.972 5 5 1 1 1 1 4 4 41 41 219 43.800000 5 68118.548 13623.70960 25 5.000000 6 1.200000 6 1.200000 7 1.40 289 57.800000 480 48.000000 10 121094.373 12109.437300 83 8.300000 24 2.400000 9 0.900000 11 1.100000 498 49.800000 870 43.500000 20 210674.360 10533.718000 190 9.500000 37 1.850000 20 1.000000 28 1.400000 827 41.350000 2390 44.259259 54 604775.769 11199.551278 490 9.074074 116 2.148148 79 1.462963 92 1.703704 2318 42.925926 4427 41.764151 106 1186775.568 11195.995925 750 7.075472 212 2.000000 143 1.349057 179 1.688679 4631 43.688679 9683 41.917749 231 2677487.134 11590.853394 1323 5.727273 425 1.839827 340 1.471861 396 1.714286 10499 45.450216 17589 35.605263 494 4836694.885 9790.880334 2799 5.665992 1023 2.070850 772 1.562753 770 1.558704 18443 37.334008
Xt9VAHNtHuST21tkcZSnGKjwv8vF8/COnsf6z28+fKk= 2017-02-28 2016-09-08 173 86 86 1 21094.770 21094.770 10 10 6 6 7 7 4 4 72 72 133 26.600000 5 32936.237 6587.24740 18 3.600000 8 1.600000 7 1.400000 8 1.60 112 22.400000 313 31.300000 10 72406.047 7240.604700 86 8.600000 18 1.800000 11 1.100000 13 1.300000 251 25.100000 493 21.434783 23 114170.059 4963.915609 126 5.478261 27 1.173913 16 0.695652 16 0.695652 395 17.173913 2060 34.915254 59 406402.278 6888.174203 781 13.237288 116 1.966102 75 1.271186 69 1.169492 1404 23.796610 3840 37.281553 103 799700.121 7764.078845 1387 13.466019 219 2.126214 159 1.543689 168 1.631068 2696 26.174757 3840 37.281553 103 799700.121 7764.078845 1387 13.466019 219 2.126214 159 1.543689 168 1.631068 2696 26.174757 3840 37.281553 103 799700.121 7764.078845 1387 13.466019 219 2.126214 159 1.543689 168 1.631068 2696 26.174757
zSgTJqoosTiFF7ZZi1DPTHgxLbnd99IgOEsTIDCcZHc= 2017-02-13 2015-01-01 774 22 22 1 2803.117 2803.117 9 9 0 0 3 3 0 0 10 10 50 12.500000 4 7935.679 1983.91975 17 4.250000 2 0.500000 3 0.750000 1 0.25 31 7.750000 101 14.428571 7 17605.267 2515.038143 41 5.857143 5 0.714286 4 0.571429 1 0.142857 71 10.142857 301 21.500000 14 40235.862 2873.990143 148 10.571429 15 1.071429 9 0.642857 7 0.500000 152 10.857143 759 23.000000 33 124239.481 3764.832758 284 8.606061 28 0.848485 21 0.636364 24 0.727273 466 14.121212 2722 32.023529 85 461324.690 5427.349294 1185 13.941176 96 1.129412 70 0.823529 123 1.447059 1712 20.141176 9576 46.712195 205 1931032.095 9419.668756 3431 16.736585 401 1.956098 278 1.356098 376 1.834146 7359 35.897561 16913 57.921233 292 3583828.437 12273.385058 5126 17.554795 591 2.023973 440 1.506849 559 1.914384 13815 47.311644

Good, we get the expected number of columns.
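We can also confirm the count programmatically (a quick sketch; the arithmetic assumes the windows and aggregations defined above):

#3 date features + 8 windows * (7 usage stats * 2 aggregates + 1 count) = 123 columns
expected_cols = 3 + 8 * (7 * 2 + 1)
assert len(user_logs_features.columns) == expected_cols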

5. Transaction Data: Preparation and Feature Extraction

Having completed feature extraction for user logs, we now move on to creating features for the transaction data.

We begin by grouping the data by user.

# Grouping by the member (msno)
transactions_gb = transactions.sort_values(["transaction_date"]).groupby(['msno'])

# How many groups, i.e. members, i.e. msno's? We're good if this matches the number of users in the labels table
print('%d Groups/msnos' %(len(transactions_gb.groups)))
print('%d Features' %(len(transactions.columns)))
99825 Groups/msnos
9 Features

We plan to create the following features from the transactions table:

  • Simple features from the latest transaction:
    • Plan number of days
    • Plan total amount paid
    • Plan list price
    • Is_auto_renew
    • Is_cancel
  • Synthetic features from the latest transaction:
    • Plan actual amount paid per day
  • Aggregate values:
    • Total number of plan days
    • Total of all the amounts paid for the plan
  • Comparing transactions:
    • Plan day difference between the latest and previous transaction
    • Amount paid per day difference between the latest and previous transaction

We begin by creating the total_plan_days and total_amount_paid:

# Features: Total_plan_days, Total_amount_paid
transactions_features = (transactions_gb
    .agg({'payment_plan_days':'sum', 'actual_amount_paid':'sum' })
    .rename(columns={'payment_plan_days': 'total_plan_days', 'actual_amount_paid': 'total_amount_paid',})
          )
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))

transactions_features.head()
99825 Entries in the DF: 
2 Features
total_plan_days total_amount_paid
msno
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= 543 2831
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= 90 297
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= 513 2682
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= 270 891
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= 457 2235

Next, we add amount_paid_per_day for a user's entire tenure:

# Plan actual amount paid/day for all the transactions by a user
# Adding the column amount_paid_per_day
transactions_features['amount_paid_per_day'] = (transactions_features['total_amount_paid']
                                                /transactions_features['total_plan_days'])

print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))

transactions_features.head()
99825 Entries in the DF: 
3 Features
total_plan_days total_amount_paid amount_paid_per_day
msno
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= 543 2831 5.213628
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= 90 297 3.300000
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= 513 2682 5.228070
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= 270 891 3.300000
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= 457 2235 4.890591

Next, we add latest_payment_method_id, latest_plan_days, latest_plan_list_price, latest_amount_paid, latest_auto_renew, latest_transaction_date, latest_expire_date, and latest_is_cancel. We accomplish this by taking the last row of each user's date-ordered group.

# Features: latest transaction, renaming the columns
# V1- Fixed the name for the plan_list_price column (now called latest_plan_list_price)

latest_transaction = transactions_gb.tail(1).rename(columns={'payment_method_id': 'latest_payment_method_id',
                                                                  'payment_plan_days': 'latest_plan_days',
                                                                  'plan_list_price': 'latest_plan_list_price',
                                                                  'actual_amount_paid': 'latest_amount_paid',
                                                                  'is_auto_renew': 'latest_auto_renew', 
                                                                  'transaction_date': 'latest_transaction_date',
                                                                  'membership_expire_date': 'latest_expire_date',
                                                                  'is_cancel': 'latest_is_cancel' })

# Index by msno
latest_transaction.set_index('msno', inplace = True)

print('%d Entries in the DF: ' %(len(latest_transaction)))
print('%d Features' %(len(latest_transaction.columns)))

latest_transaction.head()
99825 Entries in the DF: 
8 Features
latest_payment_method_id latest_plan_days latest_plan_list_price latest_amount_paid latest_auto_renew latest_transaction_date latest_expire_date latest_is_cancel
msno
z1Lm/BlRQraiaWJ7RaQWe0+l0Z40ACj7W+zk29FiaS4= 38 30 149 149 0 20150102 20150503 0
IwE/pih8PuqrY/rsnoZ/4TazDliyH9S8VWNc2/d7mJg= 38 30 149 149 0 20150102 20150702 0
ea9rY0uEPY0ImD2QVbYFb+z3zi5wniKWMUM1V8os7OY= 32 410 1788 1788 0 20150104 20170213 0
plhzwjmNJp0HW04NidfVa35JE216RaFYpSeUCwT11zQ= 38 30 149 149 0 20150120 20170103 0
PbSQ2KxR4gRnzjsRd8Up75qMYb70iuMwGk10/jPRljk= 38 360 1200 1200 0 20150123 20170212 0

Next, we add latest_amount_paid_per_day:

# Plan actual amount paid/day for the latest transaction
# Adding the column latest_amount_paid_per_day

latest_transaction['latest_amount_paid_per_day'] = (latest_transaction['latest_amount_paid']
                                                /latest_transaction['latest_plan_days'])

print('%d Entries in the DF: ' %(len(latest_transaction)))
print('%d Features' %(len(latest_transaction.columns)))

latest_transaction.head()
99825 Entries in the DF: 
9 Features
latest_payment_method_id latest_plan_days latest_plan_list_price latest_amount_paid latest_auto_renew latest_transaction_date latest_expire_date latest_is_cancel latest_amount_paid_per_day
msno
z1Lm/BlRQraiaWJ7RaQWe0+l0Z40ACj7W+zk29FiaS4= 38 30 149 149 0 20150102 20150503 0 4.966667
IwE/pih8PuqrY/rsnoZ/4TazDliyH9S8VWNc2/d7mJg= 38 30 149 149 0 20150102 20150702 0 4.966667
ea9rY0uEPY0ImD2QVbYFb+z3zi5wniKWMUM1V8os7OY= 32 410 1788 1788 0 20150104 20170213 0 4.360976
plhzwjmNJp0HW04NidfVa35JE216RaFYpSeUCwT11zQ= 38 30 149 149 0 20150120 20170103 0 4.966667
PbSQ2KxR4gRnzjsRd8Up75qMYb70iuMwGk10/jPRljk= 38 360 1200 1200 0 20150123 20170212 0 3.333333

Next, we compare two different items in our transaction data:

  • Plan duration difference between the last 2 transactions
  • Cost difference between the last 2 transactions
# Getting the 2 latest transactions and grouping by msno again
latest_transaction2_gb = transactions_gb.tail(2).groupby(['msno'])

# Getting the latest-but-one transaction (.copy() avoids a SettingWithCopyWarning when we add a column below)
latest2_transaction = latest_transaction2_gb.head(1).copy()

# Index by msno
latest2_transaction.set_index('msno', inplace = True)

# Amount paid per day for the 2nd latest transaction
latest2_transaction['latest2_amount_paid_per_day'] = (latest2_transaction['actual_amount_paid']
                                                /latest2_transaction['payment_plan_days'])

# Difference in the renewal length between the latest 2 transactions
transactions_features['diff_renewal_duration'] = (latest_transaction['latest_plan_days']
                                                - latest2_transaction['payment_plan_days'])

# Difference in plan cost between the latest 2 transactions
transactions_features['diff_plan_amount_paid_per_day'] = (latest_transaction['latest_amount_paid_per_day'] 
                                                          - latest2_transaction['latest2_amount_paid_per_day'])

print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))

transactions_features.head()


99825 Entries in the DF: 
5 Features
total_plan_days total_amount_paid amount_paid_per_day diff_renewal_duration diff_plan_amount_paid_per_day
msno
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= 543 2831 5.213628 0 0.000000
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= 90 297 3.300000 0 0.000000
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= 513 2682 5.228070 0 0.000000
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= 270 891 3.300000 0 0.000000
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= 457 2235 4.890591 23 4.966667

Finally, we join all the features into a single data frame:

# Get all transaction features in a single DF
transactions_features = transactions_features.join(latest_transaction, how = 'inner')

# Test
print('%d Entries in the DF: ' %(len(transactions_features)))
print('%d Features' %(len(transactions_features.columns)))

transactions_features.head()
99825 Entries in the DF: 
14 Features
total_plan_days total_amount_paid amount_paid_per_day diff_renewal_duration diff_plan_amount_paid_per_day latest_payment_method_id latest_plan_days latest_plan_list_price latest_amount_paid latest_auto_renew latest_transaction_date latest_expire_date latest_is_cancel latest_amount_paid_per_day
msno
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= 543 2831 5.213628 0 0.000000 39 30 149 149 1 20170131 20170319 0 4.966667
++5nv+2nsvrWM7dOT+ZiWJ5uTZOzQS0NEvqu3jidTjU= 90 297 3.300000 0 0.000000 41 30 99 99 1 20170201 20170301 0 3.300000
++7IULiyKbNc8jllqhRuyKZjX1J4mPF4tsudFCJfv4k= 513 2682 5.228070 0 0.000000 37 30 149 149 1 20170201 20170301 0 4.966667
++Ck01c3EF07Ejek2jfXlKut+sEfg+0ry+A5uWeL9vY= 270 891 3.300000 0 0.000000 41 30 99 99 1 20170214 20170314 0 3.300000
++FPL1dXZBXC3Cf6gE0HQiIHg1Pd+DBdK7w52xcUmX0= 457 2235 4.890591 23 4.966667 41 30 149 149 1 20160225 20160225 1 4.966667
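Before moving on, note that amount_paid_per_day and latest_amount_paid_per_day can become infinite when a plan has 0 payment_plan_days, and the diff features can be NaN in edge cases. A quick check (a small sketch):

#Count problem values that we'll need to handle before modeling (see Section 6)
print('%d infinite amount_paid_per_day values' % np.isinf(transactions_features['amount_paid_per_day']).sum())
print('%d missing diff_plan_amount_paid_per_day values' % transactions_features['diff_plan_amount_paid_per_day'].isnull().sum())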

6. Joining Features and Data Manipulation

Joining Features

Having completed the per-user features from the User Logs and Transactions tables, we will now join them together with the Members and Labels (a.k.a. train) tables into a single data frame for predictive modeling.

First, we'll join the Members and Labels together:

#Join members and labels files
df_fa = None
df_fa = members.join(labels, how='inner')

df_fa.head()
city bd gender registered_via registration_init_time is_churn
msno
mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= 1 0 NaN 13 20170120 0
AFcKYsrudzim8OFa+fL/c9g5gZabAbhaJnoM0qmlJfo= 1 0 NaN 13 20160907 0
qk4mEZUYZq+4sQE7bzRYKc5Pvj+Xc7Wmu25DrCzltEU= 1 0 NaN 13 20160902 0
G2UGNLph2J6euGmZ7WIa1+Kc+dPZBJI0HbLPu5YtrZw= 1 0 NaN 13 20161028 0
EqSHZpMj5uddJvv2gXcHvuOKFOdS5NN6RalHfzEhhaI= 1 0 NaN 13 20161004 0

Next, we join the User Logs features table with the combined Members and Labels table:

df_fa = df_fa.join(user_logs_features, how='inner')
#Note, the warning is okay, and actually helps us by flattening our column headers.

df_fa.head()
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:558: UserWarning: merging between different levels can give an unintended result (1 levels on the left, 3 on the right)
  warnings.warn(msg, UserWarning)
city bd gender registered_via registration_init_time is_churn (date_features, date, max_date) (date_features, date, min_date) (date_features, date, listening_tenure) (within_days_1, num_unq, sum) (within_days_1, num_unq, mean) (within_days_1, num_unq, count) (within_days_1, total_secs, sum) (within_days_1, total_secs, mean) (within_days_1, num_25, sum) (within_days_1, num_25, mean) (within_days_1, num_50, sum) (within_days_1, num_50, mean) (within_days_1, num_75, sum) (within_days_1, num_75, mean) (within_days_1, num_985, sum) (within_days_1, num_985, mean) (within_days_1, num_100, sum) (within_days_1, num_100, mean) (within_days_7, num_unq, sum) (within_days_7, num_unq, mean) (within_days_7, num_unq, count) (within_days_7, total_secs, sum) (within_days_7, total_secs, mean) (within_days_7, num_25, sum) (within_days_7, num_25, mean) (within_days_7, num_50, sum) (within_days_7, num_50, mean) (within_days_7, num_75, sum) (within_days_7, num_75, mean) (within_days_7, num_985, sum) (within_days_7, num_985, mean) (within_days_7, num_100, sum) (within_days_7, num_100, mean) (within_days_14, num_unq, sum) (within_days_14, num_unq, mean) (within_days_14, num_unq, count) (within_days_14, total_secs, sum) (within_days_14, total_secs, mean) (within_days_14, num_25, sum) (within_days_14, num_25, mean) (within_days_14, num_50, sum) (within_days_14, num_50, mean) (within_days_14, num_75, sum) (within_days_14, num_75, mean) (within_days_14, num_985, sum) (within_days_14, num_985, mean) (within_days_14, num_100, sum) (within_days_14, num_100, mean) (within_days_31, num_unq, sum) (within_days_31, num_unq, mean) (within_days_31, num_unq, count) (within_days_31, total_secs, sum) (within_days_31, total_secs, mean) (within_days_31, num_25, sum) (within_days_31, num_25, mean) (within_days_31, num_50, sum) (within_days_31, num_50, mean) (within_days_31, num_75, sum) (within_days_31, num_75, mean) (within_days_31, num_985, sum) (within_days_31, num_985, mean) (within_days_31, num_100, sum) (within_days_31, num_100, mean) (within_days_90, num_unq, sum) (within_days_90, num_unq, mean) (within_days_90, num_unq, count) (within_days_90, total_secs, sum) (within_days_90, total_secs, mean) (within_days_90, num_25, sum) (within_days_90, num_25, mean) (within_days_90, num_50, sum) (within_days_90, num_50, mean) (within_days_90, num_75, sum) (within_days_90, num_75, mean) (within_days_90, num_985, sum) (within_days_90, num_985, mean) (within_days_90, num_100, sum) (within_days_90, num_100, mean) (within_days_180, num_unq, sum) (within_days_180, num_unq, mean) (within_days_180, num_unq, count) (within_days_180, total_secs, sum) (within_days_180, total_secs, mean) (within_days_180, num_25, sum) (within_days_180, num_25, mean) (within_days_180, num_50, sum) (within_days_180, num_50, mean) (within_days_180, num_75, sum) (within_days_180, num_75, mean) (within_days_180, num_985, sum) (within_days_180, num_985, mean) (within_days_180, num_100, sum) (within_days_180, num_100, mean) (within_days_365, num_unq, sum) (within_days_365, num_unq, mean) (within_days_365, num_unq, count) (within_days_365, total_secs, sum) (within_days_365, total_secs, mean) (within_days_365, num_25, sum) (within_days_365, num_25, mean) (within_days_365, num_50, sum) (within_days_365, num_50, mean) (within_days_365, num_75, sum) (within_days_365, num_75, mean) (within_days_365, num_985, sum) (within_days_365, num_985, mean) (within_days_365, num_100, sum) (within_days_365, num_100, mean) (within_days_9999, num_unq, sum) (within_days_9999, num_unq, mean) 
(within_days_9999, num_unq, count) (within_days_9999, total_secs, sum) (within_days_9999, total_secs, mean) (within_days_9999, num_25, sum) (within_days_9999, num_25, mean) (within_days_9999, num_50, sum) (within_days_9999, num_50, mean) (within_days_9999, num_75, sum) (within_days_9999, num_75, mean) (within_days_9999, num_985, sum) (within_days_9999, num_985, mean) (within_days_9999, num_100, sum) (within_days_9999, num_100, mean)
msno
mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= 1 0 NaN 13 20170120 0 2017-02-24 2017-01-20 35 1 1 1 10.068 10.068 2 2 0 0 0 0 0 0 0 0 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 27 6.750000 4 3158.450 789.612500 33 8.250000 1 0.250000 1 0.250000 1 0.250000 9 2.250000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000
AFcKYsrudzim8OFa+fL/c9g5gZabAbhaJnoM0qmlJfo= 1 0 NaN 13 20160907 0 2017-02-27 2016-09-07 173 21 21 1 2633.631 2633.631 13 13 3 3 1 1 1 1 8 8 228 32.571429 7 32731.138 4675.876857 140 20.000000 29 4.142857 14 2.000000 20 2.857143 95 13.571429 512 36.571429 14 98422.408 7030.172000 301 21.500000 71 5.071429 34 2.428571 42 3.000000 305 21.785714 1044 36.000000 29 178909.861 6169.305552 656 22.620690 135 4.655172 61 2.103448 74 2.551724 571 19.689655 4298 58.876712 73 632743.845 8667.723904 2717 37.219178 393 5.383562 188 2.575342 189 2.589041 2094 28.684932 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701
qk4mEZUYZq+4sQE7bzRYKc5Pvj+Xc7Wmu25DrCzltEU= 1 0 NaN 13 20160902 0 2017-02-26 2016-09-02 177 1 1 1 271.093 271.093 0 0 0 0 0 0 0 0 1 1 243 34.714286 7 60581.740 8654.534286 3 0.428571 1 0.142857 2 0.285714 1 0.142857 238 34.000000 489 34.928571 14 122772.792 8769.485143 32 2.285714 2 0.142857 5 0.357143 11 0.785714 476 34.000000 899 32.107143 28 231622.820 8272.243571 73 2.607143 7 0.250000 11 0.392857 14 0.500000 893 31.892857 2396 35.235294 68 770040.608 11324.126588 180 2.647059 32 0.470588 32 0.470588 33 0.485294 2953 43.426471 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936
G2UGNLph2J6euGmZ7WIa1+Kc+dPZBJI0HbLPu5YtrZw= 1 0 NaN 13 20161028 0 2017-02-28 2016-10-28 123 17 17 1 1626.704 1626.704 15 15 1 1 1 1 0 0 6 6 121 24.200000 5 30054.147 6010.829400 29 5.800000 10 2.000000 5 1.000000 2 0.400000 123 24.600000 192 17.454545 11 43518.795 3956.254091 44 4.000000 11 1.000000 7 0.636364 7 0.636364 174 15.818182 457 17.576923 26 111841.140 4301.582308 80 3.076923 23 0.884615 15 0.576923 15 0.576923 456 17.538462 1229 21.189655 58 287422.839 4955.566190 203 3.500000 81 1.396552 55 0.948276 60 1.034483 1145 19.741379 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783
EqSHZpMj5uddJvv2gXcHvuOKFOdS5NN6RalHfzEhhaI= 1 0 NaN 13 20161004 0 2016-10-26 2016-10-04 22 1 1 1 156.204 156.204 0 0 0 0 1 1 0 0 0 0 14 7.000000 2 2399.824 1199.912000 4 2.000000 5 2.500000 4 2.000000 0 0.000000 5 2.500000 17 4.250000 4 2630.818 657.704500 5 1.250000 7 1.750000 4 1.000000 0 0.000000 5 1.250000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000

Finally, we'll join in our Transaction features:

# Joining feature DF's
df_fa = df_fa.join(transactions_features, how='inner')

print('%d Entries in the DF: ' %(len(df_fa)))
print('%d Features' %(len(df_fa.columns)))
df_fa.head()
88544 Entries in the DF: 
143 Features
city bd gender registered_via registration_init_time is_churn (date_features, date, max_date) (date_features, date, min_date) (date_features, date, listening_tenure) (within_days_1, num_unq, sum) (within_days_1, num_unq, mean) (within_days_1, num_unq, count) (within_days_1, total_secs, sum) (within_days_1, total_secs, mean) (within_days_1, num_25, sum) (within_days_1, num_25, mean) (within_days_1, num_50, sum) (within_days_1, num_50, mean) (within_days_1, num_75, sum) (within_days_1, num_75, mean) (within_days_1, num_985, sum) (within_days_1, num_985, mean) (within_days_1, num_100, sum) (within_days_1, num_100, mean) (within_days_7, num_unq, sum) (within_days_7, num_unq, mean) (within_days_7, num_unq, count) (within_days_7, total_secs, sum) (within_days_7, total_secs, mean) (within_days_7, num_25, sum) (within_days_7, num_25, mean) (within_days_7, num_50, sum) (within_days_7, num_50, mean) (within_days_7, num_75, sum) (within_days_7, num_75, mean) (within_days_7, num_985, sum) (within_days_7, num_985, mean) (within_days_7, num_100, sum) (within_days_7, num_100, mean) (within_days_14, num_unq, sum) (within_days_14, num_unq, mean) (within_days_14, num_unq, count) (within_days_14, total_secs, sum) (within_days_14, total_secs, mean) (within_days_14, num_25, sum) (within_days_14, num_25, mean) (within_days_14, num_50, sum) (within_days_14, num_50, mean) (within_days_14, num_75, sum) (within_days_14, num_75, mean) (within_days_14, num_985, sum) (within_days_14, num_985, mean) (within_days_14, num_100, sum) (within_days_14, num_100, mean) (within_days_31, num_unq, sum) (within_days_31, num_unq, mean) (within_days_31, num_unq, count) (within_days_31, total_secs, sum) (within_days_31, total_secs, mean) (within_days_31, num_25, sum) (within_days_31, num_25, mean) (within_days_31, num_50, sum) (within_days_31, num_50, mean) (within_days_31, num_75, sum) (within_days_31, num_75, mean) (within_days_31, num_985, sum) (within_days_31, num_985, mean) (within_days_31, num_100, sum) (within_days_31, num_100, mean) (within_days_90, num_unq, sum) (within_days_90, num_unq, mean) (within_days_90, num_unq, count) (within_days_90, total_secs, sum) (within_days_90, total_secs, mean) (within_days_90, num_25, sum) (within_days_90, num_25, mean) (within_days_90, num_50, sum) (within_days_90, num_50, mean) (within_days_90, num_75, sum) (within_days_90, num_75, mean) (within_days_90, num_985, sum) (within_days_90, num_985, mean) (within_days_90, num_100, sum) (within_days_90, num_100, mean) (within_days_180, num_unq, sum) (within_days_180, num_unq, mean) (within_days_180, num_unq, count) (within_days_180, total_secs, sum) (within_days_180, total_secs, mean) (within_days_180, num_25, sum) (within_days_180, num_25, mean) (within_days_180, num_50, sum) (within_days_180, num_50, mean) (within_days_180, num_75, sum) (within_days_180, num_75, mean) (within_days_180, num_985, sum) (within_days_180, num_985, mean) (within_days_180, num_100, sum) (within_days_180, num_100, mean) (within_days_365, num_unq, sum) (within_days_365, num_unq, mean) (within_days_365, num_unq, count) (within_days_365, total_secs, sum) (within_days_365, total_secs, mean) (within_days_365, num_25, sum) (within_days_365, num_25, mean) (within_days_365, num_50, sum) (within_days_365, num_50, mean) (within_days_365, num_75, sum) (within_days_365, num_75, mean) (within_days_365, num_985, sum) (within_days_365, num_985, mean) (within_days_365, num_100, sum) (within_days_365, num_100, mean) (within_days_9999, num_unq, sum) (within_days_9999, num_unq, mean) 
(within_days_9999, num_unq, count) (within_days_9999, total_secs, sum) (within_days_9999, total_secs, mean) (within_days_9999, num_25, sum) (within_days_9999, num_25, mean) (within_days_9999, num_50, sum) (within_days_9999, num_50, mean) (within_days_9999, num_75, sum) (within_days_9999, num_75, mean) (within_days_9999, num_985, sum) (within_days_9999, num_985, mean) (within_days_9999, num_100, sum) (within_days_9999, num_100, mean) total_plan_days total_amount_paid amount_paid_per_day diff_renewal_duration diff_plan_amount_paid_per_day latest_payment_method_id latest_plan_days latest_plan_list_price latest_amount_paid latest_auto_renew latest_transaction_date latest_expire_date latest_is_cancel latest_amount_paid_per_day
msno
mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= 1 0 NaN 13 20170120 0 2017-02-24 2017-01-20 35 1 1 1 10.068 10.068 2 2 0 0 0 0 0 0 0 0 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 27 6.750000 4 3158.450 789.612500 33 8.250000 1 0.250000 1 0.250000 1 0.250000 9 2.250000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 60 258 4.300000 0 0.0 30 30 129 129 1 20170220 20170319 0 4.300000
AFcKYsrudzim8OFa+fL/c9g5gZabAbhaJnoM0qmlJfo= 1 0 NaN 13 20160907 0 2017-02-27 2016-09-07 173 21 21 1 2633.631 2633.631 13 13 3 3 1 1 1 1 8 8 228 32.571429 7 32731.138 4675.876857 140 20.000000 29 4.142857 14 2.000000 20 2.857143 95 13.571429 512 36.571429 14 98422.408 7030.172000 301 21.500000 71 5.071429 34 2.428571 42 3.000000 305 21.785714 1044 36.000000 29 178909.861 6169.305552 656 22.620690 135 4.655172 61 2.103448 74 2.551724 571 19.689655 4298 58.876712 73 632743.845 8667.723904 2717 37.219178 393 5.383562 188 2.575342 189 2.589041 2094 28.684932 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 180 774 4.300000 0 0.0 30 30 129 129 1 20170207 20170306 0 4.300000
qk4mEZUYZq+4sQE7bzRYKc5Pvj+Xc7Wmu25DrCzltEU= 1 0 NaN 13 20160902 0 2017-02-26 2016-09-02 177 1 1 1 271.093 271.093 0 0 0 0 0 0 0 0 1 1 243 34.714286 7 60581.740 8654.534286 3 0.428571 1 0.142857 2 0.285714 1 0.142857 238 34.000000 489 34.928571 14 122772.792 8769.485143 32 2.285714 2 0.142857 5 0.357143 11 0.785714 476 34.000000 899 32.107143 28 231622.820 8272.243571 73 2.607143 7 0.250000 11 0.392857 14 0.500000 893 31.892857 2396 35.235294 68 770040.608 11324.126588 180 2.647059 32 0.470588 32 0.470588 33 0.485294 2953 43.426471 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 180 774 4.300000 0 0.0 30 30 129 129 1 20170202 20170301 0 4.300000
G2UGNLph2J6euGmZ7WIa1+Kc+dPZBJI0HbLPu5YtrZw= 1 0 NaN 13 20161028 0 2017-02-28 2016-10-28 123 17 17 1 1626.704 1626.704 15 15 1 1 1 1 0 0 6 6 121 24.200000 5 30054.147 6010.829400 29 5.800000 10 2.000000 5 1.000000 2 0.400000 123 24.600000 192 17.454545 11 43518.795 3956.254091 44 4.000000 11 1.000000 7 0.636364 7 0.636364 174 15.818182 457 17.576923 26 111841.140 4301.582308 80 3.076923 23 0.884615 15 0.576923 15 0.576923 456 17.538462 1229 21.189655 58 287422.839 4955.566190 203 3.500000 81 1.396552 55 0.948276 60 1.034483 1145 19.741379 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 150 596 3.973333 0 0.0 30 30 149 149 1 20170228 20170327 0 4.966667
EqSHZpMj5uddJvv2gXcHvuOKFOdS5NN6RalHfzEhhaI= 1 0 NaN 13 20161004 0 2016-10-26 2016-10-04 22 1 1 1 156.204 156.204 0 0 0 0 1 1 0 0 0 0 14 7.000000 2 2399.824 1199.912000 4 2.000000 5 2.500000 4 2.000000 0 0.000000 5 2.500000 17 4.250000 4 2630.818 657.704500 5 1.250000 7 1.750000 4 1.000000 0 0.000000 5 1.250000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 150 645 4.300000 0 0.0 30 30 129 129 1 20170204 20170303 0 4.300000

Data Manipulation

Having joined all the features into a single file, we will now perform some data manipulation tasks to prepare the table for predictive modeling.

First, we will fix the column headers. After the join above, each user-log feature header is a tuple of strings (e.g., ('within_days_7', 'num_unq', 'sum')), so mapping ''.join over the columns collapses each tuple into a single flat name; plain string headers like 'city' pass through unchanged:

#Fix column headers by joining each multi-level header tuple into one flat string
df_fa.columns = df_fa.columns.map(''.join)
df_fa.head()
city bd gender registered_via registration_init_time is_churn date_featuresdatemax_date date_featuresdatemin_date date_featuresdatelistening_tenure within_days_1num_unqsum within_days_1num_unqmean within_days_1num_unqcount within_days_1total_secssum within_days_1total_secsmean within_days_1num_25sum within_days_1num_25mean within_days_1num_50sum within_days_1num_50mean within_days_1num_75sum within_days_1num_75mean within_days_1num_985sum within_days_1num_985mean within_days_1num_100sum within_days_1num_100mean within_days_7num_unqsum within_days_7num_unqmean within_days_7num_unqcount within_days_7total_secssum within_days_7total_secsmean within_days_7num_25sum within_days_7num_25mean within_days_7num_50sum within_days_7num_50mean within_days_7num_75sum within_days_7num_75mean within_days_7num_985sum within_days_7num_985mean within_days_7num_100sum within_days_7num_100mean within_days_14num_unqsum within_days_14num_unqmean within_days_14num_unqcount within_days_14total_secssum within_days_14total_secsmean within_days_14num_25sum within_days_14num_25mean within_days_14num_50sum within_days_14num_50mean within_days_14num_75sum within_days_14num_75mean within_days_14num_985sum within_days_14num_985mean within_days_14num_100sum within_days_14num_100mean within_days_31num_unqsum within_days_31num_unqmean within_days_31num_unqcount within_days_31total_secssum within_days_31total_secsmean within_days_31num_25sum within_days_31num_25mean within_days_31num_50sum within_days_31num_50mean within_days_31num_75sum within_days_31num_75mean within_days_31num_985sum within_days_31num_985mean within_days_31num_100sum within_days_31num_100mean within_days_90num_unqsum within_days_90num_unqmean within_days_90num_unqcount within_days_90total_secssum within_days_90total_secsmean within_days_90num_25sum within_days_90num_25mean within_days_90num_50sum within_days_90num_50mean within_days_90num_75sum within_days_90num_75mean within_days_90num_985sum within_days_90num_985mean within_days_90num_100sum within_days_90num_100mean within_days_180num_unqsum within_days_180num_unqmean within_days_180num_unqcount within_days_180total_secssum within_days_180total_secsmean within_days_180num_25sum within_days_180num_25mean within_days_180num_50sum within_days_180num_50mean within_days_180num_75sum within_days_180num_75mean within_days_180num_985sum within_days_180num_985mean within_days_180num_100sum within_days_180num_100mean within_days_365num_unqsum within_days_365num_unqmean within_days_365num_unqcount within_days_365total_secssum within_days_365total_secsmean within_days_365num_25sum within_days_365num_25mean within_days_365num_50sum within_days_365num_50mean within_days_365num_75sum within_days_365num_75mean within_days_365num_985sum within_days_365num_985mean within_days_365num_100sum within_days_365num_100mean within_days_9999num_unqsum within_days_9999num_unqmean within_days_9999num_unqcount within_days_9999total_secssum within_days_9999total_secsmean within_days_9999num_25sum within_days_9999num_25mean within_days_9999num_50sum within_days_9999num_50mean within_days_9999num_75sum within_days_9999num_75mean within_days_9999num_985sum within_days_9999num_985mean within_days_9999num_100sum within_days_9999num_100mean total_plan_days total_amount_paid amount_paid_per_day diff_renewal_duration diff_plan_amount_paid_per_day latest_payment_method_id latest_plan_days latest_plan_list_price latest_amount_paid latest_auto_renew latest_transaction_date latest_expire_date latest_is_cancel latest_amount_paid_per_day
msno
mKfgXQAmVeSKzN4rXW37qz0HbGCuYBspTBM3ONXZudg= 1 0 NaN 13 20170120 0 2017-02-24 2017-01-20 35 1 1 1 10.068 10.068 2 2 0 0 0 0 0 0 0 0 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 1 1.000000 1 10.068 10.068000 2 2.000000 0 0.000000 0 0.000000 0 0.000000 0 0.000000 27 6.750000 4 3158.450 789.612500 33 8.250000 1 0.250000 1 0.250000 1 0.250000 9 2.250000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 29 4.833333 6 3245.638 540.939667 41 6.833333 1 0.166667 1 0.166667 1 0.166667 9 1.500000 60 258 4.300000 0 0.0 30 30 129 129 1 20170220 20170319 0 4.300000
AFcKYsrudzim8OFa+fL/c9g5gZabAbhaJnoM0qmlJfo= 1 0 NaN 13 20160907 0 2017-02-27 2016-09-07 173 21 21 1 2633.631 2633.631 13 13 3 3 1 1 1 1 8 8 228 32.571429 7 32731.138 4675.876857 140 20.000000 29 4.142857 14 2.000000 20 2.857143 95 13.571429 512 36.571429 14 98422.408 7030.172000 301 21.500000 71 5.071429 34 2.428571 42 3.000000 305 21.785714 1044 36.000000 29 178909.861 6169.305552 656 22.620690 135 4.655172 61 2.103448 74 2.551724 571 19.689655 4298 58.876712 73 632743.845 8667.723904 2717 37.219178 393 5.383562 188 2.575342 189 2.589041 2094 28.684932 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 9218 59.857143 154 1232770.399 8005.002591 5289 34.344156 838 5.441558 410 2.662338 323 2.097403 4204 27.298701 180 774 4.300000 0 0.0 30 30 129 129 1 20170207 20170306 0 4.300000
qk4mEZUYZq+4sQE7bzRYKc5Pvj+Xc7Wmu25DrCzltEU= 1 0 NaN 13 20160902 0 2017-02-26 2016-09-02 177 1 1 1 271.093 271.093 0 0 0 0 0 0 0 0 1 1 243 34.714286 7 60581.740 8654.534286 3 0.428571 1 0.142857 2 0.285714 1 0.142857 238 34.000000 489 34.928571 14 122772.792 8769.485143 32 2.285714 2 0.142857 5 0.357143 11 0.785714 476 34.000000 899 32.107143 28 231622.820 8272.243571 73 2.607143 7 0.250000 11 0.392857 14 0.500000 893 31.892857 2396 35.235294 68 770040.608 11324.126588 180 2.647059 32 0.470588 32 0.470588 33 0.485294 2953 43.426471 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 3580 32.844037 109 1137009.556 10431.280330 423 3.880734 72 0.660550 58 0.532110 58 0.532110 4308 39.522936 180 774 4.300000 0 0.0 30 30 129 129 1 20170202 20170301 0 4.300000
G2UGNLph2J6euGmZ7WIa1+Kc+dPZBJI0HbLPu5YtrZw= 1 0 NaN 13 20161028 0 2017-02-28 2016-10-28 123 17 17 1 1626.704 1626.704 15 15 1 1 1 1 0 0 6 6 121 24.200000 5 30054.147 6010.829400 29 5.800000 10 2.000000 5 1.000000 2 0.400000 123 24.600000 192 17.454545 11 43518.795 3956.254091 44 4.000000 11 1.000000 7 0.636364 7 0.636364 174 15.818182 457 17.576923 26 111841.140 4301.582308 80 3.076923 23 0.884615 15 0.576923 15 0.576923 456 17.538462 1229 21.189655 58 287422.839 4955.566190 203 3.500000 81 1.396552 55 0.948276 60 1.034483 1145 19.741379 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 1441 20.884058 69 326268.069 4728.522739 247 3.579710 115 1.666667 74 1.072464 76 1.101449 1272 18.434783 150 596 3.973333 0 0.0 30 30 149 149 1 20170228 20170327 0 4.966667
EqSHZpMj5uddJvv2gXcHvuOKFOdS5NN6RalHfzEhhaI= 1 0 NaN 13 20161004 0 2016-10-26 2016-10-04 22 1 1 1 156.204 156.204 0 0 0 0 1 1 0 0 0 0 14 7.000000 2 2399.824 1199.912000 4 2.000000 5 2.500000 4 2.000000 0 0.000000 5 2.500000 17 4.250000 4 2630.818 657.704500 5 1.250000 7 1.750000 4 1.000000 0 0.000000 5 1.250000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 136 13.600000 10 15562.900 1556.290000 76 7.600000 36 3.600000 22 2.200000 7 0.700000 21 2.100000 150 645 4.300000 0 0.0 30 30 129 129 1 20170204 20170303 0 4.300000

Next, we will change infinite and NA values to -9999, a value wildly different from the others in the range, so that our algorithms treat these records as 'different'.

#Handle bad values
df_fa['amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['latest_amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['diff_plan_amount_paid_per_day'].replace([np.inf, -np.inf], -9999, inplace=True)
df_fa['diff_plan_amount_paid_per_day'].fillna(-9999, inplace=True)
df_fa.isnull().any()
city                                 False
bd                                   False
gender                                True
registered_via                       False
registration_init_time               False
is_churn                             False
date_featuresdatemax_date            False
date_featuresdatemin_date            False
date_featuresdatelistening_tenure    False
within_days_1num_unqsum              False
within_days_1num_unqmean             False
within_days_1num_unqcount            False
within_days_1total_secssum           False
within_days_1total_secsmean          False
within_days_1num_25sum               False
within_days_1num_25mean              False
within_days_1num_50sum               False
within_days_1num_50mean              False
within_days_1num_75sum               False
within_days_1num_75mean              False
within_days_1num_985sum              False
within_days_1num_985mean             False
within_days_1num_100sum              False
within_days_1num_100mean             False
within_days_7num_unqsum              False
within_days_7num_unqmean             False
within_days_7num_unqcount            False
within_days_7total_secssum           False
within_days_7total_secsmean          False
within_days_7num_25sum               False
within_days_7num_25mean              False
within_days_7num_50sum               False
within_days_7num_50mean              False
within_days_7num_75sum               False
within_days_7num_75mean              False
within_days_7num_985sum              False
within_days_7num_985mean             False
within_days_7num_100sum              False
within_days_7num_100mean             False
within_days_14num_unqsum             False
within_days_14num_unqmean            False
within_days_14num_unqcount           False
within_days_14total_secssum          False
within_days_14total_secsmean         False
within_days_14num_25sum              False
within_days_14num_25mean             False
within_days_14num_50sum              False
within_days_14num_50mean             False
within_days_14num_75sum              False
within_days_14num_75mean             False
within_days_14num_985sum             False
within_days_14num_985mean            False
within_days_14num_100sum             False
within_days_14num_100mean            False
within_days_31num_unqsum             False
within_days_31num_unqmean            False
within_days_31num_unqcount           False
within_days_31total_secssum          False
within_days_31total_secsmean         False
within_days_31num_25sum              False
within_days_31num_25mean             False
within_days_31num_50sum              False
within_days_31num_50mean             False
within_days_31num_75sum              False
within_days_31num_75mean             False
within_days_31num_985sum             False
within_days_31num_985mean            False
within_days_31num_100sum             False
within_days_31num_100mean            False
within_days_90num_unqsum             False
within_days_90num_unqmean            False
within_days_90num_unqcount           False
within_days_90total_secssum          False
within_days_90total_secsmean         False
within_days_90num_25sum              False
within_days_90num_25mean             False
within_days_90num_50sum              False
within_days_90num_50mean             False
within_days_90num_75sum              False
within_days_90num_75mean             False
within_days_90num_985sum             False
within_days_90num_985mean            False
within_days_90num_100sum             False
within_days_90num_100mean            False
within_days_180num_unqsum            False
within_days_180num_unqmean           False
within_days_180num_unqcount          False
within_days_180total_secssum         False
within_days_180total_secsmean        False
within_days_180num_25sum             False
within_days_180num_25mean            False
within_days_180num_50sum             False
within_days_180num_50mean            False
within_days_180num_75sum             False
within_days_180num_75mean            False
within_days_180num_985sum            False
within_days_180num_985mean           False
within_days_180num_100sum            False
within_days_180num_100mean           False
within_days_365num_unqsum            False
within_days_365num_unqmean           False
within_days_365num_unqcount          False
within_days_365total_secssum         False
within_days_365total_secsmean        False
within_days_365num_25sum             False
within_days_365num_25mean            False
within_days_365num_50sum             False
within_days_365num_50mean            False
within_days_365num_75sum             False
within_days_365num_75mean            False
within_days_365num_985sum            False
within_days_365num_985mean           False
within_days_365num_100sum            False
within_days_365num_100mean           False
within_days_9999num_unqsum           False
within_days_9999num_unqmean          False
within_days_9999num_unqcount         False
within_days_9999total_secssum        False
within_days_9999total_secsmean       False
within_days_9999num_25sum            False
within_days_9999num_25mean           False
within_days_9999num_50sum            False
within_days_9999num_50mean           False
within_days_9999num_75sum            False
within_days_9999num_75mean           False
within_days_9999num_985sum           False
within_days_9999num_985mean          False
within_days_9999num_100sum           False
within_days_9999num_100mean          False
total_plan_days                      False
total_amount_paid                    False
amount_paid_per_day                  False
diff_renewal_duration                False
diff_plan_amount_paid_per_day        False
latest_payment_method_id             False
latest_plan_days                     False
latest_plan_list_price               False
latest_amount_paid                   False
latest_auto_renew                    False
latest_transaction_date              False
latest_expire_date                   False
latest_is_cancel                     False
latest_amount_paid_per_day           False
dtype: bool

The cell above verifies that we have no null values in our data, with one exception: gender, which we handle below with dummy encoding.
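To see exactly how many users are missing a gender label before we encode it, a quick one-liner (a sketch, not in the original notebook) counts the missing entries alongside the labels:

df_fa['gender'].value_counts(dropna=False)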

Now let's inspect our data types:

df_fa.dtypes
city                                          int64
bd                                            int64
gender                                       object
registered_via                                int64
registration_init_time                        int64
is_churn                                      int64
date_featuresdatemax_date            datetime64[ns]
date_featuresdatemin_date            datetime64[ns]
date_featuresdatelistening_tenure             int64
within_days_1num_unqsum                       int64
within_days_1num_unqmean                      int64
within_days_1num_unqcount                     int64
within_days_1total_secssum                  float64
within_days_1total_secsmean                 float64
within_days_1num_25sum                        int64
within_days_1num_25mean                       int64
within_days_1num_50sum                        int64
within_days_1num_50mean                       int64
within_days_1num_75sum                        int64
within_days_1num_75mean                       int64
within_days_1num_985sum                       int64
within_days_1num_985mean                      int64
within_days_1num_100sum                       int64
within_days_1num_100mean                      int64
within_days_7num_unqsum                       int64
within_days_7num_unqmean                    float64
within_days_7num_unqcount                     int64
within_days_7total_secssum                  float64
within_days_7total_secsmean                 float64
within_days_7num_25sum                        int64
within_days_7num_25mean                     float64
within_days_7num_50sum                        int64
within_days_7num_50mean                     float64
within_days_7num_75sum                        int64
within_days_7num_75mean                     float64
within_days_7num_985sum                       int64
within_days_7num_985mean                    float64
within_days_7num_100sum                       int64
within_days_7num_100mean                    float64
within_days_14num_unqsum                      int64
within_days_14num_unqmean                   float64
within_days_14num_unqcount                    int64
within_days_14total_secssum                 float64
within_days_14total_secsmean                float64
within_days_14num_25sum                       int64
within_days_14num_25mean                    float64
within_days_14num_50sum                       int64
within_days_14num_50mean                    float64
within_days_14num_75sum                       int64
within_days_14num_75mean                    float64
within_days_14num_985sum                      int64
within_days_14num_985mean                   float64
within_days_14num_100sum                      int64
within_days_14num_100mean                   float64
within_days_31num_unqsum                      int64
within_days_31num_unqmean                   float64
within_days_31num_unqcount                    int64
within_days_31total_secssum                 float64
within_days_31total_secsmean                float64
within_days_31num_25sum                       int64
within_days_31num_25mean                    float64
within_days_31num_50sum                       int64
within_days_31num_50mean                    float64
within_days_31num_75sum                       int64
within_days_31num_75mean                    float64
within_days_31num_985sum                      int64
within_days_31num_985mean                   float64
within_days_31num_100sum                      int64
within_days_31num_100mean                   float64
within_days_90num_unqsum                      int64
within_days_90num_unqmean                   float64
within_days_90num_unqcount                    int64
within_days_90total_secssum                 float64
within_days_90total_secsmean                float64
within_days_90num_25sum                       int64
within_days_90num_25mean                    float64
within_days_90num_50sum                       int64
within_days_90num_50mean                    float64
within_days_90num_75sum                       int64
within_days_90num_75mean                    float64
within_days_90num_985sum                      int64
within_days_90num_985mean                   float64
within_days_90num_100sum                      int64
within_days_90num_100mean                   float64
within_days_180num_unqsum                     int64
within_days_180num_unqmean                  float64
within_days_180num_unqcount                   int64
within_days_180total_secssum                float64
within_days_180total_secsmean               float64
within_days_180num_25sum                      int64
within_days_180num_25mean                   float64
within_days_180num_50sum                      int64
within_days_180num_50mean                   float64
within_days_180num_75sum                      int64
within_days_180num_75mean                   float64
within_days_180num_985sum                     int64
within_days_180num_985mean                  float64
within_days_180num_100sum                     int64
within_days_180num_100mean                  float64
within_days_365num_unqsum                     int64
within_days_365num_unqmean                  float64
within_days_365num_unqcount                   int64
within_days_365total_secssum                float64
within_days_365total_secsmean               float64
within_days_365num_25sum                      int64
within_days_365num_25mean                   float64
within_days_365num_50sum                      int64
within_days_365num_50mean                   float64
within_days_365num_75sum                      int64
within_days_365num_75mean                   float64
within_days_365num_985sum                     int64
within_days_365num_985mean                  float64
within_days_365num_100sum                     int64
within_days_365num_100mean                  float64
within_days_9999num_unqsum                    int64
within_days_9999num_unqmean                 float64
within_days_9999num_unqcount                  int64
within_days_9999total_secssum               float64
within_days_9999total_secsmean              float64
within_days_9999num_25sum                     int64
within_days_9999num_25mean                  float64
within_days_9999num_50sum                     int64
within_days_9999num_50mean                  float64
within_days_9999num_75sum                     int64
within_days_9999num_75mean                  float64
within_days_9999num_985sum                    int64
within_days_9999num_985mean                 float64
within_days_9999num_100sum                    int64
within_days_9999num_100mean                 float64
total_plan_days                               int64
total_amount_paid                             int64
amount_paid_per_day                         float64
diff_renewal_duration                         int64
diff_plan_amount_paid_per_day               float64
latest_payment_method_id                      int64
latest_plan_days                              int64
latest_plan_list_price                        int64
latest_amount_paid                            int64
latest_auto_renew                             int64
latest_transaction_date                       int64
latest_expire_date                            int64
latest_is_cancel                              int64
latest_amount_paid_per_day                  float64
dtype: object

We see we have a couple of datetime64 columns in the frame. We'll need to address these, as the ML algorithms can't consume datetimes directly. The code below breaks each datetime-formatted column up into 4 separate integer columns.

def split_date_col(date_col_name):
    """Function that takes a column of datetime64[ns] items and converts it into 4 columns:
        1) Year integer
        2) Month integer
        3) Day integer
        4) Days since January 1, 2001, as an integer

        It then deletes the original date 

    Args:
        date_col_name (string):  The column name, as a string.
    """
    df_fa[date_col_name + '_year'] = df_fa[date_col_name].dt.year
    df_fa[date_col_name + '_month'] = df_fa[date_col_name].dt.month
    df_fa[date_col_name + '_day'] = df_fa[date_col_name].dt.day
    df_fa[date_col_name + '_absday'] = ((df_fa[date_col_name] - pd.to_datetime('1/1/2000'))
                                      .astype('timedelta64[D]')
                                      .astype('int64')
                                     )
    df_fa.drop(date_col_name, axis=1, inplace=True)
#Only run this cell once, else it will fail on the date columns it deletes
split_date_col('date_featuresdatemax_date')
split_date_col('date_featuresdatemin_date')
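Since this cell fails if accidentally re-run, a small re-run-safe variant (a sketch, not part of the original pipeline) guards on each column's presence first:

#Re-run-safe variant: only split date columns that still exist
for col in ['date_featuresdatemax_date', 'date_featuresdatemin_date']:
    if col in df_fa.columns:
        split_date_col(col)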

Now let's re-check our data types:

df_fa.dtypes
city                                   int64
bd                                     int64
gender                                object
registered_via                         int64
registration_init_time                 int64
is_churn                               int64
date_featuresdatelistening_tenure      int64
within_days_1num_unqsum                int64
within_days_1num_unqmean               int64
within_days_1num_unqcount              int64
within_days_1total_secssum           float64
within_days_1total_secsmean          float64
within_days_1num_25sum                 int64
within_days_1num_25mean                int64
within_days_1num_50sum                 int64
within_days_1num_50mean                int64
within_days_1num_75sum                 int64
within_days_1num_75mean                int64
within_days_1num_985sum                int64
within_days_1num_985mean               int64
within_days_1num_100sum                int64
within_days_1num_100mean               int64
within_days_7num_unqsum                int64
within_days_7num_unqmean             float64
within_days_7num_unqcount              int64
within_days_7total_secssum           float64
within_days_7total_secsmean          float64
within_days_7num_25sum                 int64
within_days_7num_25mean              float64
within_days_7num_50sum                 int64
within_days_7num_50mean              float64
within_days_7num_75sum                 int64
within_days_7num_75mean              float64
within_days_7num_985sum                int64
within_days_7num_985mean             float64
within_days_7num_100sum                int64
within_days_7num_100mean             float64
within_days_14num_unqsum               int64
within_days_14num_unqmean            float64
within_days_14num_unqcount             int64
within_days_14total_secssum          float64
within_days_14total_secsmean         float64
within_days_14num_25sum                int64
within_days_14num_25mean             float64
within_days_14num_50sum                int64
within_days_14num_50mean             float64
within_days_14num_75sum                int64
within_days_14num_75mean             float64
within_days_14num_985sum               int64
within_days_14num_985mean            float64
within_days_14num_100sum               int64
within_days_14num_100mean            float64
within_days_31num_unqsum               int64
within_days_31num_unqmean            float64
within_days_31num_unqcount             int64
within_days_31total_secssum          float64
within_days_31total_secsmean         float64
within_days_31num_25sum                int64
within_days_31num_25mean             float64
within_days_31num_50sum                int64
within_days_31num_50mean             float64
within_days_31num_75sum                int64
within_days_31num_75mean             float64
within_days_31num_985sum               int64
within_days_31num_985mean            float64
within_days_31num_100sum               int64
within_days_31num_100mean            float64
within_days_90num_unqsum               int64
within_days_90num_unqmean            float64
within_days_90num_unqcount             int64
within_days_90total_secssum          float64
within_days_90total_secsmean         float64
within_days_90num_25sum                int64
within_days_90num_25mean             float64
within_days_90num_50sum                int64
within_days_90num_50mean             float64
within_days_90num_75sum                int64
within_days_90num_75mean             float64
within_days_90num_985sum               int64
within_days_90num_985mean            float64
within_days_90num_100sum               int64
within_days_90num_100mean            float64
within_days_180num_unqsum              int64
within_days_180num_unqmean           float64
within_days_180num_unqcount            int64
within_days_180total_secssum         float64
within_days_180total_secsmean        float64
within_days_180num_25sum               int64
within_days_180num_25mean            float64
within_days_180num_50sum               int64
within_days_180num_50mean            float64
within_days_180num_75sum               int64
within_days_180num_75mean            float64
within_days_180num_985sum              int64
within_days_180num_985mean           float64
within_days_180num_100sum              int64
within_days_180num_100mean           float64
within_days_365num_unqsum              int64
within_days_365num_unqmean           float64
within_days_365num_unqcount            int64
within_days_365total_secssum         float64
within_days_365total_secsmean        float64
within_days_365num_25sum               int64
within_days_365num_25mean            float64
within_days_365num_50sum               int64
within_days_365num_50mean            float64
within_days_365num_75sum               int64
within_days_365num_75mean            float64
within_days_365num_985sum              int64
within_days_365num_985mean           float64
within_days_365num_100sum              int64
within_days_365num_100mean           float64
within_days_9999num_unqsum             int64
within_days_9999num_unqmean          float64
within_days_9999num_unqcount           int64
within_days_9999total_secssum        float64
within_days_9999total_secsmean       float64
within_days_9999num_25sum              int64
within_days_9999num_25mean           float64
within_days_9999num_50sum              int64
within_days_9999num_50mean           float64
within_days_9999num_75sum              int64
within_days_9999num_75mean           float64
within_days_9999num_985sum             int64
within_days_9999num_985mean          float64
within_days_9999num_100sum             int64
within_days_9999num_100mean          float64
total_plan_days                        int64
total_amount_paid                      int64
amount_paid_per_day                  float64
diff_renewal_duration                  int64
diff_plan_amount_paid_per_day        float64
latest_payment_method_id               int64
latest_plan_days                       int64
latest_plan_list_price                 int64
latest_amount_paid                     int64
latest_auto_renew                      int64
latest_transaction_date                int64
latest_expire_date                     int64
latest_is_cancel                       int64
latest_amount_paid_per_day           float64
date_featuresdatemax_date_year         int64
date_featuresdatemax_date_month        int64
date_featuresdatemax_date_day          int64
date_featuresdatemax_date_absday       int64
date_featuresdatemin_date_year         int64
date_featuresdatemin_date_month        int64
date_featuresdatemin_date_day          int64
date_featuresdatemin_date_absday       int64
dtype: object
df_fa.describe(include='all')
(wide output of df_fa.describe(include='all') omitted for readability; key takeaways: 88,544 rows; gender is the only non-numeric column, with 45,906 non-null values in 2 categories, 'male' being the most frequent at 24,390; and a few total_secs aggregates contain extreme negative values, with minimums near -6.5e16, indicating some bad raw values)

Next, we convert the gender variable (a string) to dummy encoding:

#Convert gender variable:
dummy = pd.get_dummies(df_fa['gender'])
df_fa = pd.concat([df_fa, dummy], axis=1)
df_fa.drop('gender', axis=1, inplace=True)
"""Note, we're not concerned about collinearity having both a female and a male category,
as there are several cases where both values are 0, presumably because the user did not
supply the information.  Thus, the two columns, male and female, capture the 3 cases:
male, female, and 'not supplied'. """
;
''
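As a quick sanity check on the encoding (a minimal sketch; rows that sum to 0 across the two dummy columns are the 'not supplied' users):

#Count rows by number of gender dummies set (0 = not supplied, 1 = male or female)
df_fa[['female', 'male']].sum(axis=1).value_counts()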

A couple more quick inspections:

df_fa.head()
(wide df_fa.head() output omitted for readability; the frame is indexed by msno, the datetime columns now appear as their year/month/day/absday components, and the female/male dummy columns have been appended)

The team added a few more features to improve the model: the pairwise differences between the latest transaction date, the latest expiry date, and the latest user-log date, computed in the cell below:

#First transform these YYYYMMDD integers into datetime, then into 4 components
df_fa['latest_transaction_date'] = pd.to_datetime(df_fa['latest_transaction_date'], format='%Y%m%d')
df_fa['latest_expire_date'] = pd.to_datetime(df_fa['latest_expire_date'], format='%Y%m%d')
split_date_col('latest_transaction_date')
split_date_col('latest_expire_date')

#Now perform the subtraction of all 3 combinations
df_fa['latest_trans_vs_expire'] = df_fa['latest_transaction_date_absday'] - df_fa['latest_expire_date_absday']
df_fa['latest_trans_vs_log'] = df_fa['latest_transaction_date_absday'] - df_fa['date_featuresdatemax_date_absday']
df_fa['latest_log_vs_expire'] = df_fa['date_featuresdatemax_date_absday'] - df_fa['latest_expire_date_absday']
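A quick look at the three new difference features (a sketch, not run in the original notebook); since most users auto-renew, we'd expect the latest transaction to precede the latest expiry, i.e., a negative latest_trans_vs_expire:

df_fa[['latest_trans_vs_expire', 'latest_trans_vs_log', 'latest_log_vs_expire']].describe()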

df_fa.head()
(wide df_fa.head() output omitted for readability; the latest transaction and expiry dates now appear as year/month/day/absday components, followed by the three new difference features, e.g., latest_trans_vs_expire = -27 for the first few users)
df_fa.shape
(88544, 159)

7. Quick Exploratory Data Analysis

Next, we perform some quick EDA on our data.

Combined Dataset

colors = ['red', 'blue']
plt.figure()
for color, i, name in zip(colors, [0,1], ['no_churn', 'churn']):
    plt.scatter(df_fa[df_fa['is_churn'] == i]['date_featuresdatelistening_tenure'],
               df_fa[df_fa['is_churn'] == i]['within_days_7num_unqmean'], color = color, alpha = 0.2, label = name)
plt.legend(loc = 'best')
plt.xlabel('Listening Tenure')
plt.ylabel('Mean Number of Unique listening Periods in the last 7 days')
Text(0,0.5,'Mean Number of Unique listening Periods in the last 7 days')

Drawing

Looking at the plot above, we can see that users who churn generally have low numbers of unique plays, meaning they aren't actively using the music service. We can also see a spike in unique plays among users with long tenure (>700 days). Intuitively this makes sense: users who are committed to the service (i.e., have used it for a long time) may have developed lifestyle patterns where they listen while driving, working, etc.

avg_price_no_churn = round(df_fa[df_fa['is_churn'] == 0]['amount_paid_per_day'].mean(), 2)
avg_price_is_churn = round(df_fa[df_fa['is_churn'] == 1]['amount_paid_per_day'].mean(), 2)
print('Avg cost/day for no churn: %.2f' %avg_price_no_churn)
print('Avg cost/day for churn: %.2f' %avg_price_is_churn)
Avg cost/day for no churn: 4.54
Avg cost/day for churn: 4.67

Users who churn tend to spend slightly more per day on the subscription service. Our analysis will conclude with an economic model that uses the churn predictions to estimate how much incentive should be offered to specific users, with the goal of retaining customers who would otherwise churn.
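As a preview of that economic model, here is a minimal sketch of the underlying decision rule, assuming purely illustrative values for customer lifetime value, the incentive's cost, and the probability that the offer actually retains the user; none of these numbers come from the data:

#Hedged sketch: offer an incentive only when its expected benefit exceeds its cost
clv = 100.0            #Assumed remaining value of a retained user (illustrative)
incentive_cost = 10.0  #Assumed cost of the retention offer (illustrative)
uplift = 0.3           #Assumed probability the offer prevents a churn (illustrative)

p_churn = 0.4          #Example churn probability produced by a model
expected_benefit = p_churn * uplift * clv
print('Offer incentive:', expected_benefit > incentive_cost)  #0.4*0.3*100 = 12 > 10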

corr = df_fa.iloc[:, 8:99:15].corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask)
<matplotlib.axes._subplots.AxesSubplot at 0x2261eb32550>

Drawing

The correlation plot shows correlation among the counts of unique songs played within the last X days, with windows whose X values are close together correlating more strongly. This makes sense and is what we expected, since nearby windows overlap and share most of their underlying data.

mean_col = df_fa.iloc[:, 7:98:15].mean()
mean_col
within_days_1num_unqmean      20.698794
within_days_7num_unqmean      21.900082
within_days_14num_unqmean     22.383794
within_days_31num_unqmean     23.271521
within_days_90num_unqmean     24.238664
within_days_180num_unqmean    24.740990
within_days_365num_unqmean    25.245055
dtype: float64
colors = ['red', 'blue']
for color, i, name in zip(colors, [0,1], ['no_churn', 'churn']):
    plt.plot_date(pd.to_datetime(df_fa[df_fa['is_churn'] == i]['registration_init_time'], format = '%Y%m%d'),
               df_fa[df_fa['is_churn'] == i]['within_days_7total_secsmean']/(60*60), color = color, alpha = 0.2, label = name)
plt.legend(loc = 'best')
plt.xlabel('Registration Date')
plt.ylim([0,20])
plt.ylabel('Avg Hours Listened during last 7 days')
Text(0,0.5,'Avg Hours Listened during last 7 days')

Drawing

This plot shows registration date vs. average hours listened over the past 7 days. The intent is to eventually switch the x-axis to transaction date, since that more accurately reflects when a user ends service. We can see more churn occurring among users who registered in the last 4 years of the dataset, though this could simply reflect a growing user base (i.e., the same proportion of churn).
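As a sketch of that change, the transaction date can be reassembled from the year/month/day features already extracted above (pd.to_datetime accepts a dict of year/month/day columns; the column names below come from the feature table):

#Sketch: assemble the latest transaction date from its extracted components,
#for use as an alternative x-axis in the plot above
trans_date = pd.to_datetime(dict(year=df_fa['latest_transaction_date_year'],
                                 month=df_fa['latest_transaction_date_month'],
                                 day=df_fa['latest_transaction_date_day']))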

df_fa['registration_time'] = pd.to_datetime(df_fa['registration_init_time'], format = '%Y%m%d').map(lambda x: x.year)
reg_count = []
thirty_day_churn = []
for year in range(2005, 2018):
    reg_count.append(sum(df_fa['registration_time'] == year))
    thirty_day_churn.append(len(df_fa[(df_fa['registration_time'] == year) & (df_fa['date_featuresdatelistening_tenure'] < 30) & (df_fa['is_churn'] == 1)])/sum(df_fa['registration_time'] == year))
plt.bar(range(2005, 2018), reg_count)
plt.xlabel('Year of Registration')
plt.title('Registration Count Per Year')
plt.show()

Drawing

plt.bar(range(2005, 2018), thirty_day_churn)
plt.xlabel('Year')
plt.title('Proportion of 30 day Churn to Total Registration')
plt.show()

Drawing

This statistic shows a recent, high level of churn among users on a trial or other very short-duration subscription. From a business perspective this is concerning, and the underlying problem needs to be addressed to bring churn back down. The sharp spike may mark when the company started offering 30-day subscriptions, or some other factor that increased churn for customers with subscriptions shorter than 30 days.

plt.hist(df_fa['date_featuresdatemax_date_month'])
plt.title('Histogram of the last listen date')
plt.show()

Drawing

The dataset (per its description) was collected to include churn statistics for the months of February and March. We can see a large peak for February, which intuitively makes sense: active users most recently listened during the month in which the data was collected.

print('Is Latest Cancel and Is Churn Correlation: %0.4f' %np.corrcoef(df_fa['latest_is_cancel'], df_fa['is_churn'])[0,1])
Is Latest Cancel and Is Churn Correlation: 0.4328

Intuitively we had expected a very high correlation between these two variables. While there is decent correlation, they are not perfectly aligned, because a user can cancel their service without churning (e.g., by changing plan type).
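To quantify how often a latest cancellation does or does not end in churn, a row-normalized cross-tabulation of the two flags works (a quick sketch on the same columns):

#Sketch: of users whose latest transaction was/wasn't a cancel,
#what fraction actually churned? normalize='index' converts each row to proportions
pd.crosstab(df_fa['latest_is_cancel'], df_fa['is_churn'], normalize='index')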

User Logs EDA

user_logs['num_unq'].groupby(user_logs['date'].dt.weekday_name).count().plot(kind = 'bar', title = 'Unique Play Count Grouped by Weekday')
<matplotlib.axes._subplots.AxesSubplot at 0x2261b9b3f28>

Drawing

The figure above looks at listening behavior grouped by day of the week. Listening is fairly consistent across all days. This plot counts listening events but doesn't capture how long a user listens; for that, we can look at the total seconds feature included in the dataset.

#This cell takes a long time to plot, print statistics to screen instead
user_logs.groupby(user_logs['date'].dt.weekday_name)['total_secs'].sum()
date
Friday      -4.334985e+18
Monday      -3.892263e+18
Saturday    -3.431094e+18
Sunday      -3.855369e+18
Thursday    -3.919933e+18
Tuesday     -3.615562e+18
Wednesday   -3.486435e+18
Name: total_secs, dtype: float64

There is some odd behavior here, where all of the sums are large negative values. We need to analyze the column itself in order to determine why this odd behavior is observed.

user_logs['total_secs'].describe()
count    1.971063e+07
mean    -1.346260e+12
std      1.116172e+14
min     -9.223372e+15
25%      1.966237e+03
50%      4.703210e+03
75%      1.028291e+04
max      9.223372e+15
Name: total_secs, dtype: float64

The mean value for this column is in fact negative, which intuitively doesn't make sense, as it is not possible to listen to a song for a negative amount of time. Further analysis of how the software calculates total_secs will be important to understand why negative values are being written to the database.
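Until that root cause is found, one defensive option is to quantify and mask the impossible values before aggregating. The extremes near ±9.22e+15 are suspiciously close to the 64-bit integer limit divided by 1,000, hinting at an upstream overflow, though that is only our conjecture. A minimal sketch, assuming anything negative or above one full day (86,400 seconds) per row is invalid:

#Sketch: count the impossible values, then mask them to NaN so that
#subsequent sums and means ignore them (the one-day cap is our assumption)
print('Rows with negative total_secs:', (user_logs['total_secs'] < 0).sum())
secs_clean = user_logs['total_secs'].mask(
    (user_logs['total_secs'] < 0) | (user_logs['total_secs'] > 24 * 60 * 60))
print(secs_clean.describe())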

To further explore the total seconds columns, below we analyze the sum of the total seconds by user.

#This cell takes a long time to plot, print statistics to screen instead
user_logs['total_secs'].groupby(user_logs['msno'], sort = False).sum().describe()
count    8.855000e+04
mean    -2.996684e+14
std      3.620704e+15
min     -3.135946e+17
25%      2.145492e+05
50%      8.453256e+05
75%      2.223251e+06
max      9.223372e+15
Name: total_secs, dtype: float64

We continue to see negative values, even when summing the total listening time for each user. This behavior is strange and not well understood; further investigation will be necessary to determine why negative listening times are computed.

corr = user_logs.iloc[:, 2:7].corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask)
<matplotlib.axes._subplots.AxesSubplot at 0x2261ba97630>


Drawing

This correlation plot analyzes the correlation between the num columns in the dataset. From the definitions at the start of this notebook, these columns count songs played to a certain proportion of their total length (25%, 50%, etc.). Here we can see relatively low correlation between the columns.

The higher listening buckets do not include songs that fall below the bucket's lower cutoff (i.e., num_50 doesn't count every song played less than 50% of the way through, only those played 25-50%). We found this behavior interesting and wanted to analyze it further by plotting the column sums as a bar chart.

sum_num = []
cols = []
for col in user_logs.iloc[:,2:7].columns:
    sum_num.append(user_logs[col].sum())
    cols.append(col)

plt.bar(np.arange(len(sum_num)), sum_num)
plt.xticks(np.arange(len(sum_num)), cols)
plt.title('Sum of Proportion Song Listened To')
Text(0.5,1,'Sum of Proportion Song Listened To')

Drawing

We can see that num_100 (listening to the entire song) is by far the most common outcome, with num_25 (less than 25% of the song played) second. Intuitively this makes sense: a user most likely either likes a song and listens to the whole thing, or dislikes it and quickly skips to a new one.
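For scale, the same sums can be expressed as shares of all counted plays (a small follow-on sketch reusing the sum_num and cols lists computed above):

#Sketch: express each listening bucket as a share of all counted plays
props = np.array(sum_num) / np.sum(sum_num)
for col, p in zip(cols, props):
    print('%s: %.1f%%' % (col, 100 * p))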

Members EDA

members['gender'].groupby(members['gender']).count().plot(kind = 'bar', title = 'Members grouped by gender')
<matplotlib.axes._subplots.AxesSubplot at 0x2261c88c4e0>

Drawing

print('Proportion of gender column = N/A: %0.4f' %(members['gender'].isnull().sum()/len(members)))
Proportion of gender column = N/A: 0.4843

Breaking the members down by gender, we can see that a larger proportion of the users who report gender are male. Additionally, over 48% of this column is N/A (meaning the user didn't supply a gender). This has been accounted for in the combined dataset by separating gender into two binary columns, 'male' and 'female'.
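A minimal sketch of one way to produce those binary columns, assuming the raw values are the strings 'female' and 'male' (pd.get_dummies leaves both indicators at 0 for the N/A rows, matching the treatment described above):

#Sketch: one-hot encode gender; users with no reported gender get 0 in both columns
gender_dummies = pd.get_dummies(members['gender'])  #Produces 'female' and 'male' columns
members_encoded = pd.concat([members, gender_dummies], axis=1)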

members['city'].groupby(members['city']).count().plot(kind = 'bar', title = 'Count by city')
<matplotlib.axes._subplots.AxesSubplot at 0x2261b4caf28>

Drawing

We can see that city = 1 has a significantly higher proportion of users than any other city. The mapping between the numeric codes and actual city names isn't provided; from an ML perspective this is fine, but for business purposes the city names would be preferable.

members['bd'].groupby(members['bd']).count().plot(kind = 'bar', title = 'Count by birthdate')
<matplotlib.axes._subplots.AxesSubplot at 0x2261f0ccba8>

Drawing

The birthdate column has a large peak at 0 (presumably not provided by the user). Below we zoom past the zero value (via axis limits) to look at the distribution of the remaining birthdates.

members['bd'].groupby(members['bd']).count().plot(kind = 'bar', title = 'Count by birthdate Filtered')
plt.xlim([3,100])
plt.ylim([0,3000])
plt.show()

Drawing

The distribution of provided birthdates appears slightly skewed. Because the first plot made the range of entries hard to judge, below we use the describe function on the birthdate column. We see negative values (min = -49), which is odd and needs to be investigated further. It would also be helpful to know how this integer is defined (presumably days since a specific date).

members['bd'].describe()
count    89473.000000
mean        14.915069
std         18.416087
min        -49.000000
25%          0.000000
50%         17.000000
75%         27.000000
max       1051.000000
Name: bd, dtype: float64
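Pending a definitive data dictionary, a defensive option is to treat values outside a plausible range as missing. A sketch, where the 3-100 bounds mirror the axis limits used above and are our assumption, not a documented rule:

#Sketch: keep bd only where it falls in a plausible range; everything else becomes NaN
members['bd_clean'] = members['bd'].where(members['bd'].between(3, 100))
print(members['bd_clean'].describe())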

The members table provides information about each member subscribed to the KKBox service. Through our EDA we noted a few columns with disproportionately common values, which could make analysis with a machine learning algorithm difficult, as one dominant value can drown out the information in the rest of the data. By including a high number of features, our model will hopefully overcome this difficulty and accurately predict churn.

Moving forward, it would be preferable to better understand how these fields are sourced and computed, and to clean up the data collection as much as possible.

Transactions EDA

The transactions table contains information about each KKBox subscription, including the plan list price, the actual price the user paid, payment methods, and dates. The payment method column contains integers that map to specific payment types; we have not been granted access to the mapping, but from an ML perspective we can still build models. When deriving business insight, however, access to that mapping will be important.

transactions['payment_method_id'].groupby(transactions['payment_method_id']).count().plot(kind = 'bar', title = 'Count by payment type')
<matplotlib.axes._subplots.AxesSubplot at 0x2261b7f1588>

Drawing

From the plot above we can see that payment method 41 is by far the most popular.

transactions['plan_list_price'].groupby(transactions['plan_list_price']).count().plot(kind = 'bar', title = 'Count by plan list price')
<matplotlib.axes._subplots.AxesSubplot at 0x2262186a6a0>

Drawing

Above we analyze the popularity of specific plan prices on KKBox. From their website it appears there are free, monthly, and yearly tiers (this information is as of today's date, 4/21/2018; we do not know the specific plan offerings during the period of data collection for this analysis).

From the plot above we can see there is an overwhelmingly popular plan price at approximately 149.

transactions['prop_collected'] = round(transactions['actual_amount_paid'] / transactions['plan_list_price'], 3)
transactions['prop_collected'].groupby(transactions['prop_collected']).count().plot(kind = 'bar', title = 'Count by Proportion of Bill Collected')
<matplotlib.axes._subplots.AxesSubplot at 0x22621958828>

Drawing

There are two columns in the table that capture the plan cost and the amount collected. We wanted to look at the distribution of memberships that do not pay the full plan price. From the plot above we can see that most plans are fully paid; there is an infinity value, which is most likely an error (presumably a plan list price of 0); and there is a small peak at 80%.
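A quick way to keep those divide-by-zero rows from skewing later analysis is to convert the infinities to NaN (a sketch; np.inf appears wherever plan_list_price is 0):

#Sketch: replace +/-inf produced by zero list prices with NaN so they drop out of counts
transactions['prop_collected'] = transactions['prop_collected'].replace([np.inf, -np.inf], np.nan)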

transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'], format = '%Y%m%d')
transactions['is_cancel'].groupby(transactions['transaction_date'].dt.month).sum().plot(kind = 'bar', title = 'Is Cancel by Month')
<matplotlib.axes._subplots.AxesSubplot at 0x2262195d748>

Drawing

Our team next wanted to look at the is_cancel flag, which has some correlation with churn. By month, we can see that cancellations peaked in January and especially February. It would be interesting to see how this correlates with churn (as is_cancel isn't perfectly correlated with it). If churn is seasonal, that would matter both for prediction and for business insight, such as offering renewal incentives to users whose memberships expire in this timeframe.

print('Is Latest Cancel and Is Auto Renew: %0.4f' %np.corrcoef(transactions['is_cancel'], transactions['is_auto_renew'])[0,1])
Is Latest Cancel and Is Auto Renew: 0.0806

Our team had thought that the cancel and auto-renew flags would be negatively correlated (a user on an auto-renew membership may pay less attention and forget to cancel or change their membership). From the correlation coefficient we can see that there is almost no relationship between the two features.


8. Writing Output

Having extracted our features and performed some data manipulation, we now write the feature set to a .pkl file, allowing the second notebook to use this output without re-running all of the code above.

#Write all features to pkl file
df_fa.to_pickle('df_fa.pkl')

Music Churn: Predictive Modeling Notebook

Python Notebook 2 of 3

W207, Final Project

Spring, 2018

Team: Cameron Kennedy, Gaurav Khanna, Aaron Olson

Overview of Notebooks

For this project, the team created 3 separate Jupyter Notebooks to document its work. See notebook #1, (Data Preparation / Feature Extraction) for a brief description of each notebook.

Table of Contents (this notebook only)

  1. Setup and Loading Libraries
  2. Data Preparation
  3. Predictive Modeling!
  4. Calculating Probabilities
  5. Economic Impact
  6. Final Insights and Takeaways

1. Setup and Loading Libraries

#Import Required Libraries
#Data manipulation and visualization
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
%matplotlib inline

#Models et al
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import xgboost
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import (brier_score_loss, precision_score, recall_score, f1_score, log_loss)
#from sklearn.preprocessing import CategoricalEncoder  #Not yet released!

#Metrics
from sklearn.metrics import (roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, 
    precision_score, confusion_matrix, classification_report)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Now we'll load the data and print the first few rows:

# Load the data
df_fa = pd.read_pickle('df_fa.pkl')  #Pickle format preserves file as python object

#Set initial parameter(s)
pd.set_option('display.max_rows', 200)
pd.options.display.max_columns = 2000

#Ensure it's what we expect:
print(df_fa.shape)
df_fa.head()
(88544, 160)
(df_fa.head() output omitted for brevity: five sample rows, indexed by msno, across all 160 feature columns.)
df_fa.describe(include='all')
(df_fa.describe(include='all') output omitted for brevity: count, mean, std, min, quartiles, and max for all 160 columns.)

2. Data Preparation

Splitting Train, Dev, and Test

First, we need to split the data into our train, dev, and test sets, which we'll do at rates of 60%, 25%, and 15% respectively.

#Split data into a) train, dev, & test, b) data & labels

np.random.seed(5)  #Set so that % churn is somewhat consistent

#Train, Dev, Test splits: 60/25/15
train, devtest = train_test_split(df_fa, test_size=0.4)
dev, test = train_test_split(devtest, test_size=15/40)

#Calculate churn percentages
churn_rate_all = df_fa['is_churn'].sum() / df_fa['is_churn'].count()
churn_rate_train = train['is_churn'].sum() / train['is_churn'].count()
churn_rate_dev = dev['is_churn'].sum() / dev['is_churn'].count()
churn_rate_test = test['is_churn'].sum() / test['is_churn'].count()
#Print churn percentages
print('Check churn percentages:')
print('  All data, % churn: {:.1%}'.format(churn_rate_all))
print('Train data, % churn: {:.1%}'.format(churn_rate_train))
print('  Dev data, % churn: {:.1%}'.format(churn_rate_dev))
print(' Test data, % churn: {:.1%}'.format(churn_rate_test))
Check churn percentages:
  All data, % churn: 50.6%
Train data, % churn: 50.7%
  Dev data, % churn: 50.4%
 Test data, % churn: 50.2%

The training data is fine at ~50% churn (we get more training examples of churn). We change the dev and test sets back to the real-world rate (~6%).

#Reduce dev set to 6% churn
#Select x rows is_churn == 1; append to all rows where is_churn == 0
churn_rate_actual = 0.11  #Empirically this works
dev_churn_split_factor = (churn_rate_dev * churn_rate_actual) / (1 - churn_rate_actual)
dummy, dev_sub = train_test_split(dev[dev.is_churn==1], test_size=dev_churn_split_factor)
# dev = pd.concat([dev[dev.is_churn==0], dev_sub], ignore_index=True)
# We'll not ignore the index. We want msno as the index
dev = pd.concat([dev[dev.is_churn==0], dev_sub])
# Test
dev.head()
(dev.head() output omitted for brevity: five sample rows of the reduced dev set, indexed by msno.)
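For reference, the split factor can also be derived in closed form instead of tuned empirically: if a set has churn rate c and we keep a fraction f of its churn rows (and all of its non-churn rows), the resulting churn rate is f*c / (f*c + (1 - c)); solving for a target rate t gives f = t(1 - c) / (c(1 - t)). A sketch of that derivation in code (ours, not from the notebook):

#Sketch: closed-form fraction of churn rows to keep to reach a target churn rate
def churn_split_factor(current_rate, target_rate):
    return target_rate * (1 - current_rate) / (current_rate * (1 - target_rate))

print(churn_split_factor(0.504, 0.06))  #~0.063, close to the empirical factor above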
#Reduce test set to 6% churn
test_churn_split_factor = (churn_rate_test * churn_rate_actual) / (1 - churn_rate_actual)
dummy, test_sub = train_test_split(test[test.is_churn==1], test_size=test_churn_split_factor)
# test = pd.concat([test[test.is_churn==0], test_sub], ignore_index=True)
test = pd.concat([test[test.is_churn==0], test_sub])
# Test
test.head()
(test.head() output omitted for brevity: four sample rows of the reduced test set, indexed by msno.)
oGtvKgIb+1vvcTTPdZWFyeyoUchFtc+9D+KOfR+DIdg= 1 0 7 20160106 0 213 11 11 1 502.463 502.463 8 8 3 3 0 0 0 0 0 0 14 7.000000 2 982.412 491.206000 9 4.500000 3 1.500000 0 0.000000 1 0.500000 1 0.500000 25 8.333333 3 2976.693 992.231000 11 3.666667 4 1.333333 0 0.000000 2 0.666667 8 2.666667 26 6.500000 4 2981.738 745.434500 12 3.000000 4 1.000000 0 0.000000 2 0.500000 8 2.000000 196 16.333333 12 47610.576 3967.548000 15 1.250000 6 0.500000 4 0.333333 4 0.333333 183 15.250000 1672 23.222222 72 370688.635 5148.453264 256 3.555556 91 1.263889 44 0.611111 32 0.444444 1405 19.513889 2073 20.126214 103 453956.260 4407.342330 384 3.728155 126 1.223301 74 0.718447 49 0.475728 1685 16.359223 2073 20.126214 103 453956.260 4407.342330 384 3.728155 126 1.223301 74 0.718447 49 0.475728 1685 16.359223 390 1287 3.300000 0 0.0 41 30 99 99 1 0 3.300000 2016 8 6 6062 2016 1 6 5849 0 0 2017 2 5 6245 2017 3 5 6273 -28 183 -211 2016
shHx7K5hJ3W50FoA4BTEQfSyVcuqCidkjCtY21FdTLs= 22 29 3 20150916 0 531 42 42 1 8183.695 8183.695 10 10 1 1 3 3 5 5 24 24 232 33.142857 7 52512.478 7501.782571 51 7.285714 6 0.857143 8 1.142857 15 2.142857 207 29.571429 505 42.083333 12 121942.448 10161.870667 64 5.333333 12 1.000000 14 1.166667 19 1.583333 510 42.500000 806 36.636364 22 181926.222 8269.373727 182 8.272727 22 1.000000 23 1.045455 30 1.363636 724 32.909091 1418 21.164179 67 406302.004 6064.209015 318 4.746269 56 0.835821 39 0.582090 60 0.895522 1548 23.104478 2369 16.451389 144 734855.804 5103.165306 468 3.250000 94 0.652778 63 0.437500 80 0.555556 2760 19.166667 5487 19.052083 288 1727199.616 5997.220889 1096 3.805556 199 0.690972 157 0.545139 188 0.652778 6452 22.402778 5930 17.390029 341 1887033.172 5533.821619 1299 3.809384 228 0.668622 190 0.557185 221 0.648094 6986 20.486804 540 2682 4.966667 0 0.0 40 30 149 149 1 0 4.966667 2017 2 28 6268 2015 9 16 5737 0 1 2017 2 27 6267 2017 3 26 6294 -27 -1 -26 2015
#Split data / labels
train_labels = train['is_churn']
train_data = train.drop('is_churn', axis=1)
dev_labels = dev['is_churn']
dev_data = dev.drop('is_churn', axis=1)
test_labels = test['is_churn']
test_data = test.drop('is_churn', axis=1)
# Validation
print('\nCheck data sizes:')
print('Train data / labels: ', train_data.shape, train_labels.shape)
print('  Dev data / labels: ', dev_data.shape, dev_labels.shape)
print(' Test data / labels: ', test_data.shape, test_labels.shape)

#Baseline (if we guess all 0's, this is what we get)
print('\nBaseline Accuracy (dev): {:.2%}'.format(1-(dev['is_churn'].sum() / dev['is_churn'].count())))
print('Baseline Accuracy (test): {:.2%}'.format(1-(test['is_churn'].sum() / test['is_churn'].count())))
Check data sizes:
Train data / labels:  (53126, 159) (53126,)
  Dev data / labels:  (11681, 159) (11681,)
 Test data / labels:  (7024, 159) (7024,)

Baseline Accuracy (dev): 94.05%
Baseline Accuracy (test): 94.09%

Of note, the overall data has a churn rate of roughly 6% (~6% of users churn, ~94% stay). However, because we want our model to train well on both churned and non-churned users, our 'train' data set is split roughly 50/50 between churn and non-churn users, while the 'dev' and 'test' data sets keep the native 6/94 proportion (we remove most of the churn cases after the data is split into train, dev, and test). In our initial models, before performing this 50/50 split of the training data, our best recall score (on the dev data) was 78%. After the change to 50/50, the recall of our best models improved dramatically, to 96%!
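For reference, here is a minimal sketch of how such a 50/50 training set can be constructed by downsampling the majority class (the function name and random seed are illustrative; our actual split happens during data preparation):

import pandas as pd

def balance_by_downsampling(df, label_col='is_churn', random_state=0):
    """Downsample the majority (non-churn) class so labels are roughly 50/50."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=len(pos), random_state=random_state)
    return pd.concat([pos, neg]).sample(frac=1, random_state=random_state)  #shuffle rows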

Having split our data, we perform some quick inspections:

dev_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11681 entries, x5+AtzOZKAtnAJBsCIdAyiRl+1p9nIvAYchIkS4zaS4= to IDkK5VQYefRBzy2GAJgs2ChDorWoKcIBPrnBQGOimbA=
Columns: 159 entries, city to registration_time
dtypes: float64(61), int64(96), uint8(2)
memory usage: 14.1+ MB
dev_data.isnull().sum(axis=0)
city                                 0
bd                                   0
registered_via                       0
registration_init_time               0
date_featuresdatelistening_tenure    0
within_days_1num_unqsum              0
within_days_1num_unqmean             0
within_days_1num_unqcount            0
within_days_1total_secssum           0
within_days_1total_secsmean          0
within_days_1num_25sum               0
within_days_1num_25mean              0
within_days_1num_50sum               0
within_days_1num_50mean              0
within_days_1num_75sum               0
within_days_1num_75mean              0
within_days_1num_985sum              0
within_days_1num_985mean             0
within_days_1num_100sum              0
within_days_1num_100mean             0
within_days_7num_unqsum              0
within_days_7num_unqmean             0
within_days_7num_unqcount            0
within_days_7total_secssum           0
within_days_7total_secsmean          0
within_days_7num_25sum               0
within_days_7num_25mean              0
within_days_7num_50sum               0
within_days_7num_50mean              0
within_days_7num_75sum               0
within_days_7num_75mean              0
within_days_7num_985sum              0
within_days_7num_985mean             0
within_days_7num_100sum              0
within_days_7num_100mean             0
within_days_14num_unqsum             0
within_days_14num_unqmean            0
within_days_14num_unqcount           0
within_days_14total_secssum          0
within_days_14total_secsmean         0
within_days_14num_25sum              0
within_days_14num_25mean             0
within_days_14num_50sum              0
within_days_14num_50mean             0
within_days_14num_75sum              0
within_days_14num_75mean             0
within_days_14num_985sum             0
within_days_14num_985mean            0
within_days_14num_100sum             0
within_days_14num_100mean            0
within_days_31num_unqsum             0
within_days_31num_unqmean            0
within_days_31num_unqcount           0
within_days_31total_secssum          0
within_days_31total_secsmean         0
within_days_31num_25sum              0
within_days_31num_25mean             0
within_days_31num_50sum              0
within_days_31num_50mean             0
within_days_31num_75sum              0
within_days_31num_75mean             0
within_days_31num_985sum             0
within_days_31num_985mean            0
within_days_31num_100sum             0
within_days_31num_100mean            0
within_days_90num_unqsum             0
within_days_90num_unqmean            0
within_days_90num_unqcount           0
within_days_90total_secssum          0
within_days_90total_secsmean         0
within_days_90num_25sum              0
within_days_90num_25mean             0
within_days_90num_50sum              0
within_days_90num_50mean             0
within_days_90num_75sum              0
within_days_90num_75mean             0
within_days_90num_985sum             0
within_days_90num_985mean            0
within_days_90num_100sum             0
within_days_90num_100mean            0
within_days_180num_unqsum            0
within_days_180num_unqmean           0
within_days_180num_unqcount          0
within_days_180total_secssum         0
within_days_180total_secsmean        0
within_days_180num_25sum             0
within_days_180num_25mean            0
within_days_180num_50sum             0
within_days_180num_50mean            0
within_days_180num_75sum             0
within_days_180num_75mean            0
within_days_180num_985sum            0
within_days_180num_985mean           0
within_days_180num_100sum            0
within_days_180num_100mean           0
within_days_365num_unqsum            0
within_days_365num_unqmean           0
within_days_365num_unqcount          0
within_days_365total_secssum         0
within_days_365total_secsmean        0
within_days_365num_25sum             0
within_days_365num_25mean            0
within_days_365num_50sum             0
within_days_365num_50mean            0
within_days_365num_75sum             0
within_days_365num_75mean            0
within_days_365num_985sum            0
within_days_365num_985mean           0
within_days_365num_100sum            0
within_days_365num_100mean           0
within_days_9999num_unqsum           0
within_days_9999num_unqmean          0
within_days_9999num_unqcount         0
within_days_9999total_secssum        0
within_days_9999total_secsmean       0
within_days_9999num_25sum            0
within_days_9999num_25mean           0
within_days_9999num_50sum            0
within_days_9999num_50mean           0
within_days_9999num_75sum            0
within_days_9999num_75mean           0
within_days_9999num_985sum           0
within_days_9999num_985mean          0
within_days_9999num_100sum           0
within_days_9999num_100mean          0
total_plan_days                      0
total_amount_paid                    0
amount_paid_per_day                  0
diff_renewal_duration                0
diff_plan_amount_paid_per_day        0
latest_payment_method_id             0
latest_plan_days                     0
latest_plan_list_price               0
latest_amount_paid                   0
latest_auto_renew                    0
latest_is_cancel                     0
latest_amount_paid_per_day           0
date_featuresdatemax_date_year       0
date_featuresdatemax_date_month      0
date_featuresdatemax_date_day        0
date_featuresdatemax_date_absday     0
date_featuresdatemin_date_year       0
date_featuresdatemin_date_month      0
date_featuresdatemin_date_day        0
date_featuresdatemin_date_absday     0
female                               0
male                                 0
latest_transaction_date_year         0
latest_transaction_date_month        0
latest_transaction_date_day          0
latest_transaction_date_absday       0
latest_expire_date_year              0
latest_expire_date_month             0
latest_expire_date_day               0
latest_expire_date_absday            0
latest_trans_vs_expire               0
latest_trans_vs_log                  0
latest_log_vs_expire                 0
registration_time                    0
dtype: int64
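For completeness, the per-column listing above can be condensed to a single assertion:

assert not dev_data.isnull().values.any()  #no missing values in any column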
dev_data.describe(include='all')
[Output truncated: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for all 159 columns. Every count equals 11,681. A few oddities are visible in the extremes: bd (age) ranges from 0 to 942, and some of the longer-window total_secs aggregates include very large negative minima (≈ -9.2e15), indicating corrupt total_secs values in the raw listening logs.]

3. Predictive Modeling!

With our data in good shape, we move on to build predictive models.

We begin by building a couple of functions to help automate the evaluation of our models:

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.

    Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    Docstring reproduced as in the source.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized Confusion Matrix")
    else:
        print('Confusion Matrix')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size=20)
    plt.yticks(tick_marks, classes, size=20)

    fmt = '.1%' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), size=20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label', size=20)
    plt.xlabel('Predicted label', size=20)


def summarize_results(classifier, data=dev_data, labels=dev_labels):
    """Function to automate the displaying modeling results.

    Args:
        classifier (a sklearn classifier):  The classifier to plot.

    Kwargs: 
        data (dataframe):  The data on which to predict labels.  Defaults to dev_data.
        labels (dataframe):  The correct labels.  Defaults to dev_labels.

    Returns:
        None, but prints and plots summary metrics.
    """

    #Print Results
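    # Note: if `classifier` is a GridSearchCV fit with a custom scorer (e.g.,
    # scoring='recall'), .score() returns that scorer rather than plain accuracy,
    # so the 'Accuracy' label below can be misleading for grid-search objects.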
    print('Accuracy: {:.2%}'.format(classifier.score(data, labels)))
    print(classification_report(labels, classifier.predict(data)))

    #Plot Results
    class_names = [0, 1]

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(labels, classifier.predict(data))
    np.set_printoptions(precision=2)

    # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names,
                          title='Confusion Matrix')

    # Plot normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                          title='Normalized Confusion Matrix')

    plt.show()

Model Evaluation

We're placing an emphasis on recall as our primary metric, more so than accuracy. Our thinking here is that accuracy has a 'baseline' of 94% (predicting all 0's, i.e., no users churn), making our current best accuracy of ~98% much less impressive than it sounds. Moreover, we're okay with some false positives but would prefer to minimize false negatives. In other words, we'd rather flag a few customers as likely to churn who would actually stay (false positives) than predict that customers will stay who actually churn (false negatives). This presumes that the cost of retaining customers (for example, the cost of offering discounts) is less than the long-term loss associated with losing them. Admittedly, more domain knowledge would be required to validate this assumption, but we consider that validation beyond the scope of the project.

In summary, though we calculate several evaluation metrics below, recall is our primary scoring metric, so long as we have a reasonably low False Positive rate.
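To make these metrics concrete, here is a small sketch of how recall and the false positive rate fall out of a confusion matrix (illustrative; clf stands for any fitted classifier from the sections below):

from sklearn.metrics import confusion_matrix

#Rows are true labels, columns are predicted labels:
#[[tn, fp],
# [fn, tp]]
tn, fp, fn, tp = confusion_matrix(dev_labels, clf.predict(dev_data)).ravel()

recall = tp / (tp + fn)     #fraction of actual churners we catch (primary metric)
fpr = fp / (fp + tn)        #fraction of stayers we wrongly flag as churn risks
precision = tp / (tp + fp)  #of the flagged users, fraction who truly churn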

Poorly Performing Classifiers

We initially tried a few different models:

  • Gaussian Naive Bayes
  • K-Nearest Neighbors
  • Support Vector Machines

None of these had promising results, as shown in the output of the cell below.

Note that, though not shown here, the team explored several tuning options for these classifiers; none performed as well as the classifiers further down.

### NB Attempt ###
clf_NB_Gauss = GaussianNB()
clf_NB_Gauss.fit(train_data, train_labels)
print('NAIVE BAYES CLASSIFIER')
summarize_results(clf_NB_Gauss)

### KNN Attempt ###
print('KNN CLASSIFIER')
clf_neigh = KNeighborsClassifier(n_neighbors=10, n_jobs=8)  #Accuracy plateaus around n=10, all 0's
clf_neigh.fit(train_data, train_labels)
summarize_results(clf_neigh)

### SVM Attempt ###
print('SVM CLASSIFIER')
clf_SVM = svm.SVC(kernel='rbf', C=1, max_iter=640, probability=True)  #max_iter=635 gives 6% accuracy ... need new approach / tuning
clf_SVM.fit(train_data, train_labels)
summarize_results(clf_SVM)
NAIVE BAYES CLASSIFIER
Accuracy: 93.97%
             precision    recall  f1-score   support

          0       0.94      1.00      0.97     10986
          1       0.09      0.00      0.00       695

avg / total       0.89      0.94      0.91     11681

Confusion Matrix
[[10976    10]
 [  694     1]]
Normalized Confusion Matrix
[[9.99e-01 9.10e-04]
 [9.99e-01 1.44e-03]]

[Confusion matrix plots: raw counts and normalized]

KNN CLASSIFIER
Accuracy: 67.01%
             precision    recall  f1-score   support

          0       0.95      0.68      0.80     10986
          1       0.08      0.45      0.14       695

avg / total       0.90      0.67      0.76     11681

Confusion Matrix
[[7512 3474]
 [ 379  316]]
Normalized Confusion Matrix
[[0.68 0.32]
 [0.55 0.45]]

[Confusion matrix plots: raw counts and normalized]

SVM CLASSIFIER


C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\svm\base.py:218: ConvergenceWarning: Solver terminated early (max_iter=640).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)


Accuracy: 6.06%
             precision    recall  f1-score   support

          0       1.00      0.00      0.00     10986
          1       0.06      1.00      0.11       695

avg / total       0.94      0.06      0.01     11681

Confusion Matrix
[[   13 10973]
 [    0   695]]
Normalized Confusion Matrix
[[0. 1.]
 [0. 1.]]

[Confusion matrix plots: raw counts and normalized]
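As the ConvergenceWarning above suggests, the SVM would likely fare better with scaled inputs. A sketch of that approach (not run here, so we make no claims about its results):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm

#Scaling the features first typically lets the RBF kernel converge without
#hitting the iteration cap.
clf_SVM_scaled = make_pipeline(StandardScaler(), svm.SVC(kernel='rbf', C=1, probability=True))
clf_SVM_scaled.fit(train_data, train_labels)
summarize_results(clf_SVM_scaled)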

Random Forest Classifier

Having had little success with the classifiers above, we next tried a random forest, which performed very well:

### Random Forest Attempt ###
clf_RF = RandomForestClassifier(n_jobs=8, n_estimators=23)
clf_RF.fit(train_data, train_labels)
summarize_results(clf_RF)
print(clf_RF.get_params())
Accuracy: 97.40%
             precision    recall  f1-score   support

          0       1.00      0.97      0.99     10986
          1       0.70      0.98      0.82       695

avg / total       0.98      0.97      0.98     11681

Confusion Matrix
[[10696   290]
 [   14   681]]
Normalized Confusion Matrix
[[0.97 0.03]
 [0.02 0.98]]

[Confusion matrix plots: raw counts and normalized]

{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 23, 'n_jobs': 8, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Random Forest with GridSearchCV

We next tried Random Forest with GridSearchCV to further tune our parameters:

#RF Classifier with Grid Search
tuned_parameters = [{'n_estimators': [150],
                     'max_features': [20],
                     'min_samples_leaf': [2],
                    }]

clf_GS_RF = GridSearchCV(RandomForestClassifier(n_jobs=8),
                   tuned_parameters,
                   #cv=4,
                   scoring='recall')
clf_GS_RF.fit(train_data, train_labels)
pprint(clf_GS_RF.grid_scores_)
pprint(clf_GS_RF.best_estimator_)
pprint(clf_GS_RF.best_params_)
summarize_results(clf_GS_RF)
[mean: 0.97473, std: 0.00126, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150}]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=20, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=8,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
{'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150}
Accuracy: 98.13%
             precision    recall  f1-score   support

          0       1.00      0.97      0.99     10986
          1       0.70      0.98      0.82       695

avg / total       0.98      0.97      0.98     11681

Confusion Matrix
[[10699   287]
 [   13   682]]
Normalized Confusion Matrix
[[0.97 0.03]
 [0.02 0.98]]

[Confusion matrix plots: raw counts and normalized]

The best parameters turned out to be max_features = 20, min_samples_leaf = 2, and n_estimators = 150, which produced a recall of 0.96890.

Of these parameters, n_estimators seemed to have the most effect on model performance, though even that effect was fairly small; none of the parameters made much difference (the total range of recall scores was 0.0027, from 0.9662 to 0.9689).

To keep the run time down, we removed the tuning trials that swept multiple parameter values. However, the results of those trials are listed below:

Output of 'print(clf_GS_RF.grid_scores_)':

Tuning max_features, min_samples_leaf, and n_estimators

  • mean: 0.96619, std: 0.00077, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 40},
  • mean: 0.96663, std: 0.00076, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 50},
  • mean: 0.96782, std: 0.00066, params: {'max_features': 20, 'min_samples_leaf': 1, 'n_estimators': 100},
  • mean: 0.96660, std: 0.00143, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 40},
  • mean: 0.96786, std: 0.00028, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 50},
  • mean: 0.96868, std: 0.00055, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 100},
  • mean: 0.96816, std: 0.00048, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 40},
  • mean: 0.96838, std: 0.00060, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 50},
  • mean: 0.96853, std: 0.00101, params: {'max_features': 20, 'min_samples_leaf': 4, 'n_estimators': 100},
  • mean: 0.96704, std: 0.00057, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 40},
  • mean: 0.96786, std: 0.00032, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 50},
  • mean: 0.96734, std: 0.00089, params: {'max_features': 20, 'min_samples_leaf': 8, 'n_estimators': 100},
  • mean: 0.96704, std: 0.00016, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 40},
  • mean: 0.96753, std: 0.00068, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 50},
  • mean: 0.96819, std: 0.00080, params: {'max_features': 40, 'min_samples_leaf': 1, 'n_estimators': 100},
  • mean: 0.96838, std: 0.00024, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 40},
  • mean: 0.96819, std: 0.00064, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 50},
  • mean: 0.96860, std: 0.00048, params: {'max_features': 40, 'min_samples_leaf': 2, 'n_estimators': 100},
  • mean: 0.96868, std: 0.00046, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 40},
  • mean: 0.96860, std: 0.00009, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 50},
  • mean: 0.96827, std: 0.00027, params: {'max_features': 40, 'min_samples_leaf': 4, 'n_estimators': 100},
  • mean: 0.96834, std: 0.00032, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 40},
  • mean: 0.96860, std: 0.00069, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 50},
  • mean: 0.96864, std: 0.00059, params: {'max_features': 40, 'min_samples_leaf': 8, 'n_estimators': 100}]

Further tuning n_estimators

  • mean: 0.96842, std: 0.00073, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 100},
  • mean: 0.96890, std: 0.00082, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 150},
  • mean: 0.96838, std: 0.00078, params: {'max_features': 20, 'min_samples_leaf': 2, 'n_estimators': 200}]
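A side note for anyone rerunning this: grid_scores_ was removed in later scikit-learn releases (0.20+); in newer versions the same table can be recovered from cv_results_, e.g.:

import pandas as pd

#cv_results_ replaces grid_scores_ in scikit-learn >= 0.20
results = pd.DataFrame(clf_GS_RF.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False))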

XGBoost Classifier

Having seen strong performance from Random Forest models, we next tried an XGBoost classifier:

#Basic XGB Classifier
clf_XGB = xgboost.XGBClassifier(n_jobs=8)
clf_XGB.fit(train_data, train_labels)
summarize_results(clf_XGB)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:
(same warning repeated; remaining instances truncated)


Accuracy: 97.11%
             precision    recall  f1-score   support

          0       1.00      0.97      0.98     10986
          1       0.68      0.97      0.80       695

avg / total       0.98      0.97      0.97     11681

Confusion Matrix
[[10667   319]
 [   18   677]]
Normalized Confusion Matrix
[[0.97 0.03]
 [0.03 0.97]]

[Confusion matrix plots: raw counts and normalized]

The results of the XGBoost classifier were also very promising. With no tuning they weren't quite as good as the Random Forest's, but they were very close.

XGBoost with GridSearchCV

We next tried XGBoost with GridSearchCV to further tune our parameters:

#XGB Classifier with Grid Search
tuned_parameters = [{'reg_lambda': [0.01],
                     #'learning_rate': [0.01, 0.1, 1],
                     #'max_depth': [3, 5, 7, 9],
                     'max_depth': [5],  #Landed on 5
                     #'min_child_weight': [1, 3, 5],
                     'min_child_weight': [1],  #Landed on 1
                     #'gamma':[i/10.0 for i in range(0,5)],
                     'gamma':[0.01],  #Landed on 0.01
                     #'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100], #NEXT TRY THIS, BUT NOT WITH COMBO ABOVE
                     'reg_alpha':[0.1],
                    }]

clf_GS_XGB = GridSearchCV(xgboost.XGBClassifier(n_jobs=8),
                   tuned_parameters,
                   #cv=4,
                   scoring='recall')
clf_GS_XGB.fit(train_data, train_labels)
summarize_results(clf_GS_XGB)
(same DeprecationWarning as above, repeated; truncated)


Accuracy: 97.84%
             precision    recall  f1-score   support

          0       1.00      0.98      0.99     10986
          1       0.71      0.98      0.83       695

avg / total       0.98      0.98      0.98     11681

Confusion Matrix
[[10713   273]
 [   15   680]]
Normalized Confusion Matrix
[[0.98 0.02]
 [0.02 0.98]]

[Confusion matrix plots: raw counts and normalized]

With the tuning shown above, we obtained results that slightly exceeded those of the Random Forest model.

To keep the run time down, we commented out the lines that swept multiple tuning parameters. However, the results of those trials are as follows:

Output of 'print(clf_GS_XGB.grid_scores_)':

Tuning max depth and min child weight

  • mean: 0.96971, std: 0.00072, params: {'max_depth': 3, 'min_child_weight': 1, 'reg_lambda': 0.01}
  • mean: 0.96990, std: 0.00064, params: {'max_depth': 3, 'min_child_weight': 3, 'reg_lambda': 0.01}
  • mean: 0.96990, std: 0.00061, params: {'max_depth': 3, 'min_child_weight': 5, 'reg_lambda': 0.01}
  • mean: 0.97012, std: 0.00023, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_lambda': 0.01}
  • mean: 0.97042, std: 0.00014, params: {'max_depth': 5, 'min_child_weight': 3, 'reg_lambda': 0.01}
  • mean: 0.96983, std: 0.00055, params: {'max_depth': 5, 'min_child_weight': 5, 'reg_lambda': 0.01}
  • mean: 0.96994, std: 0.00048, params: {'max_depth': 7, 'min_child_weight': 1, 'reg_lambda': 0.01}
  • mean: 0.97038, std: 0.00125, params: {'max_depth': 7, 'min_child_weight': 3, 'reg_lambda': 0.01}
  • mean: 0.97031, std: 0.00026, params: {'max_depth': 7, 'min_child_weight': 5, 'reg_lambda': 0.01}
  • mean: 0.97038, std: 0.00096, params: {'max_depth': 9, 'min_child_weight': 1, 'reg_lambda': 0.01}
  • mean: 0.97057, std: 0.00115, params: {'max_depth': 9, 'min_child_weight': 3, 'reg_lambda': 0.01}
  • mean: 0.97038, std: 0.00024, params: {'max_depth': 9, 'min_child_weight': 5, 'reg_lambda': 0.01}

Tuning reg_alpha (with optimal values from above)

  • mean: 0.97005, std: 0.00018, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 1e-05, 'reg_lambda': 0.01}
  • mean: 0.97012, std: 0.00052, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.01, 'reg_lambda': 0.01}
  • mean: 0.97016, std: 0.00016, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
  • mean: 0.97009, std: 0.00045, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 1, 'reg_lambda': 0.01}
  • mean: 0.96738, std: 0.00057, params: {'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 100, 'reg_lambda': 0.01}

Tuning gamma (with optimal values from above)

  • mean: 0.97016, std: 0.00016, params: {'gamma': 0.0, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
  • mean: 0.97020, std: 0.00029, params: {'gamma': 0.1, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
  • mean: 0.97001, std: 0.00028, params: {'gamma': 0.2, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
  • mean: 0.97016, std: 0.00083, params: {'gamma': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
  • mean: 0.97012, std: 0.00056, params: {'gamma': 0.4, 'max_depth': 5, 'min_child_weight': 1, 'reg_alpha': 0.1, 'reg_lambda': 0.01}
print(clf_GS_XGB.score(dev_data, dev_labels))
0.9784172661870504



Final Run on Test Data

The cell below runs our best model on the not-yet-touched test data:

summarize_results(clf_GS_XGB, test_data, test_labels)
(same DeprecationWarning as above, repeated; truncated)


Accuracy: 98.31%
             precision    recall  f1-score   support

          0       1.00      0.98      0.99      6609
          1       0.71      0.98      0.83       415

avg / total       0.98      0.98      0.98      7024

Confusion Matrix
[[6444  165]
 [   7  408]]
Normalized Confusion Matrix
[[0.98 0.02]
 [0.02 0.98]]

[Confusion matrix plots: raw counts and normalized]

The test results confirm the same strong findings we saw in the dev data.

Modeling Results

Here are the key points summarizing our predictive modeling findings:

  • XGBoost worked the best, slightly outperforming Random Forest. No other model we tried came close to their results.
  • Scores:
    • We achieved a recall of 97.8% on the dev data and 98.3% on the test data (which was not used to tune any of the models). That corresponds to correctly predicting 680 / 408 users who churned and missing only 15 / 7, in the dev / test data respectively.
    • Our false positive rate was 2.5% in both the dev and test data, incorrectly flagging 273 / 165 users who did not actually churn, in the dev / test data respectively. The economic modeling below will give more insight into whether these levels are acceptable, but they seem quite good for now.
  • As noted earlier, before we performed the 50/50 split of the training data, our best recall score (on the dev data) was 78%; after the change to 50/50, the recall of our best models improved dramatically, to 96%.
  • Additional feature engineering also proved useful. Notably, adding date-interaction features (expiry date, last transaction, and last usage, along with the differences among these dates) reduced false positives in our dev data from 415 to 273, a big improvement. (A quick way to inspect which features the model relies on is sketched below.)
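As a quick sanity check on which features drive the model, both RandomForestClassifier and the XGBoost sklearn wrapper expose feature_importances_; a sketch using the tuned grid-search winner:

import pandas as pd

#Rank features by importance from the fitted grid-search winner
best_model = clf_GS_XGB.best_estimator_
importances = pd.Series(best_model.feature_importances_, index=train_data.columns)
print(importances.sort_values(ascending=False).head(20))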

4. Calculating Probabilities

Our model has shown very promising results in terms of both recall and accuracy, meaning we can accurately predict which customers will churn. However, from a business perspective we would also like to go further and look at the probability of churn, in order to determine how much should be spent to prevent it.

When looking at probability, we want to ensure that it is accurately calibrated (i.e., when the model predicts a 60% probability of churn, 60% of those customers actually churn). We can check this by creating a calibration plot.

When we performed the train/dev/test split, we kept a 50% churn proportion in the train data set but the native 6% churn proportion in both the dev and test splits. Because we trained on the train data set, whose class proportions differ greatly from dev, our calibration is very inaccurate:

def plot_calibration(models, testing_data, testing_labels, title):
    """ This function plots calibration plot using sklearn packages in order to visualize
    the efficiency in which models predict probabilities. It is based on the work presented
    in the sklearn documentation (http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html).

    Also prints the brier score for reference.

    Args:
        models: A list of tuples that contain the pre-fit model (with predict_proba as a method)
            as well as a string of the name of the model
        testing_data: Data used for testing the pre-fit model. **Data should not have previously
            been used for testing
        testing_labels: Labels for the testing data
        title: String to be used as title for plot

    Returns:
        N/A: Prints plot to screen


    """
    plt.figure(figsize=(9, 9))
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))
    ax1.plot([0,1], [0,1], 'k:', label='Perfect Calibration')
    for clf, name in models:
        #Get probabilities for the specific model using the test dataset
        prob_pos = clf.predict_proba(testing_data)[:,1]
        #Use sklearn's calibration_curve implementation to extract data to plot in the calibration curve
        frac_pos, mean_pred = calibration_curve(testing_labels, prob_pos, n_bins = 10)
        ax1.plot(mean_pred, frac_pos, "s-", label='%s' %(name,))
        ax2.hist(prob_pos, range=(0,1), bins=10, label=name, histtype='step', lw=2)
        #Print the Brier Score - used for quantifying calibration success
        print("%s Brier Score: %1.3f" %(name, brier_score_loss(testing_labels, prob_pos)))
    ax1.set_ylabel("Fraction of positives")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax1.set_title(title)

    ax2.set_xlabel("Mean predicted value")
    ax2.set_ylabel("Count")
    ax2.legend(loc="upper center", ncol=2)

    plt.tight_layout()
#Baseline score using models that were previously computed to optimize for recall and accuracy
model = [(clf_NB_Gauss, 'Naive Bayes'),
         (clf_neigh, 'Nearest Neighbors'),
         (clf_RF, 'Random Forest'),
         (clf_SVM, 'Support Vector Machine'),
         (clf_XGB, 'XG Boost'),
         (clf_GS_XGB, 'XG Boost Optimized')]
title = 'Calibration Plot for Previously Computed Models'
plot_calibration(model, dev_data, dev_labels, title)
Naive Bayes Brier Score: 0.060
Nearest Neighbors Brier Score: 0.243
Random Forest Brier Score: 0.027
Support Vector Machine Brier Score: 0.256
XG Boost Brier Score: 0.025
XG Boost Optimized Brier Score: 0.022

Drawing

Because of the very different churn proportions between the train and dev datasets, the calibration curves for all models perform very poorly. We therefore need to fit a calibration model, trained on the dev set, which hasn't been used for training and has the correct churn proportion. After training on the dev set, we can then assess calibration quality on the test set, which until now has not been utilized. Due to the limited size of our dataset, we will train the calibrator on the dev set; however, an additional subset of data that hasn't previously been used would be a preferred approach and is recommended for future development.

We can see the effect of the different churn proportions in the train and test datasets in the histogram underneath the calibration curve. We know that in the dev set 94% of the data is labeled 0 (and so should receive a small probability), yet there are peaks around the 50% mark, especially in the SVM model. This distorts the probability computation and causes the models to be poorly calibrated.
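As an aside, because the miscalibration here comes purely from the shift in class priors (50% in train vs. 6% in dev/test), a closed-form correction also exists (Elkan 2001; Saerens et al. 2002). The sketch below is illustrative only and was not part of our analysis; correct_for_prior_shift is a hypothetical helper:

import numpy as np

#Hypothetical helper (not from this notebook): rescale probabilities from a
#model trained at train_prior so they reflect target_prior
def correct_for_prior_shift(p, train_prior=0.5, target_prior=0.06):
    p = np.asarray(p, dtype=float)
    num = p * target_prior / train_prior
    den = num + (1.0 - p) * (1.0 - target_prior) / (1.0 - train_prior)
    return num / den

#A raw 0.5 score from the 50/50-trained model maps back to the ~6% base rate
print(correct_for_prior_shift([0.5, 0.9]))  # [0.06, ~0.36]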

We will start with the built-in sklearn CalibratedClassifierCV, which supports Platt scaling (fitting a logistic regression model to the classifier's scores) and isotonic regression calibration procedures.

#Fit isotonic and sigmoid calibration to the XG Boost Model
clf_isotonic = CalibratedClassifierCV(clf_GS_XGB, cv = 2, method = 'isotonic')
clf_isotonic.fit(dev_data, dev_labels)

clf_sigmoid = CalibratedClassifierCV(clf_GS_XGB, cv = 2, method = 'sigmoid')
clf_sigmoid.fit(dev_data, dev_labels)
C:\Users\AOlson\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:





CalibratedClassifierCV(base_estimator=GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jo....01], 'reg_alpha': [0.1]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='recall', verbose=0),
            cv=2, method='sigmoid')

The sklearn calibration functions operate on the dataset in raw form (rather than on predicted probabilities from the prior model). Below, when we implement our own approach, we instead train the calibration model on the predicted probabilities of the previous model (clf_GS_XGB).

model = [(clf_GS_XGB, 'XG Boost'),
         (clf_isotonic, 'XGB Isotonic Calibration'),
         (clf_sigmoid, 'XGB Sigmoid Calibration')]
title = 'Calibration Plot for Isotonic and Sigmoid Calibration Functions'
plot_calibration(model, test_data, test_labels, title)
XG Boost Brier Score: 0.021
XGB Isotonic Calibration Brier Score: 0.013
XGB Sigmoid Calibration Brier Score: 0.015

Drawing

We can see that both calibration procedures significantly improved over the non-calibrated XG Boost model (we calibrate XG Boost because it provided the highest accuracy and recall). The isotonic calibration worked better than the sigmoid.

From the histogram plot we can see that most predictions have a low probability, which makes sense based on the skew in the dev and test datasets (only 6% churn).

We then wanted to test other methods of calibration to determine if our own implementation would result in improved calibration.

#Initialize models used for calibration
clf_LR_class = LogisticRegression()
clf_RF_class = RandomForestClassifier()
clf_NB_class = GaussianNB()
clf_SVM_class = svm.SVC(kernel='rbf', probability = True)

#Get probability data from the former model (XG Boost) in order to fit the calibration model
probs_dev = clf_GS_XGB.predict_proba(dev_data)[:, 1]
probs_dev = probs_dev.reshape(-1, 1)
probs_test = clf_GS_XGB.predict_proba(test_data)[:,1]
probs_test = probs_test.reshape(-1,1)

model = [(clf_LR_class, 'Logistic Regression Calibration'),
         (clf_RF_class, 'Random Forest Calibration'),
         (clf_NB_class, 'Naive Bayes Calibration'),
         (clf_SVM_class, 'SVM Calibration')]
#Fit the models used for calibration with the probability output from XG Boost
for clf, name in model:
    clf.fit(probs_dev, dev_labels)
title = 'Calibration Plot for Implementation of Calibrated Models using other ML Models'
plot_calibration(model, probs_test, test_labels, title)
Logistic Regression Calibration Brier Score: 0.013
Random Forest Calibration Brier Score: 0.017
Naive Bayes Calibration Brier Score: 0.020
SVM Calibration Brier Score: 0.011

Drawing

Here we can see that our implementation, using the default hyperparameters of the sklearn models, improved on the baseline (clf_GS_XGB). Additionally, while visually all lines appear less calibrated than in the former calibration step (isotonic and sigmoid), the Brier score for the SVM implementation was actually lower (better) than for the isotonic implementation. The dataset here is limited (5,000 samples), and visually the isotonic curve does follow the 45-degree line more closely than SVM. SVM sits very close to the 45-degree line in several places, which may pull down its Brier score on this limited dataset, even though its overall calibration would be worse.

Because the intent of the calibration is to feed an economic model that will analyze churn over a range of probabilities, we will utilize the isotonic calibration model, as it stays close to the 45-degree line throughout the range of probabilities where we would like to recommend action to prevent churn.

Future work should involve larger datasets to validate the use of the isotonic calibration or support an alternative method.

We can now use our calibrated probabilities to feed our economic model and provide business insight into the problem of customer churn.

5. Economic Impact

Economic model to plug into the business plan

To this point we have the following information:

* Users who could churn (from the model)
* Probability of churn (from the calibrated model)
* Spending metrics (from the data)

The next step is to guide the business on worthwhile spending to keep the customers at risk. We'll come up with a model for this spend (marketing spend) and then apply it to our data. The marketing spend can be used for loyalty programs, incentives, or other kinds of tiered discounting programs. We'll keep the form factor of the spend out of scope for this report.

For the economic analysis, data has been provided in NTD (New Taiwan Dollars). This will be the currency used throughout this analysis.

Coming up with the economic model for retention

We'll start with some metrics for the customer and business that are visible from the data or our feature list

* Optimum lifetime value of a customer is assumed to be the revenue from the highest paying customer (for the purpose of this report). We calculate 2 metrics for this: 
    * Optimum lifetime value - OLTV - max paid/day from our sample
    * Optimum lifetime value (3 years) - OLTV3y

* Lifetime value of a customer is the actual revenue from the customer. We again consider 2 metrics for this:
    * Life time value - LTV - actuals from the customer/day
    * Life time value (3 years) - LTV3y

* Average lifetime value of a customer is the average of what's paid by our sample customers. The 2 metrics for this are:
    * Average lifetime value - ALTV - average/day
    * Average lifetime value (3 years) - ALTV3y

The average 3-year lifetime value of a customer can help us make assumptions about what can be spent to acquire the customer in the first place (CAC). We're assuming that this value is 10% of the average lifetime value: **CAC = .1 * ALTV3y**

Now we know the value (revenue) of our customers and what it costs to acquire them. We'll move on to find the spend to keep them. CAC is relevant as it helps us establish a ceiling for our retention/reacquisition cost (RAC) spend. We'll assert that the reacquisition cost for a customer cannot exceed 75% of the original acquisition cost.

We would not want to spend the reacquisition budget equally on all customers. We'd want to optimize this spend based on the following:

* Value of the customer. LTV/OLTV (lifetime value of a customer / lifetime value of our optimum customer) is a good representation of a customer's value. It is simplistic, as it leaves out intangibles like the social and support cost impact of some customers, but we can easily extend our model for those factors

* Risk of flight. The probability of churn (POC) is a good representation of this

Combining all of the above, we arrive at the following model for reacquisition spending (RAC):

**RAC = .75 * CAC * POC * (LTV/OLTV)**

We'll calculate this value individually for each customer.
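For illustration, using the sample values computed below (CAC ≈ NT$497.5, OLTV ≈ NT$13.45/day): a customer paying the average NT$4.54/day with an 80% probability of churn would receive RAC = 0.75 × 497.5 × 0.8 × (4.54 / 13.45) ≈ NT$101 over the three-year horizon.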

An enhancement to our recommendation could be finding RAC clusters to suggest tiers of spending. We've not attempted that in this report (though a sketch of the idea appears in the Economic Impact Summary below).

Probabilities from our model

# Get the probabilities for economic model
# We'll use the calibrated model

# Prediction probabilities
predictions_prob = clf_isotonic.predict_proba(dev_data)

# Predictions
predictions = clf_isotonic.predict(dev_data)

# Test

print ('''

Dev data shape: {}
Predictions shape: {}
Sample of predictions: {}
Prediction probabilities shape: {}
Sample of prediction probabilities: 
{}

'''.format(dev_data.shape, predictions.shape, predictions[:5], predictions_prob.shape ,predictions_prob[:5])
      )
Dev data shape: (11681, 159)
Predictions shape: (11681,)
Sample of predictions: [0 0 0 0 0]
Prediction probabilities shape: (11681, 2)
Sample of prediction probabilities: 
[[9.91e-01 8.57e-03]
 [9.98e-01 1.70e-03]
 [9.99e-01 1.18e-03]
 [1.00e+00 0.00e+00]
 [1.00e+00 2.04e-04]]
# Starting the marketing data frame

marketing_data = dev_data.copy(deep = True)
# Test
print('''
Marketing_data shape: {}

'''.format(marketing_data.shape)
     )
Marketing_data shape: (11681, 159)
# Adding predictions and probability (From the model)

marketing_data['predictions'] = predictions
marketing_data['probability_churn'] = predictions_prob[:,1]

Lifetime value metrics for the customers

# Optimum life time value (per day ) can be represented by the max of "amount paid per day" among our sample

oltv = marketing_data['amount_paid_per_day'].max()
oltv3y = oltv * 365 * 3

print('''
Optimum lifetime value of the customer per day (NTD/day): NT${:,.2f}
Optimum lifetime value of the customer (3 Years) (NTD): NT${:,.2f}

'''.format(oltv, oltv3y)
     )
Optimum lifetime value of the customer per day (NTD/day): NT$13.45
Optimum lifetime value of the customer (3 Years) (NTD): NT$14,727.75
# Average lifetime value (ALTV) can be represented by spend/day 
# Over 3 years it helps us calculate the cost to acquire the customer (CAC)
# The business plan allows CAC to be 10% of revenue from the customer

altv = marketing_data['amount_paid_per_day'].mean()
altv3y = marketing_data['amount_paid_per_day'].mean() * 365 * 3
cac = .1 * altv3y

print('''
Average lifetime value of the customer per day (NTD/day): NT${:,.4f}
Average lifetime value of the customer: (3 Years) (NTD): NT${:,.4f}
Customer acquisition cost (NTD): NT${:,.4f}

'''.format(altv, altv3y, cac)
     )
Average lifetime value of the customer per day (NTD/day): NT$4.5435
Average lifetime value of the customer: (3 Years) (NTD): NT$4,975.1425
Customer acquisition cost (NTD): NT$497.5143

Adding the reacquisition cost suggestion to our model

RAC = .75 * CAC * POC * (LTV/OLTV)

# Reacquistion cost (marketing spend to prevent churn):
# RAC = .75 * CAC * POC * (LTV/OLTV)

print('''

Shape of predictions: {}
Shape of predictions_prob: {}
Shape of marketing_data['amount_paid_per_day']: {}

'''.format(predictions.shape, predictions_prob[:,1].shape, marketing_data['amount_paid_per_day'].shape)
     )
marketing_data['proposed_spend'] = ( .75 * cac
                                    * predictions_prob[:,1]
                                    * predictions
                                    * (marketing_data['amount_paid_per_day']/oltv)
                                    )
Shape of predictions: (11681,)
Shape of predictions_prob: (11681,)
Shape of marketing_data['amount_paid_per_day']: (11681,)
# Test
print(marketing_data.shape)
marketing_data.head()
(11681, 162)
(Output truncated: marketing_data.head() displays the first five users across all 162 columns, including the newly added predictions, probability_churn, and proposed_spend columns.)

Validation

We're working out the following to make sure that the numbers for the reacquisition marketing spend make sense in the context of our business:

* Total marketing spend
* Marketing spend per customer (all customers)
* Marketing spend per customer with probability of churn (indicated as churn by the model)
* Comparison with the earnings from the sample

# Validation
# What is the total proposed marketing spend? 
tpms = marketing_data['proposed_spend'].sum()
# Averaging (per user)
tpmsa = tpms/len(predictions)
# Averaging (for users marked for churn)
ch = np.count_nonzero(predictions)
tpmsac = tpms/ch

print('''
Total proposed marketing spend (NTD): NT${:,.0f}
Total number of users in the sample: {:d}
Average Reacquisition/retention marketing spend/user (NTD): NT${:,.4f}
Total number of users projected to churn (any probability): {:d}
Average Reacquisition/retention marketing spend for users marked to churn (NTD): NT${:,.4f}

''' .format(tpms, len(predictions), tpmsa, ch,  tpmsac)
     )
Total proposed marketing spend (NTD): NT$77,025
Total number of users in the sample: 11681
Average Reacquisition/retention marketing spend/user (NTD): NT$6.5940
Total number of users projected to churn (any probability): 705
Average Reacquisition/retention marketing spend for users marked to churn (NTD): NT$109.2551
# Validation
# How does the marketing spend compare to the projected earnings from this sample

# Ratio of marketing(per user) to topline for this sample
rms = (tpms/len(predictions))/altv3y

print('''
Projected 3 year earnings from users in dev data (per user) (NTD): NT${:,.4f}
Ratio of Reacquisition/Retention marketing to topline from users in dev data: {:.4f}

'''.format(altv3y, rms)
     )
Projected 3 year earnings from users in dev data (per user) (NTD): NT$4,975.1425
Ratio of Reacquisition/Retention marketing to topline from users in dev data: 0.0013

Visual for marketing spend - amount spent/day by all the customers

# Marketing spend - amount spent/day by the users
# plt.scatter(marketing_data['amount_paid_per_day'], marketing_data['proposed_spend'], s=4,
           #c=cm.hot(marketing_data['proposed_spend']))

plt.xlabel('Amount paid per day by a customer (NTD)')
plt.ylabel('Proposed retention marketing spend per customer (NTD)')
plt.title('Retention marketing spend - Amount paid per day by a customer')    
plt.scatter(marketing_data['amount_paid_per_day'], marketing_data['proposed_spend'], s=3)
plt.show()

Drawing

The plot above looks at proposed retention spend per customer against the amount paid per day by that customer. Some scatter exists in the plot, showing that not all members at a specific NTD/day price level are equal in terms of churn risk. The proposed retention spend is calculated over the three-year (assumed) lifetime of the customer (so at a level of 100 NTD, the proposed incentive per day would be ~0.09 NTD).

Visual for marketing spend in the context of the total number of customers

# Visual of spend in context of the number of users
plt.title('Proposed retention marketing spend spread among all customers')
plt.xlabel('Proposed retention marketing spend per customer')
plt.ylabel('Number of customers')
plt.hist(marketing_data['proposed_spend'] )
plt.show()

Drawing

From the plot above, we can see that for most users the proposed spend is 0: the model predicts they won't churn, and therefore no incentives should be offered. This makes economic sense, as our dataset contains only 6% churn, and we cannot afford to offer incentives to members if they're unlikely to churn.

We do see some proposed spend between NT$50 and NT$150, which is consistent with the earlier scatter plot; however, overwhelmingly the proposed spend on a customer is 0 NTD (New Taiwan Dollars).

Visual for marketing spend per user

#Scatter Plot to Show Proposed Spend per User
sorted_marketing = marketing_data.sort_values(by='proposed_spend')
sorted_marketing = sorted_marketing[sorted_marketing['proposed_spend'] >0.0]['proposed_spend']
sorted_marketing = sorted_marketing.reset_index()
plt.scatter(sorted_marketing.index, sorted_marketing['proposed_spend'])
plt.xlabel('Member - re-indexed to integer for plotting purposes')
plt.ylabel('Maximum proposed spend on each user (NTD)')
plt.title('Sorted proposed spend on user over 3 year lifetime')
Text(0.5,1,'Sorted proposed spend on user over 3 year lifetime')

Drawing

The plot above looks at the sorted proposed spend for users. The plot has been sorted in order to determine if there are any insights that can be made by looking at proposed spend levels. We can see near the 85 NTD (New Taiwan Dollar) mark there is a slight plateau, which could indicate a good level of incentive to offer users.

Beyond the slight plateau, the maximum proposed spend appears to increase approximately linearly over the range of 75 to 130 NTD. This plot indicates that users whose proposed spend is >130 may not be worthwhile to pursue, due to the high cost of incentives.

The proposed spend is the maximum spend allocated over the customer lifetime (assumed 3 years) and would be distributed through discounts and incentives. It may be possible to prevent churn by offering less than this maximum spend amount, which is an opportunity for a future ML model.

# Recap of key metrics

print('''
------> Key metrics from the model:

Total number of users in the sample: {}
Total number of users projected to churn (any probability): {}

Total proposed marketing spend: NT${:,.0f}
Total revenue (3 year) from all users in the sample: NT${:,.0f}
Total revenue (3 year) from users at risk in the sample: NT${:,.0f}

Average Reacquisition/retention marketing spend/user: {}
Average Reacquisition/retention marketing spend for users marked to churn: {}

Ratio of Reacquisition marketing spend to revenue from all users: {:.4f}
Ratio of Reacquisition marketing spend to revenue from at risk users: {:.4f}

''' .format(len(predictions), ch, int(tpms), int(altv3y*len(predictions)),
            int(altv3y*ch), int(tpmsa), int(tpmsac),
           tpms/(int(altv3y*len(predictions))),
           tpms/int(altv3y*ch) )
     )
------> Key metrics from the model:

Total number of users in the sample: 11681
Total number of users projected to churn (any probability): 705

Total proposed marketing spend: NT$77,024
Total revenue (3 year) from all users in the sample: NT$58,114,639
Total revenue (3 year) from users at risk in the sample: NT$3,507,475

Average Reacquisition/retention marketing spend/user: 6
Average Reacquisition/retention marketing spend for users marked to churn: 109

Ratio of Reacquisition marketing spend to revenue from all users: 0.0013
Ratio of Reacquisition marketing spend to revenue from at risk users: 0.0220

Economic Impact Summary

We were able to create a very usable model for the retention/reacquisition effort in the business. In summary, of our NT\$58M 3-year revenue from our customer base, we estimate we will lose NT\$3.5M to churn. Accounting for the probability that users will actually churn, the team recommends spending NT\$77K on trying to retain these users. In the marketing data, the NT\$77K is broken down by user; however, we would recommend a discount program with a few tiers, as opposed to custom offers for every user (a sketch of how such tiers might be derived follows below).
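As a hedged illustration only (this was not run as part of our analysis), tiers could be derived by clustering the positive proposed_spend values; the cluster centers would then suggest discount levels:

#Illustrative sketch (not run in this report): cluster positive
#proposed_spend values from marketing_data into three incentive tiers
from sklearn.cluster import KMeans
import numpy as np

at_risk_spend = marketing_data.loc[marketing_data['proposed_spend'] > 0,
                                   'proposed_spend'].values.reshape(-1, 1)
kmeans = KMeans(n_clusters=3, random_state=0).fit(at_risk_spend)
tiers = np.sort(kmeans.cluster_centers_.ravel())
print('Suggested incentive tiers (NTD): {}'.format(np.round(tiers, 0)))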

The marketing data frame keeps most original customer parameters and adds the following for use in business planning:

  • predictions - Will the customer churn (1 = will churn)
  • probability_churn - The probability of the customer leaving (higher is worse)
  • proposed_spend - We can spend up to this value to keep this customer

Our calculations (mostly for validation in the report) show that the retention marketing spend is in line with the revenue opportunity for the business. Key metrics are calculated again above.

One caveat to our assertion is the effectiveness of reacquisition marketing spend. Feedback on that effectiveness could influence the value of our spend as we put the model into production.

Another thing to note is that our model proposes a spend for every customer at risk. In reality, we may not be able to save customers above an 80% probability of churn. Should we invest in these users? A sketch of this adjustment follows.
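A minimal sketch of that adjustment, assuming a hypothetical 80% cutoff on the calibrated probabilities:

#Hypothetical adjustment (the 0.8 threshold is an assumption, not a
#recommendation): zero out proposed spend for users we may be unable to save
saveable = marketing_data['probability_churn'] <= 0.8
adjusted_spend = marketing_data['proposed_spend'].where(saveable, 0.0)
print('Proposed spend excluding users above 80% churn probability: NT${:,.0f}'
      .format(adjusted_spend.sum()))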

6. Final Insights and Takeaways

Here are our insights and commentary from our analysis:

Predictive Modeling Summary and Future Considerations

As discussed in the modeling section, the final model had a recall performance of 98.3% (on the test data), correctly predicting almost every user who actually churned (408 of 415 users), and also correctly predicting almost every user who would actually stay (6,444 of 6,609 users).

Given the strong performance of the current model, any future enhancements would need to balance the cost of making those enhancements against the potential benefits of retaining a few more users. That said, the team thinks the best place to find additional gains in the model is by engineering additional features from the users, potentially from reviewing user listening habits such as music genre.

Economic Impact

We created and validated a model to guide retention program spending for customers at risk of churn (RAC = .75 * CAC * POC * (LTV/OLTV), where RAC is the reacquisition cost or retention spending per customer, CAC is the cost of customer acquisition, POC is the probability of churn, LTV is the lifetime value of a customer, and OLTV is the optimum lifetime value, i.e. the revenue from our best customer).

The model works with a spend ceiling (the customer acquisition cost), which is about 10% of the lifetime value of an average customer. We stay below this limit by scaling the amount down by the customer's value and probability of churn.

A sample run with our development data shows a total marketing spend of ~NT\$77K to protect ~NT\$3.5M of revenue. The model shows how this amount can be divvied up among the customers (in the marketing data frame).

A production version of this model can guide the business on both the candidates and amount for retention spending among the customer base.

Productization and Future Development

Implementation as a pipeline, performance testing and feedback loop

Our assessment is that at least the following would have to be done to put the model into production:

* Implementation as a pipeline: The model is quite efficient and can be implemented offline or in near real time
    * For an offline implementation, data can be picked up from backup databases and pumped into a system like HDFS. Predictions can then be computed on Hadoop or an application like Spark
    * A near-realtime pipeline can be built by duplicating usage and transaction events into a Kafka -> Spark -> HDFS pipeline for writing, and then batch processing similarly to the above

* Performance testing: Further performance testing is recommended for both an offline and a realtime pipeline. Our testing was with a small subset on a laptop and may not reflect the needs of the enterprise

* Feedback loop: A feedback loop and experiments for the following could increase the efficacy of the model
    * Predictions: Feedback on predictions can be implemented by having a small control group on which we do not use our retention methods
    * Retention spending: Feedback on our retention methods can come directly from the group on which we are using them. In the simplest case, we'll just have to find out whether the customer stayed with the service
    * Models and calibration: Testing and recalibration would be needed about once a quarter

The goal of the above would be to maintain or improve the model's accuracy and recall while rolling it out to growing customer cohorts. From this analysis, we hope to provide insight into which customers have the potential to churn, and to recommend incentives for those customers to prevent churn. We believe the future work listed above will help make a significant impact on business operations, generating more revenue and profit for the company.

An additional topic that could be explored in the future is understanding at what incentive level a user won't churn. Here we present the maximum incentive spend (over a 3-year span) for a customer; however, from a business sense we would like to offer the minimum incentive that prevents churn. Such a model can build on what has already been developed here and will require additional data capturing incentives that have previously been offered and the outcomes of those offers.

Initial Data Filter and QC

import google.datalab.bigquery as bq
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Pasted here is the code that was initially run. As mentioned in the report, our dataset is large (~31GB combined tables) and couldn't all be processed on a single computer. For the sake of this analysis, we filtered the data down to a manageable size (~1.5GB). Additionally, we saw initially that our dataset contained only 6% churn. In order to help improve our model, we wanted a filtered dataset that contained a 50/50 split between churn and not churn.

The methodology to get this data into a usable format was the following:

- Upload data into Google Cloud Storage
- Utilize Google BigQuery to run SQL statements against the datasets
- Export the datasets as CSV files to be managed locally

For our dataset we wanted approximately 100k members (the total dataset contains ~993k members, so approximately 10% of our data). Utilizing a 50/50 split, we therefore wanted 50k members who did not churn and 50k members who did. In order to filter the tables, we first ran queries against the labels table to sample each group:

SELECT * FROM [w207_kkbox_bq_data.labels] WHERE (RAND(5) < 50000/(SELECT COUNT(*) FROM [w207_kkbox_bq_data.labels] WHERE is_churn = 0) AND is_churn = 0)
SELECT * FROM [w207_kkbox_bq_data.labels] WHERE (RAND(5) < 50000/(SELECT COUNT(*) FROM [w207_kkbox_bq_data.labels] WHERE is_churn = 1) AND is_churn = 1)

These two tables were written to cloud storage and provided the baseline of the members that we would keep when querying other tables. The labels table contains the member ID as well as the is_churn (dependent) variable. We then used these two tables to query the other datasets, joining on member ID. [We kept these two tables separate and combined the is_churn = 0 and is_churn = 1 results into a single table at the end; in hindsight it would have been more efficient to combine the labels tables first.]

Members Table:

SELECT members.*  FROM [w207_kkbox_bq_data.members] as members INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON members.msno = lab.msno
SELECT members.*  FROM [w207_kkbox_bq_data.members] as members INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON members.msno = lab.msno

Transactions Table:

SELECT transactions.*  FROM [w207_kkbox_bq_data.transactions] as transactions INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON transactions.msno = lab.msno
SELECT transactions.*  FROM [w207_kkbox_bq_data.transactions] as transactions INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON transactions.msno = lab.msno

User_Logs Table:

SELECT user_logs.*  FROM [w207_kkbox_bq_data.user_logs] as user_logs INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_1] as lab ON user_logs.msno = lab.msno
SELECT user_logs.*  FROM [w207_kkbox_bq_data.user_logs] as user_logs INNER JOIN [w207_kkbox_bq_data.labels_filtered_50k_churn_0] as lab ON user_logs.msno = lab.msno

With this methodology there were two tables (is_churn = 0 and is_churn = 1) for each table in the original dataset. We then combined/appended each pair back into a single table and exported to csv (labels_filtered.csv, members_filtered.csv, transactions_filtered.csv, user_logs_filtered.csv). From these local files we then began development toward predicting churn.

While we moved ahead with this filtered dataset, we also wanted to perform a quick EDA comparing the entire dataset to the filtered dataset, to compare and contrast some of the features. We understand that we artificially inflated the is_churn = 1 proportion for purposes of model training, so there will be some differences between the filtered and original datasets.

%%bq query --name is_churn_large
SELECT SUM(is_churn) / COUNT(is_churn)
FROM `w207_kkbox_bq_data.labels` AS labels
is_churn_large.execute().result()
f0_
0.06392287077349786


(rows: 1, time: 0.3s, cached, job: job_WFvTJUNTiVU0lg8eZM7HiKzbFhYR)

%%bq query --name is_churn_small
SELECT SUM(is_churn) / COUNT(is_churn)
FROM `w207_kkbox_bq_data.labels_filtered_100k` AS labels
is_churn_small.execute().result()
f0_
0.5005559729526672


(rows: 1, time: 0.1s, cached, job: job_6qccT3EvuDYAbLt8e7bMXy_rQuJC)

Here we look at the proportion of is_churn data. We can see that we manipulated the churn percentage in the filtered dataset. This was intentional, in order to create an approximately even split between is_churn = 1 and is_churn = 0 to benefit model training and increase the number of churn examples (since our original dataset contains only 6% churn).
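The same check can be reproduced locally from the exported file, assuming labels_filtered.csv from the export step above keeps the is_churn column name:

import pandas as pd

#Quick local sanity check of the engineered ~50/50 churn proportion
labels = pd.read_csv('labels_filtered.csv')
print('Filtered churn proportion: {:.4f}'.format(labels['is_churn'].mean()))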

%%bq query --name cities
SELECT city, COUNT(msno) AS population
FROM `w207_kkbox_bq_data.members` 
GROUP BY city
cities.execute().result()
city | population
1 | 4804326
5 | 385069
9 | 47639
6 | 135200
4 | 246848
13 | 320978
22 | 210407
14 | 89940
8 | 45975
15 | 190213
17 | 27772
10 | 32482
11 | 47489
18 | 38039
12 | 66843
16 | 5092
21 | 30837
7 | 11610
3 | 27282
19 | 1199
20 | 4233


(rows: 21, time: 0.1s, cached, job: job_VRruJOzjmzBAePlRp8ei86w2l3uB)

%chart columns --data cities --fields city,population

Drawing

%%bq query --name cities
SELECT members_city AS city, COUNT(members_msno) AS population
FROM `w207_kkbox_bq_data.members_filtered_100k` 
GROUP BY city
cities.execute().result()
city | population
1 | 40639
10 | 763
4 | 5670
5 | 8469
13 | 11012
14 | 2273
15 | 5032
8 | 937
12 | 1395
22 | 4825
7 | 290
6 | 3113
21 | 657
17 | 581
11 | 1021
16 | 118
9 | 1107
18 | 900
3 | 584
20 | 76
19 | 11


(rows: 21, time: 0.2s, cached, job: job_gV83pX9lLCc3jSdGZhB1sEsQFWW0)

%chart columns --data cities --fields city,population

Drawing

We can see that there is one city (city = 1) that uses the KKBOX service far more than any other city across the geographical area where KKBOX is used. Comparing the two datasets, we see that city 1 holds a higher proportion of the total members in the original dataset than in the filtered one.

%%bq query --name city_renew
SELECT members.city AS city, CAST(SUM(transactions.is_auto_renew) / COUNT(members.city) AS FLOAT64) AS renew_by_city
FROM `w207_kkbox_bq_data.members` AS members
INNER JOIN `w207_kkbox_bq_data.transactions` AS transactions
ON members.msno = transactions.msno
GROUP BY members.city
ORDER BY renew_by_city DESC
%chart columns --data city_renew --fields city,renew_by_city

Drawing

%%bq query --name city_renew
SELECT members.members_city AS city, CAST(SUM(transactions.transactions_is_auto_renew) / COUNT(members.members_city) AS FLOAT64) AS renew_by_city
FROM `w207_kkbox_bq_data.members_filtered_100k` AS members
INNER JOIN `w207_kkbox_bq_data.transactions_filtered_100k` AS transactions
ON members.members_msno = transactions.transactions_msno
GROUP BY city
ORDER BY renew_by_city DESC
%chart columns --data city_renew --fields city,renew_by_city

Drawing

In the plot above we analyze the proportion of auto-renew customers by city. From before we saw that city 1 had the highest count overall for users. Here we can see that it also has the highest proportion of auto-renew customers. Overall, it appears that most customers are on an auto-renew plan which is good from a business perspective.

Comparing the two datasets, both bar charts show a similar structure, which suggests the datasets are consistent for this feature.

%%bq query --name gender_describe
SELECT COUNT(gender) AS gen FROM `w207_kkbox_bq_data.members` AS members GROUP BY gender
gender_describe.execute().result()
gen
1144613
1195355
0


(rows: 3, time: 0.2s, cached, job: job_AfNz0MGXT8OgGZx6EY3RMrmDKeOA)

%%bq query --name gender_describe
SELECT COUNT(members_gender) AS gen FROM `w207_kkbox_bq_data.members_filtered_100k` AS members GROUP BY members_gender
gender_describe.execute().result()
gen
24507
21630
0


(rows: 3, time: 0.1s, cached, job: job__0rEg7lP0hion9yO77VylbSMR6ij)

From the printout above, we can see that the gender distribution is similar between the whole and filtered datasets, with males having a slightly higher count than females.

Below we print out descriptions of the tables for comparison.

%%bq query --name describe_members
SELECT COUNT(DISTINCT(city)) AS city_count, COUNT(DISTINCT(registered_via)) AS registration_type, AVG(bd) AS average_bd
FROM `w207_kkbox_bq_data.members`
describe_members.execute().result()
city_count | registration_type | average_bd
21 | 18 | 9.795794295951625


(rows: 1, time: 0.2s, cached, job: job_s_s-BUim86KEeQsaPWK8aI7ePUsl)

%%bq query --name describe_members
SELECT COUNT(DISTINCT(members_city)) AS city_count, COUNT(DISTINCT(members_registered_via)) AS registration_type, AVG(members_bd) AS average_bd
FROM `w207_kkbox_bq_data.members_filtered_100k`
describe_members.execute().result()
city_count | registration_type | average_bd
21 | 5 | 14.915069350530343


(rows: 1, time: 0.2s, cached, job: job_RCIg6slHgd8nt-De7OuUfeub23gU)

We can see here that the filtered dataset doesn't capture all of the registration types (only 5 of the 18 total). Presumably there are a select number of very popular registration types and several lesser-used ones, which is why our data subset contains only a portion of the total registration types. The average bd is higher (presumably younger, if bd is computed as days after a particular date) in our filtered dataset.

%%bq query --name describe_transactions
SELECT COUNT(DISTINCT(payment_method_id)) as payment_method, COUNT(DISTINCT(plan_list_price)) as num_plans, SUM(is_auto_renew) / COUNT(is_auto_renew) as prop_auto_renew, 
AVG(actual_amount_paid) AS plan_revenue, SUM(is_cancel) / COUNT(is_cancel) AS prop_cancel
FROM `w207_kkbox_bq_data.transactions`
describe_transactions.execute().result()
payment_method | num_plans | prop_auto_renew | plan_revenue | prop_cancel
40 | 51 | 0.8519661406812573 | 141.98732048354586 | 0.03976522648819046


(rows: 1, time: 0.1s, cached, job: job_w7nJR7nlHvuW0mk-jUsSP5EW_pu6)

%%bq query --name describe_transactions
SELECT COUNT(DISTINCT(transactions_payment_method_id)) as payment_method, COUNT(DISTINCT(transactions_plan_list_price)) as num_plans, SUM(transactions_is_auto_renew) / COUNT(transactions_is_auto_renew) as prop_auto_renew, 
AVG(transactions_actual_amount_paid) AS plan_revenue, SUM(transactions_is_cancel) / COUNT(transactions_is_cancel) AS prop_cancel
FROM `w207_kkbox_bq_data.transactions_filtered_100k`
describe_transactions.execute().result()
payment_method | num_plans | prop_auto_renew | plan_revenue | prop_cancel
37 | 42 | 0.8315028382832431 | 145.68530483745585 | 0.031064849396989492


(rows: 1, time: 0.1s, cached, job: job_HNnfHsxnSm7qJdbkItU1IJPU4-ej)

Comparing the filtered and total datasets, we can see that both the number of payment methods and the number of plans decrease slightly in the filtered dataset, but not by much. One key highlight is that moving to the full dataset will require refitting on the larger dataset, because it contains values that have not been seen by the smaller model. The proportion of auto-renew members is very similar across both datasets, as is the revenue from the plan (average plan price). Slightly fewer members cancel in the filtered dataset than in the total dataset, though both datasets have a low proportion of cancellations (and, from before, cancel is lower than is_churn, which is an interesting observation). Additionally, given that the filtered dataset contains 50% churn, its low (and even lower than the total dataset) cancel proportion is surprising.

%%bq query --name describe_user_logs
SELECT SUM(total_secs) AS listening_time, SUM(num_unq) AS number_unique, AVG(date) AS average_date, SUM(num_100) AS sum_full_songs, SUM(num_25) AS sum_25per_songs
FROM `w207_kkbox_bq_data.user_logs`
describe_user_logs.execute().result()
listening_time | number_unique | average_date | sum_full_songs | sum_25per_songs
-5.665342138557264e+20 | 11798546903 | 20157392.77279009 | 12045813613 | 2553501878


(rows: 1, time: 0.1s, cached, job: job_SJWC3sRNxUqMaP_eGiz7LB-SRhZk)

%%bq query --name describe_user_logs
SELECT SUM(user_logs_total_secs) AS listening_time, SUM(user_logs_num_unq) AS number_unique, AVG(user_logs_date) AS average_date, SUM(user_logs_num_100) AS sum_full_songs, SUM(user_logs_num_25) AS sum_25per_songs
FROM `w207_kkbox_bq_data.user_logs_filtered_100k`
describe_user_logs.execute().result()
listening_time | number_unique | average_date | sum_full_songs | sum_25per_songs
-3.1202667405739975e+19 | 711055859 | 20158095.16186604 | 719873498 | 155565964


(rows: 1, time: 0.1s, cached, job: job_6Zclmg8gA3QsOm_qb04JlROqwHV9)

Comparing the two datasets, we can see that both have a negative total listening time (which intuitively doesn't make sense). It will be important to better understand how this data is collected in order to properly quality-check this column and its values. The number of unique songs is higher in the entire dataset, which makes sense as there is more data. Dates are in the form YYYYMMDD, so the average date for user logs falls in 2015 (averaging a date as an integer produces an invalid date; datalab doesn't handle dates the same way as BigQuery, so further analysis was not carried out in this notebook). The sums of full songs and 25% songs are also higher in the full dataset, which again makes sense. It is interesting to see that the ratio of full_songs/25per_songs is approximately the same (4.71 in the full vs 4.63 in the filtered).
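As a hedged local illustration (pandas rather than BigQuery; the two dates below are examples only, not values from the dataset) of the integer-averaging pitfall, converting YYYYMMDD integers to datetimes before averaging yields a valid date:

import pandas as pd

#Naive integer averaging of YYYYMMDD values produces an invalid "date"
dates_int = pd.Series([20150101, 20170228])
print('Naive integer average:', dates_int.mean())  # 20160164.5 -- not a real date

#Convert to datetimes, average the underlying nanosecond timestamps instead
dates = pd.to_datetime(dates_int.astype(str), format='%Y%m%d')
mean_date = pd.Timestamp(int(dates.astype('int64').mean()))
print('Proper average date:', mean_date.date())  # 2016-01-30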
