Blog: Customer Churn Behavior in the telecommunications sector
In our project we looked at customer churn behavior in telco contracts. The goal was to identify the variables that help telecommunications companies predict which customers are at risk of churning, so that the company can then design activities to retain those customers and prevent churn. This has great potential for companies: developing strategies to retain their customers, their main asset, is a key aspect of success. If the approach works in the telco segment, it can be transferred to other industries and businesses. In this work we analyzed a dataset of n=7034 customers in the aforementioned sector, with 21 variables that capture demographics and customer relationship history.
Before we started working with the data, we imported the basic packages we had used on DataCamp: numpy, pandas and matplotlib. Then we imported the data we had chosen: the file “WA_Fn-UseC_-Telco-Customer-Churn.csv” was read with pd.read_csv().
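The import and loading step could look roughly like the sketch below. The helper name load_telco and the three-row inline sample are assumptions made only so that the snippet runs without the real file; with the downloaded CSV in the working directory, calling load_telco() without an argument would read it directly.

```python
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def load_telco(source="WA_Fn-UseC_-Telco-Customer-Churn.csv"):
    """Read the churn CSV; `source` may be a file path or a file-like object."""
    return pd.read_csv(source)


# Hypothetical three-row stand-in for the real file (values are made up).
sample_csv = io.StringIO(
    "customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001-A,Female,1,29.85,29.85,No\n"
    "0002-B,Male,34,56.95,1889.5,No\n"
    "0003-C,Male,2,53.85,108.15,Yes\n"
)
data = load_telco(sample_csv)
print(data.shape)
print(data.head())
```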
Generally, it is important to clean the data before working with it. You also need to get an overview of the data you are working with.
We started by lowercasing all letters, because we found the data easier to handle that way. data.info() showed that the column “TotalCharges” has the type object although it should be numeric. For that reason, we converted it using the pandas function pd.to_numeric(). A second look at data.info() showed that the column now contained some null values. We deleted the rows containing them with dropna(). Finally, data.drop(columns=['customerID']) removed the customer ID column, because it has no relevance for our prediction.
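A minimal sketch of these cleaning steps on a made-up mini-sample follows. Two things are assumptions rather than something the post states: the blank string standing in for a missing total, and errors="coerce", which turns values that cannot be parsed into NaN so that dropna() can remove them. The lowercasing step is left out because the post does not say whether it applied to column names or to values.

```python
import pandas as pd

# Hypothetical mini-sample; the blank TotalCharges entry mimics a missing value.
data = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "No", "Yes"],
})
data.info()  # TotalCharges is listed as object

# Convert to numbers; unparseable entries become NaN (assumption: errors="coerce").
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce")

# Remove rows with missing values, then the identifier column.
data = data.dropna()
data = data.drop(columns=["customerID"])
data.info()  # TotalCharges is now float64, two rows remain
```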
Then we converted the columns that contain yes and no into one and zero, using a conditional expression inside a list comprehension. Some columns contain “Yes”, “No” and a third option that also means no because the service is not available to the customer. We decided that everything that is not “Yes” automatically counts as no, so we used the expression 1 if each == "Yes" else 0 for each value. We also encoded gender as one for female and zero for male. Finally, data.astype('float64') converted these columns to float.
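The encoding rule can be sketched as follows. The three-row frame and the choice of OnlineSecurity as the example column with a third option are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample with a yes/no column that has a third option.
data = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "Churn": ["No", "Yes", "No"],
})

# Everything that is not "Yes" becomes 0.
for col in ["OnlineSecurity", "Churn"]:
    data[col] = [1 if each == "Yes" else 0 for each in data[col]]

# Female is encoded as 1, male as 0.
data["gender"] = [1 if each == "Female" else 0 for each in data["gender"]]

# All columns are numeric now, so they can be cast to float.
data = data.astype("float64")
print(data)
```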
After cleaning the data, we checked whether the churn column correlates with the other columns by using the pandas method DataFrame.corr(), which computes the standard (Pearson) correlation coefficient for each pair of columns and returns a DataFrame. Plotting it gives a good overview of which variables seem to be important.
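As a small illustration of the correlation check, the sketch below builds a made-up six-row frame in which short tenure and high monthly charges go together with churn. The numbers are invented; only the calls mirror the step described above.

```python
import pandas as pd

# Invented data: churners have short tenure and high monthly charges.
data = pd.DataFrame({
    "tenure": [1, 2, 5, 40, 60, 72],
    "MonthlyCharges": [90.0, 85.0, 80.0, 30.0, 25.0, 20.0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

corr = data.corr()  # Pearson correlation by default, returned as a DataFrame

# Keep only the correlations with Churn and sort them for readability.
churn_corr = corr["Churn"].drop("Churn").sort_values()
print(churn_corr)
```

Calling churn_corr.plot(kind="bar") would draw these values as a bar chart.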
For a better understanding, we wanted to compare these important variables between customers who churned and customers who did not. That is why we split the data into one dataframe with all churned customers, named ‘churn_yes’, and one with the customers who stayed (‘churn_no’).
The boxplot underlines the negative correlation between tenure and churn. In this case that means that people who have been customers for a longer period are less likely to churn than new customers. As learned in the track, we drew the boxplot with the matplotlib package, which we had imported earlier. We also plotted the positive correlation between MonthlyCharges and churn.
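The split into churn_yes and churn_no and the tenure boxplot might be sketched like this. The eight tenure values are invented, and the Agg backend is selected only so that the snippet also runs without a display; in a notebook, plt.show() would display the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample: Churn is already encoded as 1 (churned) and 0 (stayed).
data = pd.DataFrame({
    "tenure": [1, 2, 5, 3, 40, 60, 72, 24],
    "Churn": [1, 1, 1, 1, 0, 0, 0, 0],
})

# One dataframe per group.
churn_yes = data[data["Churn"] == 1]
churn_no = data[data["Churn"] == 0]

# Side-by-side boxplots of tenure for both groups.
fig, ax = plt.subplots()
ax.boxplot([churn_yes["tenure"].to_numpy(), churn_no["tenure"].to_numpy()])
ax.set_xticks([1, 2])
ax.set_xticklabels(["churn_yes", "churn_no"])
ax.set_ylabel("tenure")

print(churn_yes["tenure"].median(), churn_no["tenure"].median())
```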
“Internet service” is not in the correlation matrix, because its type is object. To get an idea of its relevance, we made a pie chart. The same goes for the payment method.
The function pd.get_dummies() converts categorical variables into indicator variables.
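For example, a column with three categories becomes three indicator columns. The frame below is a made-up sample; the category labels follow the public Telco CSV.

```python
import pandas as pd

# Made-up sample with one categorical and one numeric column.
data = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No"],
    "tenure": [1, 34, 2],
})

# Each category gets its own indicator column; numeric columns are kept as they are.
encoded = pd.get_dummies(data, columns=["InternetService"])
print(encoded.columns.tolist())
```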
We decided to use a random forest to build a first model in Python. It is a “supervised learning algorithm […] Random forests creates decision trees on randomly selected data samples, gets prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of the feature importance.” (Navlani, DataCamp, 2018). First, we had to import all necessary packages. The variable we wanted to predict is the “Churn” column. That is why we created “Y”, which contains only this dependent variable, and a dataframe “X”, which contains all the other, independent variables. Because there is no separate test dataset, we created one by splitting the data into a training set and a test set. Then we trained our model on the training set. For that we first had to create the random forest classifier, clf. Calling clf.fit(X_train, Y_train) finally trains the model.
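The modelling steps above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (churn is generated by a simple made-up rule), and the 70/30 split, n_estimators=100 and the random seeds are choices of this sketch, not values reported in the post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: customers with short tenure and high charges churn.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n),
    "MonthlyCharges": rng.uniform(20, 110, size=n),
})
data["Churn"] = ((data["tenure"] < 12) & (data["MonthlyCharges"] > 60)).astype(int)

# Y holds the dependent variable, X all other columns.
X = data.drop(columns=["Churn"])
Y = data["Churn"]

# Hold back 30% of the rows as a test set (split ratio is an assumption).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Create the classifier and train it on the training set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, Y_train)

# Compare predictions with the actual test labels.
Y_pred = clf.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
print(accuracy)
```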
After training, we checked the accuracy by comparing the predicted values with the actual values of the test set. The forest has an accuracy of nearly 0.8, i.e. about 80 percent. Finally, a company wants to know which factors are important to its customers, so we used clf.feature_importances_ to get the features with the highest weight. For visualization we have learned that we can use a combination of matplotlib and seaborn. Another way to evaluate a classification is to create a confusion matrix. The square on the top left shows how many customers are predicted to stay and really stay. The square on the bottom right gives the number of people who are predicted to churn and actually churn. The area on the top right contains customers who are predicted to stay but churned. And the square on the bottom left reflects the number of customers who are predicted to churn but in fact stayed.
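A sketch of both evaluation steps on synthetic data follows. The confusion matrix is built with pd.crosstab so that rows are predictions and columns are actual outcomes, which is the layout the description above implies; note that scikit-learn's own confusion_matrix(y_true, y_pred) puts the actual labels in the rows instead. The data, the "noise" column and all parameters are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only tenure drives churn, "noise" is irrelevant.
rng = np.random.default_rng(1)
n = 500
data = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n),
    "MonthlyCharges": rng.uniform(20, 110, size=n),
    "noise": rng.normal(size=n),
})
data["Churn"] = (data["tenure"] < 12).astype(int)

X = data.drop(columns=["Churn"])
Y = data["Churn"]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, Y_train)

# Feature importances, labelled with the column names and sorted.
importance = pd.Series(clf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)

# Confusion matrix with predictions in rows and actual outcomes in columns:
# top right = predicted to stay (0) but actually churned (1).
Y_pred = clf.predict(X_test)
matrix = pd.crosstab(
    pd.Series(Y_pred, name="predicted"),
    pd.Series(Y_test.to_numpy(), name="actual"),
)
print(matrix)
```

For a chart, the sorted importances could be passed to seaborn, for example sns.barplot(x=importance.values, y=importance.index).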
Then we tried the XGBRegressor from the xgboost package. “The basic idea behind boosting algorithms is building a weak model, making conclusions about the various feature importance and parameters, and then using those conclusions to build a new, stronger model and capitalize on the misclassification error of the previous model and try to reduce it” (Pathak, 2018). In this case we created an xgb.XGBRegressor and called its fit(X_train, Y_train) method to train our model. The accuracy is slightly higher than that of the random forest. Again, we made a confusion matrix.
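The sketch below illustrates the same idea with scikit-learn's GradientBoostingRegressor as a stand-in for xgb.XGBRegressor, so that it runs without the xgboost package; the swap is a choice of this sketch, not the method used in the project. A regressor returns continuous scores rather than class labels, so the sketch assumes a cut-off of 0.5 to turn the scores into churn predictions before computing the accuracy. The post does not say which cut-off was used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, generated by a simple made-up rule.
rng = np.random.default_rng(2)
n = 400
data = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n),
    "MonthlyCharges": rng.uniform(20, 110, size=n),
})
data["Churn"] = ((data["tenure"] < 12) & (data["MonthlyCharges"] > 60)).astype(int)

X = data.drop(columns=["Churn"])
Y = data["Churn"]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Stand-in for xgb.XGBRegressor: a gradient-boosted regression model.
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, Y_train)

# The regressor outputs continuous scores; threshold them at 0.5 (assumption).
scores = model.predict(X_test)
Y_pred = (scores >= 0.5).astype(int)
accuracy = accuracy_score(Y_test, Y_pred)
print(accuracy)
```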
Finally, we tried a logistic regression. “Logistic regression describes and estimates the relationship between one dependent binary variable and independent variables.” (Navlani, DataCamp, 2018). In this case we needed a binary logistic regression, because the target variable has only two possible outcomes. Again we trained our model using the function logreg.fit(). After that we tested the accuracy of the model again. In this case we get an accuracy of nearly 0.81. The confusion matrix is read in the same way as described before.
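A minimal sketch of the logistic regression step, again on synthetic data: churn is drawn from a made-up logistic model in which higher monthly charges raise and longer tenure lowers the churn probability. The coefficients used to generate the data, max_iter=1000 and the split ratio are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data drawn from a made-up logistic model.
rng = np.random.default_rng(3)
n = 1000
tenure = rng.integers(0, 72, size=n)
monthly = rng.uniform(20, 110, size=n)
logit = 0.05 * (monthly - 65) - 0.08 * (tenure - 30)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"tenure": tenure, "MonthlyCharges": monthly})
Y = pd.Series(churn, name="Churn")
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Binary logistic regression: the target has two outcomes (0 = stay, 1 = churn).
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, Y_train)

# score() returns the accuracy on the held-out test set.
accuracy = logreg.score(X_test, Y_test)
coefficients = pd.Series(logreg.coef_[0], index=X.columns)
print(accuracy)
print(coefficients)
```

The sign of each coefficient shows the direction of the relationship: negative for tenure and positive for MonthlyCharges in this synthetic example.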
Because the logistic regression is the model with the greatest accuracy, we used its results for the comparison with the plots we made before.
The basic steps, such as inspecting data, cleaning data and plotting with matplotlib, we knew from the tracks. Most of the time, however, we had to google a lot to find the right method.
Our model has an accuracy of 80%, and the confusion matrix shows a high success rate in identifying actual churn. While there are some false positives and negatives, the number of correct classifications of both customer churn and customer retention is good enough to be useful to a business. The factors that most affect the decision are tenure, monthly charges and total charges. Monthly charges are positively related to customer churn, which shows that churn is higher the more expensive a contract is. A possible explanation is income: the less money a person earns a month, the less money the person can spend on telco contracts. Tenure, in contrast, is inversely related to customer churn.
Interestingly, total charges do not follow the same logic: one would expect total charges to be positively related to customer churn, when in fact the opposite is the case. This suggests that people may not care as much about saving money in the long run.
We conclude that customers with a high-value contract (high monthly payments) in particular should be kept satisfied; otherwise they might lose the feeling that the contract is worth the money. Additional services and quick and easy communication are recommended. Furthermore, tailored offers for people with a low income should be designed, e.g. lower prices or payment in installments. In addition, customers need to be satisfied especially at the beginning of the relationship, because that is when they are most likely to churn. Overall, a holistic service for customers at every stage needs to be ensured, which underlines the value of well-trained sales personnel. Investments by the company to prevent churn can be recommended.
The challenging part of this work was getting started and finding a way to answer our question, that is, identifying the next step to take. We had to go back and forth during our work. Visualizing the data helped us to get a feeling for tendencies and distributions. When we faced issues, we can say “Google was our friend”. Even if we did not understand everything, it helped to solve the problem. It has been a valuable experience to work in depth on a specific topic. Finally answering the question, and the small moments of success in between, made the work fun. It showed us what we are capable of.
Team Members
– Caro Engelhardt: Data Science Python
– Eileen Steinebach: Data Science Python
– Liv Weirauch: Data Science Python (LinkedIn: https://www.linkedin.com/in/liv-janicke-weirauch/)
– Paulina Partz: Data Science Python
Navlani, A. (2018, May 16th). DataCamp. Retrieved from https://www.datacamp.com/community/tutorials/random-forests-classifier-python
Navlani, A. (2018, September 7th). DataCamp. Retrieved from https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
Pathak, M. (2018, July 10th). DataCamp. Retrieved from https://www.datacamp.com/community/tutorials/xgboost-in-python#what