Blog: Predicting Your Chances Of Surviving The Titanic Disaster With C# And ML.NET Machine Learning
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
In this article, I’ll show you how to analyze what sorts of people were likely to survive. I’m going to build a C# machine learning app with ML.NET and .NET Core to predict which passengers survived the tragedy.
ML.NET is Microsoft’s new machine learning library. It can run linear regression, logistic regression, clustering, deep learning, and many other machine learning algorithms.
And .NET Core is Microsoft’s cross-platform .NET framework that runs on Windows, macOS, and Linux. It’s the future of cross-platform .NET development.
The first thing I need for my app is a data file with the Titanic passenger manifest and a label indicating which passengers survived the disaster. I will use the famous Kaggle Titanic Dataset, whose training file has data for 891 passengers.
The training data file looks like this:
It’s a CSV file with 12 columns of information:
- The passenger identifier
- The label column containing ‘1’ if the passenger survived and ‘0’ if the passenger perished
- The class of travel (1–3)
- The name of the passenger
- The gender of the passenger (‘male’ or ‘female’)
- The age of the passenger, or ‘0’ if the age is unknown
- The number of siblings and/or spouses aboard
- The number of parents and/or children aboard
- The ticket number
- The fare paid
- The cabin number
- The port in which the passenger embarked
The second column is the label: 0 means the passenger perished, and 1 means the passenger survived. All other columns are input data from the passenger manifest.
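To make the column layout concrete, here is a minimal plain-C# sketch that splits one row into its 12 fields. It assumes that a quoted name field can itself contain a comma, which is why a simple `Split(',')` would not be enough. The `CsvSketch` class and the sample row are hypothetical and made up for illustration; this is not how ML.NET parses the file.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Split one CSV line into fields, honouring double-quoted fields
    // that may contain commas or doubled ("") quote characters.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote inside a quoted field
                    i++;
                }
                else
                {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            }
            else if (c == ',' && !inQuotes)
            {
                fields.Add(current.ToString()); // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString()); // last field has no trailing comma
        return fields;
    }

    public static void Main()
    {
        // Made-up row in the 12-column layout described above.
        string row = "1,0,3,\"Doe, Mr. John\",male,22,1,0,A/5 21171,7.25,,S";
        List<string> fields = SplitLine(row);
        Console.WriteLine($"{fields.Count} columns, name = {fields[3]}, label = {fields[1]}");
    }
}
```

The second field is the label, and the empty eleventh field stands for a missing cabin number.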
I will build a binary classification machine learning model that reads in these columns and then predicts, for each passenger, whether he or she survived.
Let’s get started. Here’s how to set up a new console project in .NET Core:
$ dotnet new console -o Titanic
$ cd Titanic
Next, I need to install NuGet packages:
$ dotnet add package Microsoft.ML
$ dotnet add package BetterConsoleTables
Microsoft.ML is the main ML.NET library, and BetterConsoleTables is a library that lets me output nice-looking tables to the console.
Now I’m ready to start coding. I will start by loading the training data in memory.
I will modify the Program.cs file like this:
This code uses the CreateTextLoader method to create a CSV data loader. The TextLoader.Options class describes how to load each field. Then I call the text loader’s Load method twice to load the training and test data in memory.
Note how I’m loading the age column as a string and not a number. I’m doing this because ML.NET expects missing data in CSV files to appear as a ‘?’. Unfortunately my Titanic file uses a ‘0’ to indicate an unknown age.
So the first thing I need to do is replace all ‘0’ age occurrences with ‘?’.
I’ll need to declare two helper classes for this:
I’ll use the FromAge and ToAge classes in the next step to transform the data.
Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
I’m not interested in the Name, Cabin, and Ticket columns. The data in those columns will not help the machine learning model to predict if the passenger survived or perished. So I’ll drop them from my dataset with the DropColumns component.
Then I’ll add a CustomMapping component which will convert my unknown age values to ‘?’ strings:
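A minimal sketch of what the two helper classes and the mapping could look like in plain C#. The property names `RawAge` and `Age` are assumptions based on the column names used in this article, and the `AgeMapping` class is hypothetical. The `Map` method has the input-object, output-object shape that a custom mapping action works with, but the sketch deliberately leaves out any ML.NET wiring.

```csharp
using System;

// Hypothetical shapes for the two helper classes (property names are assumptions).
public class FromAge
{
    public string RawAge { get; set; } = "";
}

public class ToAge
{
    public string Age { get; set; } = "";
}

public static class AgeMapping
{
    // Copy the raw age across, swapping the '0' placeholder for '?'.
    public static void Map(FromAge input, ToAge output)
    {
        output.Age = input.RawAge == "0" ? "?" : input.RawAge;
    }

    public static void Main()
    {
        var known = new ToAge();
        Map(new FromAge { RawAge = "22" }, known);

        var unknown = new ToAge();
        Map(new FromAge { RawAge = "0" }, unknown);

        Console.WriteLine($"{known.Age} {unknown.Age}"); // prints "22 ?"
    }
}
```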
Now ML.NET is happy with the age values. I’ll convert the string ages to numeric values and instruct ML.NET to replace missing values with the mean age over the entire dataset:
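To illustrate what this step does to the data, here is a plain-C# sketch of the same idea: parse the age strings, treat ‘?’ as a missing value (NaN), and fill the gaps with the mean of the known ages. The `AgeImputation` class is hypothetical and only mimics the behaviour described above; it is not ML.NET’s implementation.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class AgeImputation
{
    // Parse age strings; '?' (or anything else that is not a number) becomes NaN.
    public static float[] Parse(string[] ages) =>
        ages.Select(a => float.TryParse(a, NumberStyles.Float,
                                        CultureInfo.InvariantCulture, out var v)
                         ? v
                         : float.NaN)
            .ToArray();

    // Replace every NaN with the mean of the values that are known.
    public static float[] FillWithMean(float[] values)
    {
        float[] known = values.Where(v => !float.IsNaN(v)).ToArray();
        float mean = known.Length > 0 ? known.Average() : 0f;
        return values.Select(v => float.IsNaN(v) ? mean : v).ToArray();
    }

    public static void Main()
    {
        float[] filled = FillWithMean(Parse(new[] { "20", "?", "40" }));
        Console.WriteLine(string.Join(", ", filled)); // prints "20, 30, 40"
    }
}
```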
The Sex and Embarked columns are enumerations of string values. My machine learning model can deal with that, but I need to one-hot encode them first:
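For readers new to one-hot encoding, here is a minimal plain-C# sketch of the technique. Each distinct string first gets a numeric key, and each value then becomes a vector with a single 1 at the position of its key. The `OneHotSketch` class is hypothetical and only illustrates the idea; ML.NET’s own encoder may order its keys differently.

```csharp
using System;
using System.Collections.Generic;

public static class OneHotSketch
{
    // Assign each distinct value a key, in order of first appearance.
    public static Dictionary<string, int> BuildKeys(IEnumerable<string> values)
    {
        var keys = new Dictionary<string, int>();
        foreach (string value in values)
        {
            if (!keys.ContainsKey(value))
            {
                keys.Add(value, keys.Count);
            }
        }
        return keys;
    }

    // Encode one value as a vector with a single 1 at its key position.
    // A value that was never seen encodes as all zeros.
    public static float[] Encode(string value, Dictionary<string, int> keys)
    {
        var vector = new float[keys.Count];
        if (keys.TryGetValue(value, out int index))
        {
            vector[index] = 1f;
        }
        return vector;
    }

    public static void Main()
    {
        var keys = BuildKeys(new[] { "S", "C", "S", "Q" });
        Console.WriteLine(string.Join(" ", Encode("C", keys))); // prints "0 1 0"
    }
}
```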
And finally, I’ll concatenate all remaining columns into a single feature column (a requirement for ML.NET training) and attach a Fast Tree Learner to train the model:
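Conceptually, the concatenation step just lays the numeric columns and the one-hot vectors end to end in one flat vector per passenger. A minimal plain-C# sketch of that idea follows; the `FeatureVectorSketch` class, the column order, and the sample values are assumptions for illustration.

```csharp
using System;
using System.Linq;

public static class FeatureVectorSketch
{
    // Flatten several numeric parts into one feature vector, preserving order.
    public static float[] Concatenate(params float[][] parts) =>
        parts.SelectMany(part => part).ToArray();

    public static void Main()
    {
        // Made-up passenger: class, age, siblings/spouses, parents/children, fare.
        float[] numeric = { 1f, 38f, 0f, 0f, 70f };
        float[] sex = { 1f, 0f };          // one-hot vector, two categories
        float[] embarked = { 1f, 0f, 0f }; // one-hot vector, three categories

        float[] features = Concatenate(numeric, sex, embarked);
        Console.WriteLine(features.Length); // prints "10"
    }
}
```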
The FastTreeBinaryClassificationTrainer is a very nice training algorithm that uses gradient boosting, a machine learning technique that builds an ensemble of decision trees in which each new tree corrects the errors of the trees before it.
Now let’s take a look inside the ML.NET training pipeline. What are the final data columns? What does the actual data look like?
I’m going to add a helper method to the Program class that can write the contents of the pipeline to the console. Check this out:
This code uses the BetterConsoleTables library to write the pipeline to the console in a nice format. And there’s some extra code to display sparse and dense vectors in a compact format so they don’t mess up the layout of the table.
With this method in place, add the following line to your Main method:
Run the code. Here’s what you’ll see:
Note that there are four Age columns: the original RawAge column, the new Age column with ‘?’ values, the Age column with numbers (and NaN for missing values), and the final Age column with missing values replaced with the mean age.
There are also three Sex columns: one with the original data, one with numbers representing each unique gender value, and one with the one-hot encoded vectors.
And the same goes for the Embarked column.
The last three columns are added by the FastTree learner. The model prediction is in the PredictedLabel column, and there’s also a Score and the Probability that the passenger survived.
So all I need to do now is train the model on the entire dataset, compare the predictions with the labels, and compute a bunch of metrics that describe how accurate my model is:
This code calls Fit to train the model on the entire dataset, Transform to generate a prediction for each passenger, and Evaluate to compare these predictions to the labels and automatically calculate all evaluation metrics:
- Accuracy: this is the number of correct predictions divided by the total number of predictions.
- AUC: the area under the ROC curve, which indicates how well the model ranks positive cases above negative ones: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
- AUCPRC: the area under the precision-recall curve, an alternate AUC metric that is more informative for heavily imbalanced datasets with many more negative results than positive.
- F1Score: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive.
- LogLoss: this is a metric that expresses the size of the error in the probabilities the model predicts. A log loss of zero means every prediction is correct with full confidence, and the loss value rises as the model makes more, or more confident, mistakes.
- LogLossReduction: this metric is also called the Reduction in Information Gain (RIG). It expresses how much the model’s log loss improves on that of a baseline that always predicts the overall fraction of positive cases. A value of 0 means no improvement over the baseline, and values closer to 1 are better.
- PositivePrecision: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
- PositiveRecall: also called ‘Recall’, this is the fraction of all actual positive cases that the model correctly predicts as positive. This is a good metric to use when the cost of a false negative is high.
- NegativePrecision: this is the fraction of negative predictions that are correct.
- NegativeRecall: this is the fraction of all actual negative cases that the model correctly predicts as negative.
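The count-based metrics in the list above can be made concrete with a minimal plain-C# sketch. It tallies true and false positives and negatives from a handful of made-up labels and predictions, then derives accuracy, precision, recall, and F1 from those counts. The `BinaryMetricsSketch` class is hypothetical; ML.NET computes these values for you.

```csharp
using System;

public static class BinaryMetricsSketch
{
    // Tally the four cells of the confusion matrix.
    public static (int Tp, int Fp, int Tn, int Fn) Count(bool[] labels, bool[] predicted)
    {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < labels.Length; i++)
        {
            if (predicted[i] && labels[i]) tp++;        // predicted survived, did survive
            else if (predicted[i] && !labels[i]) fp++;  // predicted survived, perished
            else if (!predicted[i] && !labels[i]) tn++; // predicted perished, did perish
            else fn++;                                  // predicted perished, survived
        }
        return (tp, fp, tn, fn);
    }

    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // Fraction of predictions of one class that are correct.
    // Yields NaN if there are no such predictions.
    public static double Precision(int correct, int wrong) =>
        (double)correct / (correct + wrong);

    // Fraction of actual cases of one class that were found.
    public static double Recall(int found, int missed) =>
        (double)found / (found + missed);

    // Harmonic mean of precision and recall.
    public static double F1(double precision, double recall) =>
        2 * precision * recall / (precision + recall);

    public static void Main()
    {
        // Made-up data: 3 survivors and 5 casualties.
        bool[] labels    = { true, true, true,  false, false, false, false, false };
        bool[] predicted = { true, true, false, true,  false, false, false, false };

        var (tp, fp, tn, fn) = Count(labels, predicted);
        double positivePrecision = Precision(tp, fp);
        double positiveRecall = Recall(tp, fn);

        Console.WriteLine($"Accuracy:          {Accuracy(tp, fp, tn, fn):F3}");
        Console.WriteLine($"PositivePrecision: {positivePrecision:F3}");
        Console.WriteLine($"PositiveRecall:    {positiveRecall:F3}");
        Console.WriteLine($"NegativePrecision: {Precision(tn, fn):F3}");
        Console.WriteLine($"NegativeRecall:    {Recall(tn, fp):F3}");
        Console.WriteLine($"F1Score:           {F1(positivePrecision, positiveRecall):F3}");
    }
}
```

In this toy example the model gets 6 of 8 passengers right, so the accuracy is 0.75, while precision and recall for the positive class are both 2/3.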
I’m looking at a historic disaster, which means the cost of false positives and false negatives is about equal. So I can safely use the Accuracy metric to evaluate my model.
The data set also has a somewhat balanced distribution of positive and negative labels, so there’s no need to use the AUCPRC or F1Score metrics.
So I will focus on Accuracy and AUC to evaluate this model.
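To show what AUC actually measures, here is a minimal plain-C# sketch. It computes AUC as the fraction of (survivor, casualty) pairs in which the survivor received the higher score, counting ties as half. It also computes a log loss for comparison, using the natural logarithm as an assumption (libraries differ in the base they use). The `RankingMetricsSketch` class and the sample scores are hypothetical.

```csharp
using System;

public static class RankingMetricsSketch
{
    // AUC as the probability that a random positive case is scored
    // higher than a random negative case (ties count as one half).
    public static double Auc(bool[] labels, double[] scores)
    {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < labels.Length; i++)
        {
            if (!labels[i]) continue;         // i must be a positive case
            for (int j = 0; j < labels.Length; j++)
            {
                if (labels[j]) continue;      // j must be a negative case
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    // Mean negative log of the probability assigned to the true class.
    // Probabilities are clamped to avoid taking the log of zero.
    public static double LogLoss(bool[] labels, double[] probabilities)
    {
        double sum = 0;
        for (int i = 0; i < labels.Length; i++)
        {
            double p = labels[i] ? probabilities[i] : 1.0 - probabilities[i];
            sum -= Math.Log(Math.Max(p, 1e-15));
        }
        return sum / labels.Length;
    }

    public static void Main()
    {
        bool[] labels = { true, false, true, false };
        double[] probabilities = { 0.9, 0.8, 0.7, 0.1 };

        Console.WriteLine($"AUC:     {Auc(labels, probabilities):F3}");
        Console.WriteLine($"LogLoss: {LogLoss(labels, probabilities):F3}");
    }
}
```

Three of the four survivor-casualty pairs are ranked correctly here, so the AUC is 0.75. The pairwise loop is quadratic in the number of passengers, which is fine for a dataset of this size.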
To wrap up, I’m going to create a new passenger and ask the model to make a prediction.
I’m going to take a trip on the Titanic. I embarked in Southampton and paid $70 for a first-class cabin. I travelled on my own without parents, children, or my spouse. What are my odds of surviving?
Before I can make a prediction, I need to set up two classes: one to hold a passenger record, and one to hold a passenger prediction:
Next, I’ll set up my data in a new passenger record, make the prediction, and output the results:
I use the CreatePredictionEngine method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once my prediction engine is set up, I can simply call Predict(…) to make a single prediction.
So how did I do? Would I have survived the Titanic disaster?
Here’s the code running in the Visual Studio Code debugger on my Mac:
… and in a zsh shell:
I’m getting an accuracy of 83.8%. It means that for every 100 Titanic passengers, my model is able to predict about 84 of them correctly. That’s not bad at all.
I also get an AUC value of 0.8829. This is great: it means my model has good (almost excellent) predictive ability.
And I’m happy to learn that I survived the Titanic disaster. My model predicts that I had an 84.22% chance of making it off the ship alive. It’s probably because I booked a first-class cabin and travelled alone.
So what do you think?
Are you ready to start writing C# machine learning apps with ML.NET?