Improving your submission -- Kaggle Competitions

1: Improving Our Features

In the last mission, we made our first submission to Titanic: Machine Learning from Disaster, a machine learning competition on Kaggle.

Our submission wasn't very high-scoring, though. There are three main ways we can improve it:

  • Use a better machine learning algorithm.
  • Generate better features.
  • Combine multiple machine learning algorithms.

In this mission, we'll do all three. First, we'll try an algorithm other than logistic regression -- random forests.

2: Random Forest Introduction

As we alluded to in the previous mission, decision trees can pick up nonlinear tendencies in the data. Here's an example:

Age    Sex    Survived
5      0      1
30     1      0
70     0      1
20     0      1

As you can see, there isn't a linear relationship between Age and Survived -- the 30-year-old didn't survive, but the 70-year-old and the 20-year-old did.

We can instead build a decision tree to model the relationship between Age, Sex, and Survived. You've probably seen decision trees or flowcharts before, and the decision tree algorithm isn't conceptually different. We start with all of our data rows at the root of the tree, then make splits until the rows in the leaves can be classified accurately. Here's an example:

[Figure: a decision tree that splits first on Age, then on Sex]

In the above diagram, we take our initial data, and:

  • Make an initial split. Any row where Age is over 29 goes to the right, and any row where Age is 29 or under goes to the left.
  • The left group all survived, so we make it a leaf node, and assign the Survived outcome 1.
  • The right group didn't all have the same outcome, so we split again, based on the Sex column.
  • We end up with two leaf nodes on the right side, one where everyone survived, and one where everyone didn't.

We could use this decision tree to figure out the survival outcome of a new row:

Age    Sex    Survived
40     0      ?

Based on our tree, we would first split to the right, then split to the left. We would predict that the person in the above row survived (1).
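
To make this concrete, here's a minimal sketch that fits sklearn's DecisionTreeClassifier on the four toy rows above and predicts the new row. Note that sklearn may choose a different split order than our diagram -- on this tiny table, splitting on Sex alone separates the classes perfectly -- but the prediction for the new row comes out the same:

from sklearn.tree import DecisionTreeClassifier

# The four toy rows from the table above, as [Age, Sex]
X = [[5, 0], [30, 1], [70, 0], [20, 0]]
y = [1, 0, 1, 1]  # Survived

tree = DecisionTreeClassifier(random_state=1)
tree.fit(X, y)

# Predict the survival outcome of the new row (Age 40, Sex 0)
print(tree.predict([[40, 0]]))  # [1]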

Decision trees have a major flaw: they overfit to the training data. Because we build up a very "deep" decision tree in terms of splits, we end up with a lot of rules that are specific to the quirks of the training data, and not generalizable to new data sets.

This is where the random forest algorithm can help. With random forests, we build hundreds of trees with slightly randomized input data, and slightly randomized split points. Each tree in a random forest gets a random subset of the overall training data. Each split point in each tree is performed on a random subset of the potential columns to split on. By averaging the predictions of all the trees, we get a stronger overall prediction and minimize overfitting.
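
To make the idea concrete, here's a hedged sketch of a tiny hand-rolled "forest" (tiny_forest_predict is a hypothetical helper, not part of sklearn): each tree trains on a random bootstrap sample of the rows and considers a random subset of the columns at each split, and the final prediction averages the trees' votes. The RandomForestClassifier we'll use next does all of this for us.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tiny_forest_predict(X_train, y_train, X_test, n_trees=100, seed=1):
    # X_train, y_train, and X_test are assumed to be numpy arrays.
    rng = np.random.RandomState(seed)
    all_votes = []
    for i in range(n_trees):
        # Each tree sees a random bootstrap sample of the rows...
        rows = rng.choice(len(X_train), size=len(X_train), replace=True)
        # ...and considers a random subset of the columns at each split.
        tree = DecisionTreeClassifier(random_state=i, max_features="sqrt")
        tree.fit(X_train[rows], y_train[rows])
        all_votes.append(tree.predict(X_test))
    # Average the 0/1 votes across the trees and round to a final class.
    return (np.mean(all_votes, axis=0) > .5).astype(int)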

3: Implementing A Random Forest

Thankfully for us, sklearn has a nice random forest implementation already. We can use it to construct a random forest and generate cross validated predictions on our dataset.

Instructions
Make cross validated predictions for the titanic dataframe (which has already been loaded in) using 3 folds.

  • Use the random forest algorithm stored in alg to do the cross validation.
  • Use predictors to predict the Survived column. Assign the result to scores.
  • You can use the cross_validation.cross_val_score function to do this.
    • You'll need to initialize an instance of KFold like we did in the last mission, and pass it into the cv keyword argument of the cross_val_score function.

After making cross validated predictions, print out the mean of scores.
Hint
Make predictions with:

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)      
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

Run the code:

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

4: Parameter Tuning

The first, and easiest, thing we can do to improve the accuracy of the random forest is to increase the number of trees we're using. Training more trees will take more time, but because we're averaging many predictions made on different subsets of the data, more trees will increase accuracy (up to a point).

We can also tweak the min_samples_split and min_samples_leaf parameters to reduce overfitting. Because of how decision trees work, splits that go all the way down, or overly deep into the tree, end up fitting quirks in the dataset rather than true signal. Increasing min_samples_split and min_samples_leaf limits how deep the trees grow, which reduces overfitting. A less overfit model generalizes better, so it will perform better on unseen data (like our cross validation folds), though worse on the data it was trained on.

Instructions
We've changed the parameters used when we initialize alg. We'll need to re-run our model now:

  • Make cross validated predictions for the titanic dataframe using 3 folds.
  • Use predictors to predict the Survived column and assign the result to scores.
  • After making cross validated predictions, print out the mean of scores.

Hint
The parameters given in alg are different.

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

5: Generating New Features

We can also generate new features. Here are some ideas:

  • The length of the name -- this could reflect how wealthy the person was, and therefore their status on the Titanic.
  • The total number of people in a family (SibSp + Parch).

An easy way to generate features is to use the .apply method on pandas dataframes. This applies a function you pass in to each element in a dataframe or series. We can pass in a lambda function, which enables us to define a function inline.

To write a lambda function, you write lambda x: len(x). x will take on the value of the input that is passed in -- in this case, the passenger name. The function to the right of the colon is then applied to x, and the result returned. The .apply method takes all of these outputs and constructs a pandas series from them. We can assign this series to a dataframe column.
Instructions
This step is a demo. Play around with code or advance to the next step.

# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

6: Using The Title

We can extract each passenger's title from their name. Titles take forms like Master., Mr., and Mrs. There are a few very commonly used titles, and a "long tail" of one-off titles that only one or two passengers have.

We'll first extract the titles with a regular expression, and then map each unique title to an integer value.

We'll then have a numeric column that corresponds to the appropriate Title.
Instructions
This step is a demo. Play around with code or advance to the next step.

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles

7: Family Groups

We can also generate a feature indicating which family people are in. Because survival was likely highly dependent on your family and the people around you, this stands a good chance of being a useful feature.

To get this, we'll concatenate someone's last name with FamilySize to get a unique family id. We'll then be able to assign a code to each person based on their family id.
Instructions
This step is a demo. Play around with code or advance to the next step.

import operator

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Otherwise, get the maximum id in the mapping and add one to it
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = titanic.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
family_ids[titanic["FamilySize"] < 3] = -1

# Print the count of each unique id.
print(pandas.value_counts(family_ids))

titanic["FamilyId"] = family_ids

8: Finding The Best Features

Feature engineering is the most important part of any machine learning task, and there are lots more features we could calculate. But we also need a way to figure out which features are the best.

One way to do this is to use univariate feature selection. This essentially goes column by column, and figures out which columns correlate most closely with what we're trying to predict (Survived).

As usual, sklearn has a function that will help us with feature selection, SelectKBest. This selects the best features from the data, and allows us to specify how many it selects.
Instructions
We've updated predictors. Make cross validated predictions for the titanic dataframe using 3 folds.

  • Use predictors to predict the Survived column and assign the result to scores.

After making cross validated predictions, print out the mean of scores.
Hint
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
will make predictions.
Run the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

9: Gradient Boosting

Another method that builds on decision trees is a gradient boosting classifier. Boosting involves training decision trees one after another, and feeding the errors from one tree into the next tree. So each tree is building on all the other trees that came before it. This can lead to overfitting if we build too many trees, though. As you get above 100 trees or so, it's very easy to overfit and train to quirks in the dataset. As our dataset is extremely small, we'll limit the tree count to just 25.

Another way to limit overfitting is to limit the depth to which each tree in the gradient boosting process can be built. We'll limit the tree depth to 3 to avoid overfitting.

We'll try boosting instead of our random forest approach and see if we can improve our accuracy.
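
Here's a minimal sketch of that cross validation, assuming the titanic dataframe and predictors list from the earlier screens are still loaded:

from sklearn import cross_validation
from sklearn.ensemble import GradientBoostingClassifier

# Limit the tree count to 25 and the tree depth to 3, as discussed above.
alg = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)

kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())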

10: Ensembling

One thing we can do to improve the accuracy of our predictions is to ensemble different classifiers. Ensembling means that we generate predictions using information from a set of classifiers, instead of just one. In practice, this means that we average their predictions.

Generally, the more diverse the models we ensemble, the higher our accuracy will be. Diversity means that the models generate their results from different columns, or use a very different method to generate predictions. Ensembling a random forest classifier with a decision tree probably won't work extremely well, because they are very similar. On the other hand, ensembling a linear regression with a random forest can work very well.

One caveat with ensembling is that the classifiers we use have to be about the same in terms of accuracy. Ensembling one classifier that is much worse than another probably will make the final result worse.

In this case, we'll ensemble logistic regression trained on the most linear predictors (the ones that have a linear ordering, and some correlation to Survived), and a gradient boosted tree trained on all of the predictors.

We'll keep things simple when we ensemble -- we'll average the raw probabilities (from 0 to 1) that we get from our classifiers, and then assume that anything above .5 maps to 1, and anything at or below .5 maps to 0.
Instructions
This step is a demo. Play around with code or advance to the next step.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)

11: Matching Our Changes On The Test Set

There are a lot of things we could do to make this analysis better that we'll talk about at the end, but for now, let's make a submission.

The first step is matching all our training set changes on the test set data, like we did in the last mission. We've read the test set into titanic_test. We'll have to match our changes:

  • Generate the NameLength column, which is how long the name is.
  • Generate the FamilySize column, showing how large a family is.
  • Add in the Title column, keeping the same mapping that we had before.
  • Add in a FamilyId column, keeping the ids consistent across the train and test sets.

Instructions

  • Add the NameLength column to titanic_test.
    • Do this the same way we did it with the titanic dataframe.

Hint
You should be able to use our code from an earlier screen, with minor modifications.

# First, add the NameLength column, the same way we did for the training set.
titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))

# Next, we'll add titles to the test set.
titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))

# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Now we can add family ids.
# We'll use the same ids that we did earlier.
print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)
family_ids[titanic_test["FamilySize"] < 3] = -1
titanic_test["FamilyId"] = family_ids

12: Predicting On The Test Set

We have some better predictions now, so let's create another submission.
Instructions

  • Turn the predictions into either 0 or 1 by turning the predictions less than or equal to .5 into 0, and the predictions greater than .5 into 1.
  • Then, convert the predictions to integers using the .astype(int) method -- if you don't, Kaggle will give you a score of 0.
  • Finally, create a submission dataframe where the first column is PassengerId, and the second column is Survived (this will be the predictions).

Hint
Generate the submission dataframe with:

submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

Run the code:

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(titanic[predictors], titanic["Survived"])
    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions = predictions.astype(int)
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

13: Final Thoughts

Now, we have a submission! It should get you a score of .799 on the leaderboard. You can generate a submission file with submission.to_csv("kaggle.csv", index=False).

There's still more work you can do in feature engineering:

  • Try using features related to the cabins.
  • See if any family size features might help -- does the number of women in a family make the whole family more likely to survive?
  • Does the national origin of the passenger's name have anything to do with survival?

There's also a lot more we can do on the algorithm side:

  • Try the random forest classifier in the ensemble.
  • A support vector machine might work well with this data.
  • We could try neural networks.
  • Boosting with a different base classifier might work better.

And with ensembling methods:

  • Could majority voting be a better ensembling method than averaging probabilities?
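
For instance, here's a hedged sketch of majority voting, assuming algorithms is a list of (fitted classifier, predictor list) pairs like the one on the previous screens; an odd number of classifiers avoids ties:

import numpy as np

# Collect one row of 0/1 votes per classifier.
votes = np.array([alg.predict(titanic_test[cols].astype(float))
                  for alg, cols in algorithms])

# Predict 1 when more than half of the classifiers vote 1.
predictions = (votes.sum(axis=0) > len(algorithms) / 2).astype(int)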

This dataset is very easy to overfit on because there isn't a lot of data, so you'll be grinding for small accuracy gains. You could also try a different Kaggle competition with more data and richer features to sink your teeth into.

Hope you enjoyed this tutorial, and good luck with the machine learning competitions!

Original article: https://www.cnblogs.com/fengyubo/p/6374131.html