Last Updated on July 30, 2022 by Jay
This tutorial will explain what a decision tree regression model is, and how to create and implement a decision tree regression model in Python in just 5 steps.
We’ll use three libraries for this exercise: pandas, sklearn (installed under the package name scikit-learn), and matplotlib. To install them, type the following in the command prompt:
pip install pandas scikit-learn matplotlib
- pandas: for data wrangling work
- sklearn: library for machine learning models
- matplotlib: data visualization
Step 1 – Understanding How A Decision Tree Model Works
A decision tree is usually a binary tree consisting of a root node, decision nodes, and leaf nodes. As we can see below, it’s an upside-down tree with the root at the top and the leaves at the bottom.
Starting from the root (top) of the tree, the training data is split repeatedly using different conditions. Each decision node is a condition that splits the data in some way, and each leaf node indicates a final outcome. This terminology can sound complicated, but you’ve probably seen decision trees many times before in real life. Here is an example of a very simple decision tree that can be used to predict if you should buy a house:
A decision tree regression model builds this decision tree and then uses it to predict the outcome of a new data point. Although the above illustration is a binary (classification) tree, a decision tree can also be a regression model that predicts numerical values. Regression trees are particularly useful because they are simple to understand and can be used on non-linear data. However, if the tree becomes too large and complicated, we run the risk of overfitting. If we run into this issue, we can reduce the depth of the tree to help avoid it.
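To make the idea concrete, here is a minimal sketch of a regression tree written by hand as nested conditions. The feature names mirror the California housing data used later in this tutorial, but every threshold and leaf value is made up for illustration; a real model learns them from the training data.

```python
# A hand-written regression tree. All thresholds and leaf values are
# hypothetical; a trained model would learn them from data.

def predict_house_value(med_inc, ave_rooms):
    """Return a predicted median house value in units of $100,000."""
    if med_inc <= 3.0:            # root node: the first condition
        if ave_rooms <= 4.5:      # decision node
            return 1.2            # leaf node
        return 1.6                # leaf node
    if med_inc <= 6.0:            # decision node
        return 2.4                # leaf node
    return 4.1                    # leaf node


# Each call walks from the root down to exactly one leaf.
print(predict_house_value(med_inc=2.5, ave_rooms=4.0))
print(predict_house_value(med_inc=8.0, ave_rooms=6.0))
```

Every prediction is one of a fixed set of leaf values, which is why a deeper tree (more leaves) can fit the training data more closely.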
Step 2 – Getting The Data
We’ll be using one of the datasets available through sklearn: the California housing data. We don’t need to find the file ourselves; we load it with a function from sklearn, which downloads the data the first time it is called and caches it locally.
This dataset was derived from the 1990 US census. Each row represents a census block group, which is the smallest geographical unit for which the US Census Bureau publishes sample data. Each block group usually has a population of 600 to 3,000 people.
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Downloads the data on first use, then reads it from the local cache
housing_data = fetch_california_housing()
The dataset is returned as a dictionary-like object that contains the actual data plus some metadata. Let’s take a look at it.
- data – contains 8 feature values (independent variables)
- target – the target value, which is the median house value in hundreds of thousands of dollars ($100,000)
- target_names – the name of the target, which is the median house value
- feature_names – the names of the 8 features, which are:
- MedInc – median income in block group
- HouseAge – median house age
- AveRooms – the average number of rooms per household
- AveBedrms – the average number of bedrooms per household
- Population – population in the block group
- AveOccup – average number of household members
- Latitude – block group latitude
- Longitude – block group longitude
Let’s put the data into a pandas dataframe. We are going to use the X variable to represent all the features (a table) and y variable to represent the target values (an array).
X = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
y = housing_data.target
The target value that we are trying to predict is the median house value for California districts, expressed in hundreds of thousands of dollars. y contains the median house value for each block group (row) in X.
Below is what the data should look like:
Categorical vs Numerical Data
Before we can start building our model, we usually need to clean up the data. For example, we should remove any data points that have missing values, and take note of any features that are categorical rather than numerical. Luckily, this dataset is already cleaned and all numerical.
In principle, a decision tree model works with both numerical and categorical data. However, sklearn expects numerical input, so with categorical data we need to perform one-hot encoding (i.e., convert each category into a one-hot numeric array). This will be the topic for another tutorial.
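As a quick preview, one-hot encoding can be sketched with pandas. The "region" column below is hypothetical and is not part of the California housing data; it only shows what the transformation does.

```python
import pandas as pd

# Hypothetical data: "region" is a made-up categorical column.
df = pd.DataFrame({
    "MedInc": [3.2, 5.1, 2.8, 4.4],
    "region": ["coast", "inland", "coast", "valley"],
})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one 1 among the new region columns, and the resulting table is fully numerical.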
Step 3 – Splitting Data
We usually wouldn’t use all the data for training a model. Holding some data back lets us check whether the model has overfit. We should almost always split the data into two portions: a training set and a testing set.
sklearn has a function that will split the data for us. We can also specify the split percentage. The default value is 75% for training and 25% for testing. However, for this model, we will split 90% for training and 10% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
Training Set (X_train and y_train) – This is the dataset we will use to teach (train) the model how to make predictions.
Testing Set (X_test and y_test) – After we’ve trained our model, we use this dataset to test how accurate it is at predicting new data points that it hasn’t already seen in the training set. The idea is to test whether the model we built using the training set generalizes well.
The random_state = 0 argument is to make sure the result is reproducible. Otherwise, each time we run the code, we’ll get a different split. In addition, this argument serves a similar purpose in later sections of the tutorial.
Without testing data we have no way to tell whether our model has overfit the training data. Overfitting means that the model has become too good at predicting values in the training set and can’t accurately predict (generalize to) unseen new data points.
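The gap between training and testing performance can be seen in a small sketch. It uses synthetic data (an assumption made so the example runs on its own) rather than the housing data, and the variable names X_demo, y_demo, X_tr, and so on are chosen only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 3 features, and a noisy target
# that depends only on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=200)

# Hold back 10% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0
)
print(X_tr.shape, X_te.shape)

# An unrestricted tree can memorize the training set, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_score = tree.score(X_tr, y_tr)
test_score = tree.score(X_te, y_te)
print(f"train R^2: {train_score:.3f}")
print(f"test  R^2: {test_score:.3f}")
```

The training score alone would suggest a nearly perfect model; only the held-out rows reveal how well it handles data it has not seen.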
Step 4 – Building A Decision Tree Regression Model In Python
sklearn makes creating machine learning models very easy. We can create our model using the DecisionTreeRegressor constructor. For now we will keep the default hyperparameters and only set random_state so that the result is reproducible.
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=0)
This creates our decision tree regression model, and now we need to “train” it using the training data. We can do this using the model’s fit method, which is the “secret sauce” that finds the relationships between input variables and target variables.
Since we need the training data to train the model, we pass it as arguments:
model.fit(X_train, y_train)
Checking the accuracy of the model
Now that we’ve trained the model, we need to see how well it performs on the testing data. The model has a built-in method, score, that gives us the coefficient of determination (R^2). People sometimes loosely call this the accuracy, but it is not the fraction of correct predictions; it measures how much of the variation in the target values the model explains.
model.score(X_test, y_test)
0.5779699824126867
The best possible R^2 score is 1.0. A model that always predicts the mean of the target values, regardless of the feature values, gets an R^2 score of 0. Scores can also be negative when a model does worse than that. We want our model’s score to be between 0.0 and 1.0, and the closer to 1.0 the better.
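These three cases can be checked with a small hand-written version of the formula, R^2 = 1 - SS_res / SS_tot. This is a sketch for illustration with made-up numbers; the function name r2_score_manual is not part of sklearn.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])
mean_only = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])
backwards = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])

print(perfect)    # 1.0: every prediction is exact
print(mean_only)  # 0.0: no better than predicting the mean
print(backwards)  # -3.0: worse than predicting the mean
```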
As we can see, our model is mediocre at making predictions, with an R^2 of only about 0.578, so there is room for improvement. Sometimes the sklearn default parameters yield a good model, but that’s not always the case. Fortunately, we don’t have to stop here!
Step 5 – Fine Tuning The Decision Tree Regression Model in (Python) sklearn
To make our model more accurate, we can try playing around with hyperparameters.
Hyperparameters are settings of the model that we choose before training, rather than values the model learns from the data. We can specify hyperparameters by using keyword arguments in the DecisionTreeRegressor constructor.
We can play around with different inputs for each hyperparameter and see what combinations improve the model’s score. Since one of the biggest problems we can have with decision tree models is if the tree becomes too big, we can start by limiting the max depth of the tree.
model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.598388960870144
Since that’s not a great improvement, we can keep modifying the depth to see if we can make our model more accurate. After some experimenting, a depth of 10 increases the R^2 score to about 0.675:
model = DecisionTreeRegressor(max_depth=10, random_state=0)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.6751934766504792
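Rather than editing the depth by hand each time, the experiment can be written as a loop. The sketch below uses synthetic data so that it runs on its own (an assumption; the scores will differ from the housing results above), and the names X_demo, y_demo, and scores are chosen only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a non-linear, noisy target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 4))
y_demo = (
    np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=500)
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0
)

# Train one tree per depth and record its R^2 on the testing set.
# max_depth=None is the default and means the depth is unlimited.
scores = {}
for depth in [2, 5, 10, 15, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = tree.score(X_te, y_te)
    print(f"max_depth={depth}: R^2 = {scores[depth]:.3f}")

best_depth = max(scores, key=scores.get)
print("best depth in this run:", best_depth)
```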
Before we can look at the other hyperparameters, let’s quickly review how a decision tree machine learning model is built:
- Starting at the root of the tree, the training data is split several different ways using multiple different conditions
- For each of these splits there is a score that quantifies how “good” the split is. For a regression tree, a good split produces groups whose target values are close to each other, while a split that leaves each group as varied as before is not a very good split. The specific function that calculates the quality of the split is also a hyperparameter that we can specify.
- This process repeats for each internal decision node until we reach a leaf node. What constitutes a leaf node is also a hyperparameter we can specify.
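The split search described above can be sketched for a single feature in plain Python. This is a simplified illustration under the assumption that split quality is measured by the total squared error of the two resulting groups; it is not sklearn's actual implementation, and the helper names are made up.

```python
def squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the split of one feature that
    minimizes the total squared error of the two child nodes."""
    best_threshold, best_error = None, float("inf")
    unique_xs = sorted(set(xs))
    # Candidate thresholds are midpoints between neighboring values.
    for low, high in zip(unique_xs, unique_xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = squared_error(left) + squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Made-up data: small x values go with targets near 1, large with near 5.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]

threshold, error = best_split(xs, ys)
print(threshold)  # 6.5, the midpoint between the two clusters
print(round(error, 3))
```

A full tree applies the same search to every feature, keeps the best split, and then repeats the process inside each child node until a stopping rule is reached.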
Some other hyperparameters we could have modified to limit the size of the tree are:
- min_samples_split – specifies the minimum number of samples required to split an internal node. The default value is 2, so increasing this value will limit the size of the tree.
- min_samples_leaf – specifies how many samples are required to be at a leaf node. The default value is 1, so increasing this value will also limit the size of the tree.
- max_leaf_nodes – controls how many leaf nodes the model can produce. Fewer leaf nodes will help prevent overfitting.
- max_features – specifies the maximum number of features that will be considered at each split. The default value is the number of features in your dataset, and decreasing this value helps prevent overfitting.
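To see how these settings limit the tree, we can compare the depth and leaf count of a few fitted trees using the get_depth and get_n_leaves methods. The sketch uses synthetic data and arbitrary hyperparameter values, both chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a noisy, non-linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 1.0, size=300)

# Hyperparameter settings to compare (values picked arbitrarily).
settings = {
    "defaults": {},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
}

sizes = {}
for label, params in settings.items():
    tree = DecisionTreeRegressor(random_state=0, **params)
    tree.fit(X_demo, y_demo)
    sizes[label] = (tree.get_depth(), tree.get_n_leaves())
    print(f"{label}: depth={sizes[label][0]}, leaves={sizes[label][1]}")
```

With the defaults, the tree keeps splitting until it can no longer separate the training rows, while either restriction produces a much smaller tree.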
After some experimenting, we find that this set of hyperparameters yields a more accurate model:
model = DecisionTreeRegressor(max_depth=10, min_samples_split=2, min_samples_leaf=3, max_features=7, random_state=0)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.6930562246423373
Instead of testing multiple values for each parameter one by one, we can automate this process and search for an optimal score using a combination of different values for each parameter. We’ll talk about this in another tutorial.
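As a brief preview, one way to automate the search is sklearn's GridSearchCV, which tries every combination in a grid and scores each one with cross-validation. The grid values and the synthetic data below are assumptions made for this sketch, not tuned recommendations for the housing data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a noisy, non-linear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 1.0, size=300)

# Every combination of these values is tried (3 x 3 = 9 candidates).
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=3,           # 3-fold cross-validation on the data passed to fit
    scoring="r2",   # same metric that model.score reports
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```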
Another thing we can look at is feature importances, a quantitative measure of how much each feature impacts the model’s predictions. Using matplotlib and the fitted model’s feature_importances_ attribute, we can visualize which of our features matter the most.
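import matplotlib.pyplot as plt

feature_names = housing_data['feature_names']
fig, ax = plt.subplots(figsize=(10, 10))
# One horizontal bar per feature, labeled with the feature name
ax.barh(range(len(feature_names)), model.feature_importances_)
ax.set_title("Feature Importances")
ax.set_ylabel('Feature Names')
ax.set_yticks(range(len(feature_names)))
ax.set_yticklabels(feature_names)
plt.show()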
We can see that the median income is the feature that impacts the median house value the most.
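The same information can also be read as a sorted table instead of a chart. The sketch below uses synthetic data in which only the first feature drives the target, and the feature names f0, f1, and f2 are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: the target depends only on "f0".
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2"]  # hypothetical feature names
X_demo = pd.DataFrame(
    rng.uniform(0, 10, size=(300, 3)), columns=feature_names
)
y_demo = 3.0 * X_demo["f0"] + rng.normal(0, 0.5, size=300)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort, largest first.
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

The importances of a fitted tree sum to 1, so each value can be read as that feature's share of the total.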
There you have it: we just built a simple decision tree regression model using the Python sklearn library in just 5 steps.