## A Complete Guide to Decision Trees with a Step-by-Step Implementation from Scratch and Hands-On Example Using Scikit-Learn

- Introduction
- Decision trees for regression: the theory behind them
- From theory to practice — Decision Trees from scratch
- Hands-On Example — Implementation from scratch vs. Scikit-learn DecisionTree
- Summary
- References
- Appendix / Code

Decision Trees have been around since the 1960s. Despite being one of the simplest Machine Learning algorithms, they have proven to be highly effective in solving problems. One of their greatest advantages is their ease of interpretation, making them highly accessible to those without a technical background. In many industries, Data Scientists still have to build trust for Machine Learning use cases. Explainable baseline models like Decision Trees can help reduce the skepticism somewhat. If someone wanted to make the effort, they could even trace the branches of the learned tree and try to find patterns they already know about the problem.

On the other hand, we quickly reach the limits of simple decision trees for complex problems. Theoretically, we can model any (complex) distribution of the data with appropriately sized trees, but the models often do not generalize well enough when applied to new data — they overfit the train dataset. Yet, decision trees have always played an important role in machine learning.

Some weaknesses of Decision Trees have been gradually solved or at least mitigated over time by the progress made with Tree Ensembles. In Tree Ensembles, we do not learn one decision tree, but a whole series of trees and finally combine them into an ensemble. Nowadays we distinguish between bagging and boosting algorithms.

- In Bagging, multiple decision trees are trained on different bootstrap samples (randomly selected subsets with replacement) of the training data. Each decision tree is trained independently, and the final prediction is made by averaging the predictions of all the individual trees. The
**bagging**approach and in particular the**Random Forest**algorithm was developed by Leo Breiman. - In Boosting, decision trees are trained sequentially, where each tree is trained to correct the errors made by the previous tree. The training data is weighted, with more weight given to the misclassified samples from the previous tree.

Even if random forest still plays an important role, today it is mostly **boosting** algorithms that perform best in data science competitions and often outperform bagging methods. Among the best known boosting algorithms are **AdaBoost, XGBoost, LightGBM and CatBoost**. Since 2016, their growth in popularity has continued relentlessly.

While the concept of decision trees has been known and actively applied for several decades, boosting approaches are relatively “new” and only gradually gained importance with the release of XGBoost in 2014:

Just a few months after the initial release of the concept behind XGBoost, the Higgs Boson Challenge on Kaggle was won with it. XGBoost is based on a number of concepts that add up to an extremely effective algorithm. The core of XGBoost is of course, the principle of gradient boosting, but XGBoost is much more. XGBoost includes various optimization techniques, which makes XGBoost extremely efficient and fast in the training process. Especially for small to medium sized structured datasets, gradient boosting framerworks like XGBoost, LightGBM and CatBoost continue to play a mayor role.

It is not just my opinion. A good indicator are Kaggle competitions and their winning solutions.

In the article “The State of Competitive Machine Learning”, mlcontests.com evaluated more than 200 data competitions in 2022 on Kaggle and other competition platforms. According to the report, gradient-boosted decision trees (GBDT) are still the go-to approach for tabular data use cases in 2022 and could manage to win most of the competitions in this area. (Carlens, n.d.)

Apart from the good performance that gradient boosting algorithms are showing again and again, the biggest advantage of decision trees or tree ensembles is speed. In general, gradient-boosting frameworks are faster in training compared to NNs, which can be an important factor in many real-world problems.

Often the data set is not clear at the beginning of an ML project. A major part of the work is the compilation of the dataset and extraction of relevant features. If we change the dataset, add a new column, or just slightly change the way we convert categorical values to numerical ones, we need a measure of whether we have improved or worsened the overall process by doing so. In this process, we may train models several hundred times. A faster training time can thus decisively influence the time for the entire development process of ML use cases.

The following figure shows the individual steps along the ML pipeline. If we change one small thing in the process before we train the model, we have to re-evaluate the whole process and the resulting models.

**Content of the article:**

This article is intended to lay the foundation to dive into various types of tree-based ensemble algorithms which are all based on Decision Trees. The concept of decision trees is very intuitive and easy to understand. At first glance somewhat more complex are XGBoost, CatBoostc and LightGBM. But if you take a closer look, XGBoost is just a combination of different concepts, which are again easy to understand each by itself.

Once you understand the random forest and gradient boosting frameworks, you will be able to solve a wide range of data science problems. From classifications to regression to anomaly detection.

It’s kind of absurd that knowledge about Deep Learning frameworks like Pytorch, TensorFlow, and Co plays such a central role in almost every Data Science job posting. In many areas, you will spend most of your time collecting data, preparing data, and extracting features. If you have the right feature set, the model creation itself is often quite straightforward. If you deal mainly with tabular data, you will probably get quite far with bagging and boosting algorithms.

If you want to download the code used in the article all at once and use it for reference, you can find the code snippets used on github. You can also find the code for the decision tree algorithm that we will build in this article in the appendix, at the bottom of this article.

Decision trees are among the simplest machine learning algorithms. The way they work is relatively easy to explain.

We, as humans, try to solve complex problems by breaking them down into relatively simple yes or no decisions. When we want to buy a new car, we browse all the car websites we can find. After a while, we get a feeling for how much a certain car make and model should cost. We get a feeling for how big the cost difference is between luxury brands and cheaper manufacturers and how much more we have to pay for a 150 hp engine compared to a smaller 100 hp engine and so on.

Step by step, our brain memorizes certain ballpark values for certain combinations of features. When we then stop at a car dealer and go through the features of a car one by one, it is as if we are moving down a decision tree until we arrive at what we think is a fair price.

*Note: **Before we go into how Decision trees are built, it is important to mention that there are different Decision Tree algorithms. Some popular ones are ID3, C4.5, C5.0 and CART (Google Developers, 2017). Scikit-learns implementation is based on CART, which was first published in 1984 by Leo Breiman et al. Leo Breiman was an American statistician who shaped the approach of “bagging”, developed the random forest and thus contributed significantly to the further development of tree-based algorithms.*

**How do we start build the tree?**

To start the Decision Tree building process, we need to answer three questions:

**Which feature do we start with?**– We split the data set at each node along one dimension. For the example, we use the feature x_1 for splitting. Since we don’t want to choose just random features, we search in advance for the feature where splitting the data set offers the greatest added value. (In this context we often speak about the so-called information gain. We can define the information gain in different ways, in regression we often use the*squared error*).**What is the best threshold to split the data?**– Similar to the first step, when we choose a feature, we still need to know the threshold we want to use to split the data set. So in other words, at what position along the dimension we want to split the data set.**When do we stop splitting the data set?**– If we do not stop the splitting process at some point, the decision tree will continue until there is only one sample point in each leaf node. To avoid overfitting, we need some criteria to determine how far to split the data set and when to stop the splitting process so that the model does not become unnecessarily complex.

Let’s use the example of price prediction for cars again. First, we need to select one of the features and split the data set. We choose a feature and a threshold and split the dataset into a left and a right part and calculate the average price. That gives us a first node. If we stopped now, we would have a minimalistic decision tree with only one level – a so-called decision stump.

However, we do not want to start with a random split, but with the “best possible” split.

**But how is the “best” split defined?**

We need to define a metrics, that helps us to evaluate how good a split performs.

A often-used loss function in regression problems to assess how well a model performs is the **mean absolute error** or the **mean squared error**. Usually, we can choose between different metrics. To get an idea how scikit-learn is calculating the performance of each split, we can simply have a look into the documentation or directly in the source code.

The easiest way to access the source code is via the code editor.

If you have not yet installed scikit, you can do so with pip via:

`pip install scikit-learn`

I use Visual Studio Code for most of my projects. If I create a new notebook or Python file and import the corresponding modules, Visual Studio provides us a direct link to the source code behind it. In the picture on the left side, you can see how the whole thing looks like in Visual Studio Code.

- Create a new file, in my case “
**tree_algorithms.py**” and import the regression tree module “**sklearn.tree.DecisionTreeRegressor**“. - By pressing “
**Ctrl**” and clicking on the according module you will be redirected directly to the corresponding part of the source code.

Alternatively, you can find the source code in the documentation of scikit-learn. On the right you can see how the whole thing looks on *scikit-learn.org*. Each class and function has also a link to the source code on Github.

If we dive into the source code of DecisionTreeRegressor, we see that it is defined as a class that is initiated with the following values:

`def __init__(`

self,

*,

criterion="squared_error",

splitter="best",

max_depth=None,

min_samples_split=2,

min_samples_leaf=1,

min_weight_fraction_leaf=0.0,

max_features=None,

random_state=None,

max_leaf_nodes=None,

min_impurity_decrease=0.0,

ccp_alpha=0.0,):

We will gradually go into what the individual hyperparameters do. What we are interested in for the start is the splitting criterion, i.e. how the decision tree determines how it splits the data set during building.

The code also contains a short description of which criterias we can select.

Scikit-learn lets us choose between:

- squared_error
- friedmann_mse
- aboslute_error
- poisson

The default value is “squared_error”. The documentation describes it as follows:

“squared_error” is equal to variance reduction and minimizes the L2 loss using the mean of each terminal node

So we are trying to minimize the *Mean Squared Error *in the terminal nodes.

Let’s imagine we have a simple two-dimensional dataset with only x_1 as the single input feature and y as the target variable. For this simple example, we don’t need to decide which feature is best to split the dataset because we only have one in the dataset. So at the root node, we use x_1 to split the dataset in half.

In the following figure, you can find a simple 2 dimensional dataset. The two halves of the dataset are our child nodes. At the moment we perform the first split, the two child nodes are the leaf nodes or terminal nodes (nodes that are not further split).

In the case shown, we divide the data set at x_1=2.

**What value would the tree now predict if we used it to make prediction?**

We have to define a value for each terminal node, which then represents the possible prediction values of the decision tree. We calculate this prediction value y_pred in the simplest way, we calculate the average of the data points in the left node (**here: **y_pred_left = 9) and the right node (**here:** y_pred_right = 5).

**How do I find the best threshold for splitting the data set?**

In the example shown, we have chosen x_1 = 2 as the threshold. But is this the optimal scenario?

To evaluate the performance of the split, we calculate the residuals, i.e., the difference between y_predict and the y for each sample. More precisely, we calculate the L2 loss function, the sum of squared residuals.

To get a value for the performance of the stump, we calculate the deviations (l2 loss) for both sides separately and then calculate a weighted overall loss by including the number of samples in both halves.

We do that over and over again, for different thresholds (see image). In our example, the weighted squared error gets minimal when we choose x_1 = 5 as the splitting threshold:

**How does our algorithm find the smallest error?**

The decision tree does this in a very simple way, it defines an iterative approach, that tries different values as thresholds. Therefore, we define a list of possible thresholds/splitting values and calculate the mean squared error for each of the possible thresholds in the list.

**Step 1 – Define a list with possible thresholds for the splitting:**We are defining all possible splitting values by sorting the values and calculating the rolling average. So if x_1 = 2 and x_2 = 3 then the first threshold in the list of possible thresholds is 2.5.**Step 2:**In the next step, we need to find the threshold that minimizes the squared error when building a node. We start iterating over all thresholds, splitting the data set, and calculating the MSE for the right and left nodes.

Lets try it with a real world data set.

To demonstrate the steps just described on a real data set, we download the Automobile Data Set from UCIMachineLearning. The dataset includes a bunch of attributes like horsepower, dimensions, fuel type etc. which describe a car in detail. The target variable we are interested in is the price. (*UCI Machine Learning Repository: Automobile Data Set*, n.d.) *[License: **CC0: Public Domain**]*

`def load_auto_data_set():`# Load the automobile data set from UCI.edu

url = '<https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data>'

df = pd.read_csv(url, header=None)

# Name columns

df.columns = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style', 'drive_wheels', 'engine_location','wheel_base','length','width','height', 'curb_weight','engine_type','num_cylinders','engine_size','fuel_system','bore','stroke', 'compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg','price']

# Filter for lines where power and price are available

df = df[(df.horsepower != '?')]

df = df[(df.price != '?')]

# Filter for lines where power and price are available

df['horsepower'] = df['horsepower'].astype(int)

df['price'] = df['price'].astype(int)

# Define the last column of the data frame as y and the rest as X

self.y = self.df.iloc[:, -1]

self.X = self.df.iloc[:, :-1]

return df, X, y

Afterwards, we perform exactly the steps just described. The following code snippet uses the selected feature **selected_feature **and the defined threshold **threshold **to split the data set (X_parent, y_parent).

The plot shows the samples of the left and right child nodes and the average of the observations. If we stopped now, the child nodes would be the leaf nodes of the tree and the predicted value of the tree would be represented by the calculated mean of the two halves.

`class NodePlot():`def __init__(self, X_parent, y_parent, threshold, selected_feature):

self.selected_feature = selected_feature

self.x_column = X_parent[self.selected_feature]

self.y_parent = y_parent

self.data_set = np.column_stack((self.x_column, y_parent))

self.threshold = threshold

# define a list with all observations of the left and right leaf

self.left_y = self.data_set[self.data_set[:, 0]<self.threshold][:, 1]

self.left_x = self.data_set[self.data_set[:, 0]<self.threshold][:, 0]

self.right_y = self.data_set[self.data_set[:, 0]>=self.threshold][:, 1]

self.right_x = self.data_set[self.data_set[:, 0]>=self.threshold][:, 0]

# calculate the mean of the observations for the left and right leaf

self.parent_y_mean = np.mean(self.y_parent)

self.left_y_mean = np.mean(self.left_y)

self.right_y_mean = np.mean(self.right_y)

# calculate the weighted mean squared error

self.parent_mse = np.mean((y_parent - self.parent_y_mean)**2)

mse_l = np.mean((self.left_y - self.left_y_mean)**2)

mse_r = np.mean((self.right_y - self.right_y_mean)**2)

# calculate the number of instances in the parent and child nodes

n_l = len(self.left_y)

n_r = len(self.right_y)

n = len(self.data_set)

# calculate the weighted mse for child nodes

self.child_mse = (n_l/n) * mse_l + (n_r/n) * mse_r

def plot_split(self):

plt.rcParams['font.size'] = '16'

sns.set_style("darkgrid", {"axes.facecolor": ".9"})

fig = go.Figure()

fig.add_trace(

go.Scatter(

x=self.left_x,

y=self.left_y,

mode="markers",

name="Data set: left node",

line=dict(color="grey")

)

)

fig.add_trace(

go.Scatter(

x=self.left_x,

y=np.linspace(self.left_y_mean, self.left_y_mean, len(self.left_x)),

mode="lines",

name="Right node prediction",

line=dict(color="black")

)

)

# create go.scatter plot with black line

fig.add_trace(

go.Scatter(

x=self.right_x,

y=self.right_y,

mode="markers",

name="Data set: right node",

#line=dict(color="#ffe476")

line=dict(color="black")

)

)

fig.add_trace(

go.Scatter(

x=self.right_x,

y=np.linspace(self.right_y_mean, self.right_y_mean, len(self.right_x)),

mode="lines",

name="Left node prediction",

line=dict(color="black", dash='dot')

)

)

fig.add_trace(

go.Scatter(

x=[self.threshold, self.threshold],

y=[min(self.y_parent), max(self.y_parent)],

mode="lines",

name="MSE of parent node",

line=dict(color="black", dash='dashdot')

)

)

# update title in go.Figure

fig.update_layout(title="Data set", xaxis_title=self.selected_feature, yaxis_title=self.y_parent.name)

fig.show()

Since we don’t want to split the dataset anywhere, but at the “best” point, we do this iteratively as described above. We use the node plot class to calculate the residuals for a number of thresholds.

`selected_feature = "horsepower"`list_of_mse_childs = []

list_of_mse_parent = []

thresholds = X.sort_values(by=["horsepower"])["horsepower"].unique()

for threshold in thresholds:

NodePlot = helper_functions.NodePlot(

X_parent = X,

y_parent = y,

threshold = threshold,

selected_feature = "horsepower"

)

list_of_mse_childs.append(NodePlot.child_mse)

list_of_mse_parent.append(NodePlot.parent_mse)

def plot_threshold_evaluation(thresholds, mse_parent_list, mse_list):

# create figure

fig = go.Figure()

fig.add_trace(

go.Scatter(

x=thresholds,

y=mse_list,

mode="lines",

name="MSE after split",

line=dict(color="black")

)

)

fig.add_trace(

go.Scatter(

x=thresholds,

y=mse_parent_list,

mode="lines",

name="MSE of parent node",

line=dict(color="black", dash='dot')

)

)

fig.add_trace(

go.Scatter(

x=[threshold,threshold],

y=[min(mse_list), max(mse_list)],

mode="lines",

name="Chosen threshold",

line=dict(color="black", dash='dashdot')

)

)

# update title in go.Figure

fig.update_layout(title="Evaluate", yaxis_title='MSE')

fig.show()

return fig

# plot the just calculated MSE values for different thresholds

plot_threshold_evaluation(

thresholds = thresholds,

mse_parent_list = list_of_mse_parent,

mse_list = list_of_mse_childs,

threshold = 100

)

We get the squared error of the parent and child nodes for each possible split. For decision trees, we are searching for the maximal** information gain**. Using the squared error as a splitting criterion, we calculate the information gain simply as the difference between the MSE of the parent node and the weighted MSE of the child nodes.

In the chart, the Information Gain maximum lies at 120 hs.

In the Decision Tree, we perform the steps just described again and again.

*Note: The decision tree is counted among the greedy algorithms. Greedy algorithms gradually select the sequential state that promises the best result during selection. Previous and subsequent decisions are not taken into account. when making this decision.*

**When does the procedure end?**

Decision trees do not make many assumptions while training a model. Linear Regression, for example, is just the opposite, while the linear regression algorithm trains a model, it allows only one possible shape of the model, a straight line or a planar plane in space. Thus, when we use Linear Regression as a learning algorithm, we directly make the assumption that our problem follows a linear behavior.

Decision trees, on the other hand, are very flexible in their learning process. Such models are called “nonparametric models”. Models are called non-parametric when their number of parameters is not determined in advance. Linear regression has a well-defined number of parameters, the slope and the offset. This significantly limits the degree of freedom in the training process. (Géron, 2022)

Decision trees thus tend to overfit. To avoid that, we need to introduce hyperparameters that limit the freedom of the training process, so-called regularization hyperparameters.

A frequently used regularization parameter is **max_depth**, i.e. the maximum depth of the tree. Others are:

**min_samples_split**(the minimum number of samples a node needs to be split)**min_samples_leaf**(the minimum number of samples each leaf node must have)**min_weight_fraction_leaf**(similar to min_samples_leaf, but instead of a number we define a fraction of the whole dataset)**max_leaf_nodes**(maximum number of leaf nodes)**max_features**(the maximum number of features evaluated at each split).

After the actual model building, we can still prune the tree to avoid unnecessary complexity of the model.

**What is pruning?**

This technique involves growing the tree to its full size and then removing branches or subtrees that do not improve the accuracy of the tree on a validation dataset. This is done by calculating the change in error before and after pruning a subtree and comparing it to a threshold value. If the change in error is not significant, the subtree is pruned. I don’t want to go further into this for the moment, as I will leave it out for the following simple example.

In the following, I’ll show you how to build a basic version of a regression tree from scratch.

To be able to use the regression tree in a flexible way, we put the code into a new module. We create a new Python file, where we put all the code concerning our algorithm and the learning process. In it, we define a new class called “RegressionTree” and initialize it with the hyperparameters that serve as constraints for the training process.

As mentioned earlier, one of the biggest challenges in working with decision trees is the risk of overfitting. To mitigate this risk and ensure that our model generalizes well to new data, we introduce regularisation parameters that guide and stop the learning process at a certain point.

The regulation parameters (or stopping criteria) that we use in our simplified version of Decision Trees are the following two:

**min_samples_split**

- defines the maximum number of samples that a node needs in order to be split further. A suitable value depends on the type and size of the dataset. If chosen correctly, it can prevent overfitting. In scikit-learn, the default value is set to 2 samples.

**max_depth**

- the maximum depth determines how many levels the tree can have at most. If another stopping criterion such as
*min_sample_split*prevents further growth of the tree on all branches before this depth is reached, it may not even reach this tree size. Scikit-learn sets the default value to “None”, so the maximum depth is not limited by default.

Scikit-learn includes a few additional stopping parameters such as **min_samples_leaf, min_weighted_fraction, max_leaf_nodes, or max_features**, which are also not set by default and I will ignore for now.

A function that we need for every regressor is the fit function, which starts the training process. Input variables are a multi-dimensional array (X) with the input features. y is a one-dimensional array and describes the target variable.

In addition to our regressor (**RegressionTree**), we define a second class (**Node**) through which we set and store the parameters that each node of the tree has.

class Node():

def __init__(

self,

feature=None,

threshold=None,

left=None,

right=None,

value=None

):

self.feature = feature

self.threshold = threshold

self.left = left

self.right = right

self.value = value # is it a leave node?def is_leaf_node(self):

return self.value is not None

class RegressionTree():

def __init__(

self,

min_samples_split=2,

max_depth=100):

self.min_samples_split = min_samples_split

self.max_depth = max_depth

self.root = None

def fit(self, X, y):

self.root = self._grow_tree(X, y)

The fit function uses the helper function **_grow_tree(x, y)** to grow the tree piece by piece until one of the stopping criteria is reached.

Before splitting the node, we check if one of the stopping criteria is met. In the simplified example, we only have two stopping criteria:

**(1) depth >= self.max_depth:** Is the maximum depth of the tree reached?

**(2) n_samples < self.min_samples_split:** Is the number of samples in the node greater than **min_samples_split**?

If either condition is true, the node is a terminal node and the only thing we need to calculate is the mean (**np.mean(y)**).

If neither of the two conditions is true, we split the data set further. We first define which features we consider for the split. In our simplified case we do not limit the columns we use for the split. We use all features in X (**feat_idxs**).

For the actual split, we define another helper function **_best_split **which we pass the x and y values of the node we are looking at.

I’ll go into more detail about what **_best_split** does in a moment, but as the name implies **_best_split **returns us the “best” split, in the form of the selected feature for the split (**best_features**) and the threshold at which we split the dataset (**best_threshold**).

We use this information to actually split the dataset and store it as a node of our tree.

Before we jump out of the function we call **_grow_tree **again for both halves of the branch.

`def _grow_tree(self, X, y, depth=0):`

# check the stopping criteria

n_samples, n_feats = X.shapeif (depth>=self.max_depth or n_samples<self.min_samples_split):

leaf_value = np.mean(y)

return Node(value=leaf_value)

feat_idxs = np.random.choice(n_feats, n_feats, replace=False)

# find the best split

best_thresh, best_feature = self._best_split(X, y, feat_idxs)

# create child nodes

left_idxs, right_idxs = self._split(X[:, best_feature], best_thresh)

left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth+1)

right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth+1)

return Node(best_feature, best_thresh, left, right)

The only question that remains unanswered is how the algorithm figures out what the best split would be.

As mentioned before, we calculate a so-called information gain. In our case, we define the information gain as a reduction of the mean squared error. The errors or residuals for the node itself and the resulting child nodes is calculated as the difference between the average value of the target variable y in each node and the actual y values of the samples in the nodes.

The function goes through each feature one by one.

- We compute a set of possible thresholds for each feature as a moving average of all observations.
- Then we iterate over each threshold in the list, split the dataset and calculate a weighted mean squared error of the child nodes.
- Afterwards, we check if the calculated MSE is the smallest MSE calculated so far, if yes, we save the feature_idx and threshold as optimal. (in
**best_feature_idxs**and**best_thresholds**)

`def _best_split(self, X, y, feat_idxs):`

y_mean = np.mean(y)

residuals_y = (y - y_mean)**2

y_mse = np.mean(residuals_y)best_feature_ixd, best_threshold = None, None

lowest_mse = y_mse

for feat_idx in feat_idxs:

# define possible thresholds for the split

X_column = X[:, feat_idx]

thresholds = np.convolve(np.sort(X_column), np.ones(2)/2, mode='valid')

for threshold in thresholds:

# getting the left and right nodes

left_idxs, right_idxs = self._split(X_column, threshold)

# calculate the weighted avg. mse of children

n = len(y)

n_l, n_r = len(left_idxs), len(right_idxs)

mse_l = self._squared_error(y[left_idxs])

mse_r = self._squared_error(y[right_idxs])

child_mse = (n_l/n) * mse_l + (n_r/n) * mse_r

if lowest_mse > child_mse:

lowest_mse = child_mse

best_feature_ixd = feat_idx

best_threshold = threshold

return best_feature_ixd, best_threshold

Two functions which we have already used several times in the above sections are **_split **and **_squared_error**.

`def _split(self, X_column, split_thresh):`

left_idxs = np.argwhere(X_column <= split_thresh).flatten()

right_idxs = np.argwhere(X_column > split_thresh).flatten()

return left_idxs, right_idxsdef _squared_error(self, y):

# calculate the mean value for all observations

y_mean = np.mean(y)

# calculate the residuals to y_mean

mean_squared_error = np.mean((y - y_mean)**2)

return mean_squared_error

The only thing we still need is a **predict()** function. For this we use** _traverse_tree**.

Using a loop function we go through the just built tree one by one. If we reach a leaf node, **_traverse_tree** returns the stored node value.

`def predict(self, X):`

return np.array([self._traverse_tree(x) for x in X])def _traverse_tree(self, x, node):

if node.is_leaf_node():

return node.value

if x[node.feature] <= node.threshold:

return self._traverse_tree(x, node.left)

return self._traverse_tree(x, node.right)

That’s it, the complete *decision tree regressor* is defined as:

`import numpy as np`

from collections import Counter

from sklearn.metrics import mean_squared_error

from collections import Counterclass Node():

def __init__(

self,

feature=None,

threshold=None,

left=None,

right=None,

value=None

):

self.feature = feature

self.threshold = threshold

self.left = left

self.right = right

self.value = value # is it a leave node?

def is_leaf_node(self):

return self.value is not None

class RegressionTree():

def __init__(

self,

min_samples_split=2,

max_depth=100):

self.min_samples_split = min_samples_split

self.max_depth = max_depth

self.root = None

def fit(self, X, y):

self.root = self._grow_tree(X, y)

def _grow_tree(self, X, y, depth=0):

# check the stopping criteria

n_samples, n_feats = X.shape

if (depth>=self.max_depth or n_samples<self.min_samples_split):

leaf_value = np.mean(y)

return Node(value=leaf_value)

feat_idxs = np.random.choice(n_feats, n_feats, replace=False)

# find the best split

best_feature_ixd, best_threshold = self._best_split(X, y, feat_idxs)

# create child nodes

left_idxs, right_idxs = self._split(X[:, best_feature_ixd], best_threshold)

left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth+1)

right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth+1)

return Node(best_feature_ixd, best_threshold, left, right)

def _best_split(self, X, y, feat_idxs):

y_mean = np.mean(y)

residuals_y = (y - y_mean)**2

y_mse = np.mean(residuals_y)

best_feature_ixd, best_threshold = None, None

lowest_mse = y_mse

for feat_idx in feat_idxs:

# define possible thresholds for the split

X_column = X[:, feat_idx]

thresholds = np.convolve(np.sort(X_column), np.ones(2)/2, mode='valid')

for threshold in thresholds:

# getting the left and right nodes

left_idxs, right_idxs = self._split(X_column, threshold)

# calculate the weighted avg. mse of children

n = len(y)

n_l, n_r = len(left_idxs), len(right_idxs)

mse_l = self._squared_error(y[left_idxs])

mse_r = self._squared_error(y[right_idxs])

child_mse = (n_l/n) * mse_l + (n_r/n) * mse_r

if lowest_mse > child_mse:

lowest_mse = child_mse

best_feature_ixd = feat_idx

best_threshold = threshold

return best_feature_ixd, best_threshold

def _split(self, X_column, split_thresh):

left_idxs = np.argwhere(X_column <= split_thresh).flatten()

right_idxs = np.argwhere(X_column > split_thresh).flatten()

return left_idxs, right_idxs

def _squared_error(self, y):

# calculate the mean value for all observations

y_mean = np.mean(y)

# calculate the residuals to y_mean

mean_squared_error = np.mean((y - y_mean)**2)

return mean_squared_error

def predict(self, X):

return np.array([self._traverse_tree(x, self.root) for x in X])

def _traverse_tree(self, x, node):

if node.is_leaf_node():

return node.value

if x[node.feature] <= node.threshold:

return self._traverse_tree(x, node.left)

return self._traverse_tree(x, node.right)

**Load the data set**

For the test, we use the dataset already used as an example earlier, the automobile dataset. First, we load the dataset from uci.edu. Then we select a few attributes for the first simple test. For the following example, I choose:

- Wheel Base
- Length
- Width
- Hight
- Make

for the input vector X.

Since the attribute “make” contains strings, we transform it into numeric features using OneHot Encoding.

`import pandas as pd`

import numpy as np

from sklearn.model_selection import train_test_splitdef load_auto_data_set():

# Load the automobile data set from UCI.edu

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'

df = pd.read_csv(url, header=None)

# Name columns

df.columns = [

'symboling', 'normalized_losses', 'make',

'fuel_type', 'aspiration', 'num_doors',

'body_style', 'drive_wheels', 'engine_location',

'wheel_base','length','width','height',

'curb_weight','engine_type','num_cylinders',

'engine_size','fuel_system','bore','stroke',

'compression_ratio','horsepower','peak_rpm',

'city_mpg','highway_mpg','price'

]

# Filter for lines where power and price are available

df = df[(df.horsepower != '?')]

df = df[(df.price != '?')]

df = df.reset_index()

# Filter for lines where power and price are available

df['horsepower'] = df['horsepower'].astype(int)

df['price'] = df['price'].astype(int)

# Define the last column of the data frame as y and the rest as X

y = df.iloc[:, -1]

X = df.iloc[:, :-1]

return df, X, y

df, X, y = load_auto_data_set()

from sklearn.preprocessing import OneHotEncoder

X_selected = X[["wheel_base", "length", "width","height"]].reset_index()

# define and fit the OneHotEncoder

ohe = OneHotEncoder()

ohe.fit(df[['make']])

# transform the "make" column

make_one_hot_sklearn = pd.DataFrame(ohe.transform(df[["make"]]).toarray(), columns=ohe.categories_[0])

X = X_selected.join(make_one_hot_sklearn)

X = np.array(X)

y = np.array(y)

**Train the model**

After the input and output variables are defined, we try to run a first test with our algorithm to see if we can actually use it for predictions.

First, we split the dataset into a train and a test data set.

`import tree_algorithms`

from sklearn.metrics import mean_squared_error, mean_absolute_errorX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

regr = tree_algorithms.RegressionTree(min_samples_split=5, max_depth=20)

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

# calculate mean squared error

print(f"Mean Squared Error: {round(mean_squared_error(y_test, y_pred), 1)}")

**Compare it to the existing scikit-learn library**

For comparison, we now try the same with the “DecisionTreeRegressor” library from scikit-learn. Our trained model performs exactly the same in this case. I won’t go into whether the result is good or bad here. If you want to know how to tune a regression model and find the model with the best performance, you can find a more detailed explanation of suitable methods in one of my previous articles (here or here).

`from sklearn.datasets import load_diabetes`

from sklearn.model_selection import cross_val_score

from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_errorregr = DecisionTreeRegressor(min_samples_split=5, max_depth=20)

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

# calculate mean squared error

print(f"Mean Squared Error: {round(mean_squared_error(y_test, y_pred), 1)}")

The Decision Tree is the basis for a number of outstanding algorithms such as Random Forest, XGBoost, LightGBM and CatBoost. The concepts behind them are very intuitive and generally easy to understand, at least as long as you try to understand the individual subconcepts piece by piece. With this article, you have taken a good first step by understanding the core of every tree ensemble algorithm, the decision tree.

I plan to publish more articles about each concept that makes gradient-boosting frameworks so efficient.

Enjoyed the story?

*If you enjoyed reading and want to learn more about Machine Learning concepts, algorithms and applications, you can find a list with**all of my related articles.**If you don’t want to miss a new article, you can**subscribe for free**to get notified whenever I publish a new story.**Become a Medium member to read more stories from other writers and me. You can support me by using my**referral link**when you sign up. I’ll receive a commission at no extra cost to you.*

Feel free to reach out to me on LinkedIn if you have any questions!