Decision Tree and Random Forest: Machine Learning in Python

Divyansh Chaudhary
Published in DataDrivenInvestor · Jan 14, 2021

Decisions, decisions, decisions… we make numerous decisions every day, consciously or unconsciously, sometimes automatically with little effort and sometimes agonizing for hours over a single choice. If only there were a way to chart a path to the conclusion.

Sample Decision Tree

A Decision Tree is one of the most popular and powerful tools for Classification and Regression. As its name suggests, it is a flowchart-like tree structure, and it is this property that makes Decision Trees easy to understand and interpret.

Anatomy of a Decision Tree

Image Source: researchgate.net

Research Paper: Olden, J. D., Lawler, J. J., & Poff, N. L. (2008). Machine Learning Methods Without Tears: A Primer for Ecologists. The Quarterly Review of Biology, 83, 171–193. doi:10.1086/587826

  • Root Node: The very first node of a tree structure is known as the Root Node.
  • Parent Node: Any node that has at least one sub-node is called a Parent Node or Internal Node.
  • Child Node: Any sub-node of a given node is called a Child Node.
  • Leaf Node: Nodes which do not have any child nodes are Leaf Nodes or Terminal Nodes.

Algorithms to build a Decision Tree

There are a few ways to build a Decision Tree. The algorithm selection depends upon the type of Target Variable in the data. Some of these algorithms are:

  • ID3 — Iterative Dichotomiser 3
    The name means the algorithm repeatedly divides (dichotomises) the features into two or more groups at each step. It follows a top-down, greedy approach and is used for Classification problems with nominal features only.
  • C4.5 — Successor of ID3
    Also referred to as a Statistical Classifier, C4.5 is an extension of the ID3 Algorithm. It has a few advantages over ID3: it can handle both continuous and discrete values, it allows missing attribute values (marked as ?), and it supports pruning after the tree is built.
  • CART — Classification and Regression Tree
    The CART Algorithm is structured like a sequence of follow-up questions, and the split at each node is determined by these ‘questions’. It works for both Classification and Regression problems, and the result is a binary tree structure; a short scikit-learn sketch follows this list.
  • CHAID — Chi-Squared Automatic Interaction Detector
    The algorithm works with nominal, ordinal, and continuous data. It builds cross-tabulations of each categorical predictor against the target and keeps splitting until no statistically significant split remains.
  • MARS — Multivariate Adaptive Regression Splines
    An algorithm for Complex Non-linear Regression Problems. The algorithm discovers a set of simple piecewise Linear Functions and uses them in aggregate to make predictions. In a sense, the model can be seen as an ensemble of Linear Functions.
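
To see the CART idea in code, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier, which implements an optimized version of CART. The Iris dataset and the chosen parameters are assumptions for illustration only, not part of the original example.

```python
# A minimal CART-style tree with scikit-learn's DecisionTreeClassifier.
# The Iris dataset is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# criterion="gini" is the default splitting measure for CART-style trees
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf))  # the binary tree printed as a sequence of follow-up questions
```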

Attribute Selection Measures

With a dataset of N attributes, we need a criterion for deciding which attribute to place at the root (and at each internal node) of the tree. Researchers have devised several measures to solve this problem:

  • Entropy
    It is the Measure of Randomness. The higher the entropy, the harder it is to draw any conclusions from the information.
  • Information Gain
    A statistical property that measures how well a given attribute separates the training examples according to their target classification. The attribute with the highest Information Gain (equivalently, the lowest resulting entropy) is chosen for the split; attributes are assumed to be categorical. A small NumPy sketch after this list shows how Entropy, Information Gain, and the Gini Index are computed.
  • Gini Index
    Works with a categorical target variable and performs only binary splits. The Gini Index measures impurity (one minus the sum of squared class probabilities), so a lower value indicates higher homogeneity and the split with the lowest weighted Gini is preferred. It is generally used with the CART algorithm, which also handles continuous attributes.
  • Gain Ratio
    The C4.5 Algorithm uses Gain Ratio, a modification of Information Gain that reduces its bias toward attributes with many values and is usually the better option. It overcomes this drawback by taking into account the number of branches a split would produce before making it.
  • Reduction in Variance
    It is used for continuous targets (regression problems). The variance of the target within each candidate child node serves as the splitting criterion, and the split that reduces variance the most is chosen.
  • Chi-Square
    The CHAID algorithm uses the Chi-Square test to split features. It works with categorical target variables and can produce two or more splits. Higher Chi-Square values indicate a more statistically significant difference between a sub-node and its parent node.
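
To make these measures concrete, here is a small NumPy sketch of Entropy, the Gini Index, and Information Gain. The tiny label arrays are made up purely for illustration.

```python
# Hand-rolled versions of the attribute selection measures described above.
import numpy as np

def entropy(labels):
    """Entropy = -sum(p * log2(p)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity = 1 - sum(p^2); 0 means a perfectly pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]          # one candidate binary split
print("Entropy:", round(entropy(parent), 3))
print("Gini   :", round(gini(parent), 3))
print("Info gain of split:", round(information_gain(parent, [left, right]), 3))
```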

Implementing Decision Tree from Scratch

Note: Dataset used for this example: haberman.csv from Kaggle
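
Below is a minimal sketch of such a from-scratch tree: a CART-style classifier grown with the Gini index. The column names assigned to haberman.csv are assumptions (adjust them to match your copy of the Kaggle file), and the stopping parameters are illustrative defaults.

```python
# A minimal from-scratch decision tree (CART-style, Gini splits).
# Assumes haberman.csv has no header and the usual four columns:
# age, year of operation, positive axillary nodes, survival status.
import numpy as np
import pandas as pd

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold) with the lowest weighted Gini."""
    best = (None, None, float("inf"))
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

def build_tree(X, y, depth=0, max_depth=4, min_samples=5):
    # Stop when the node is pure, too small, or too deep -> leaf = majority class
    if len(np.unique(y)) == 1 or len(y) < min_samples or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j, t = best_split(X, y)
    if j is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    mask = X[:, j] <= t
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

def predict_one(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Column names are assumed; the Kaggle file typically has no header row.
data = pd.read_csv("haberman.csv", header=None,
                   names=["age", "year", "nodes", "status"])
X = data[["age", "year", "nodes"]].to_numpy()
y = data["status"].to_numpy()

tree = build_tree(X, y)
preds = np.array([predict_one(tree, row) for row in X])
print("Training accuracy:", (preds == y).mean())
```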

Drawbacks of Decision Tree

If no constraints are set on the growth of a Decision Tree, it tends to overfit the training data. Without constraints, the tree may ‘memorize’ the training data completely, which gives 100% accuracy on the training set but fails to deliver comparable results on new data.

There are two common ways to reduce overfitting:

  1. Pruning Decision Trees.
  2. Random Forest.

Pruning Decision Trees

Image Source: edureka.co

In Pruning, we trim the tree by removing the branches that cause it to overfit, without sacrificing accuracy. This is done by splitting the original dataset into train and test sets and pruning the tree according to the accuracy obtained on the test data.
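
As one concrete way to do this, scikit-learn supports cost-complexity (post-)pruning through the ccp_alpha parameter. The sketch below grows a full tree, computes the candidate pruning strengths, and keeps the one that scores best on held-out data; the dataset used here is only an illustrative assumption.

```python
# Cost-complexity pruning with scikit-learn: grow a full tree, then pick the
# pruning strength (ccp_alpha) that performs best on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha={best_alpha:.5f}, test accuracy={best_score:.3f}")
```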

Random Forest

Image Source: towardsdatascience.com

Random Forest is an example of ensemble learning, where we combine multiple Decision Trees to obtain better predictive performance. Random Forests are usually trained with the ‘Bagging’ (Bootstrap Aggregating) method, a meta-algorithm designed to improve the stability and accuracy of machine learning algorithms.
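
A minimal sketch of this idea with scikit-learn is shown below, comparing a single Decision Tree against a forest of 200 bagged trees; the dataset choice is an assumption for illustration.

```python
# Random Forest = bagging over many decision trees, each grown on a bootstrap
# sample of the data with a random subset of features considered at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Single tree test accuracy:", round(single_tree.score(X_test, y_test), 3))
print("Random forest test accuracy:", round(forest.score(X_test, y_test), 3))
```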

This article covers the basic definitions of various Decision Tree algorithms and related parameters. It is recommended to read further on these concepts and to try them out on a different dataset.

Thanks for reading.
Don’t forget to click on 👏!
