A Big Data Approach to Decision Trees - Evaluating and Splitting



In the decision tree algorithm, the information gain is calculated for each feature; the feature with the highest gain is selected, and the parent node is split into left and right children [4].
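As a sketch of this step, the split selection could look like the following. This is not the post's actual implementation; it assumes binary numeric splits and entropy as the impurity measure, neither of which the post specifies:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Try every (feature, threshold) binary split and return the one
    with the highest information gain as (feature, threshold, gain)."""
    best = (None, None, 0.0)
    n = len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():
                continue  # split puts everything on one side: useless
            # weighted average entropy of the two children
            child = (mask.sum() / n) * entropy(y[mask]) \
                  + ((~mask).sum() / n) * entropy(y[~mask])
            gain = entropy(y) - child
            if gain > best[2]:
                best = (f, t, gain)
    return best
```

On a toy dataset where the first feature perfectly separates the classes, the returned split lands on that feature with the full entropy of the parent (1 bit for a balanced binary problem) as its gain.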

The process described above is applied recursively, in a breadth-first-search-like manner, until the information gain is zero for all features. This is the stopping criterion for the algorithm: it means a leaf has been reached, as all the instances in the subset belong to the same class [4].
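The breadth-first growth with this stopping criterion can be sketched as below. Again this is only an illustration, not the project's code; it assumes binary numeric splits with entropy-based gain, and it is self-contained (the small `best_split` helper here is a hypothetical stand-in for whatever split routine is actually used):

```python
from collections import deque
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Exhaustive search for the highest-gain binary split."""
    best = (None, None, 0.0)
    n = len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():
                continue
            child = (mask.sum() / n) * entropy(y[mask]) \
                  + ((~mask).sum() / n) * entropy(y[~mask])
            gain = entropy(y) - child
            if gain > best[2]:
                best = (f, t, gain)
    return best

def build_tree(X, y):
    """Grow the tree level by level (breadth-first). A frontier node
    whose best split has zero gain becomes a leaf."""
    root = {"indices": np.arange(len(y))}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        idx = node["indices"]
        f, t, g = best_split(X[idx], y[idx])
        if g == 0.0:  # pure node (or no split helps): make a leaf
            node["label"] = int(np.bincount(y[idx]).argmax())
            continue
        mask = X[idx, f] <= t
        node["feature"], node["threshold"] = f, t
        node["left"] = {"indices": idx[mask]}
        node["right"] = {"indices": idx[~mask]}
        queue.extend([node["left"], node["right"]])
    return root
```

The queue makes the traversal breadth-first: every node at depth *d* is split before any node at depth *d* + 1, matching the level-by-level growth described above.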

Decision trees are notorious overfitters, so methods to prevent overfitting are an important part of the algorithm. As depth increases, a node tends to become less significant, since fewer samples fall into it [2]. Two pre-pruning methods were therefore implemented in this project to deal with this issue: a maximum tree depth and a minimum node size (in number of samples).
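In code, these pre-pruning rules amount to a stopping check evaluated before each split. The sketch below is illustrative only; the function name, default thresholds, and the inclusion of the zero-gain rule alongside the two pruning criteria are assumptions, not the project's actual values:

```python
def should_stop(depth, n_samples, gain, max_depth=5, min_node_size=10):
    """Pre-pruning: refuse to split a node that is already at the
    maximum depth, holds too few samples, or gains nothing from
    its best split."""
    return depth >= max_depth or n_samples < min_node_size or gain == 0.0
```

When this returns `True`, the node is turned into a leaf labelled with its majority class instead of being split further.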