A Big Data Approach to Decision Trees - Introduction to Decision Trees
Published:
Decision trees are one of the most popular predictive algorithms because they are easy for humans to understand as simple if-then rules are enough to define the whole model[1]. They are greedy search based algorithms from the supervised learning group, which use divide-and-conquer strategy to solve complex problems. The combination of sub-problems solutions builds an acyclic connected graph where the name trees comes from. These models can be implemented to solve regression problems receiving the name regression trees, on the other hand they are also widely applied to the classification problem when they are named decision trees[2], which are the subject of this project.
Vertices from decision trees are classified as leaf nodes or division nodes based on their characteristics:
Nodes which have no children are known as leaf nodes, each of them represents a subspace from the solutions space and the union of all of them is equal to whole solutions space. By applying a function to leafs’ features we must be able to define a class represented by the node for classification purposes[2], in this project we choose the mode because the dataset has a binary class implying in a 0 or 1 error function.
All the other nodes of the tree are known as division nodes because they carry information for a conditional test that splits their data into left and right children[2]. When the dataset has binary (or previous binarized) features, it is possible to reduce the conditional test to checking a specific attribute.
In this project the tree structure was stored in a linked list so, a class named node was defined in the code that, in addition to properties above, also implements metadata properties for parent, left child and right child in every instance.