A Big Data Approach to Decision Trees
Published:
This work implements a decision tree from scratch to predict labels for dataset SUSY from UCI Machine Learning Repository, using Python 3 and Apache Spark but no machine learning libraries.
Published:
This work implements a decision tree from scratch to predict labels for dataset SUSY from UCI Machine Learning Repository, using Python 3 and Apache Spark but no machine learning libraries.
Published:
Decision trees are one of the most popular predictive algorithms because they are easy for humans to understand as simple if-then rules are enough to define the whole model[1]. They are greedy search based algorithms from the supervised learning group, which use divide-and-conquer strategy to solve complex problems. The combination of sub-problems solutions builds an acyclic connected graph where the name trees comes from. These models can be implemented to solve regression problems receiving the name regression trees, on the other hand they are also widely applied to the classification problem when they are named decision trees[2], which are the subject of this project.
Published:
The main task for decision tree algorithms is to evaluate how good splitting a node by the values of a given attribute is then, recursively perform the division for the best option until leafs are reached. The Information Gain (G) is the metric for choosing the best option and, is given by equation below for a Set (S) in relation to an Attribute (A),
Published:
In the decision tree algorithm, the information gain for each feature is calculated and the one which has the highest G is selected then, the parent node is split into left and right children[4].
Published:
Typical Big Data tasks include handling massive amounts of data which can’t be cached in a single node memory or, that sequential processing is unfeasible for the application due to the processing time. These operations are applied into samples dimension so, must be parallel computed in order to have practical utility. However, there are also jobs done in features dimension that is usually small, making it feasible to process them with no parallelism.
Published:
costFunction(RDD, metric = 'Entropy')
:
Published:
SplitByFeat(binRDD, feat_index)
:
Published:
meanRDD(RDD)
: eturns the mean value of every features by applying a (class, [features])
to ([features], 1)
map, reducing with adding and doing the division [sum of each feature]/n
.
Published:
Building the tree itself does not require parallel computing because it is basically on updating tree nodes properties, controlling the processing queue (list of tree nodes to be processed) and the model (dictionary of nodes in the tree).
Published:
Published:
[1] L. M. O. da Silva, Uma Aplicação de Árvores de Decisão, Redes Neurais e KNN Para a Identificação de Modelos ARMA Não-Sazionais e Sazionais, Ph.D. thesis, Pontifícia Universidade Cat[olica do Rio de Janeiro, 2005.
Published:
In this personal project I have implemented a simple temperature control system to cool the temperature of my aquarium down with world famous Raspberry PI 3.