# Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

## Marco Camargo

Industrial Automation Engineer & Data Science Enthusiast

## Markdown

This is a page not in the main menu.

## A Big Data Approach to Decision Trees

This work implements a decision tree from scratch to predict labels for the SUSY dataset from the UCI Machine Learning Repository, using Python 3 and Apache Spark but no machine learning libraries.

## A Big Data Approach to Decision Trees - Introduction to Decision Trees

Decision trees are among the most popular predictive algorithms because they are easy for humans to understand: simple if-then rules are enough to define the whole model. They are greedy search algorithms from the supervised learning family that use a divide-and-conquer strategy to solve complex problems. Combining the sub-problem solutions builds an acyclic connected graph, which is where the name "trees" comes from. These models can be used to solve regression problems, in which case they are called regression trees; they are also widely applied to classification problems, where they are called decision trees, which are the subject of this project.
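As a concrete illustration of the "simple if-then rules" idea (not the project's actual model), here is a toy two-feature classifier; the feature names and thresholds are made up:

```python
# A toy decision tree written as plain if-then rules.
# Feature names (x1, x2) and thresholds are hypothetical.
def predict(x1, x2):
    if x1 <= 0.5:
        return 'class A'
    if x2 <= 1.2:
        return 'class A'
    return 'class B'
```

Reading the model is just reading the branches, which is why decision trees are prized for interpretability.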

## A Big Data Approach to Decision Trees - Information Gain and Entropy

The main task for decision tree algorithms is to evaluate how good splitting a node by the values of a given attribute is, and then recursively perform the division for the best option until leaves are reached. The information gain (G) is the metric for choosing the best option and is given by the equation below for a set (S) in relation to an attribute (A).
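In the usual textbook notation (which the post's equation presumably follows), entropy and information gain are:

$$\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c$$

$$G(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $p_c$ is the proportion of samples in $S$ belonging to class $c$, and $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$.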

## A Big Data Approach to Decision Trees - Evaluating and Splitting

In the decision tree algorithm, the information gain for each feature is calculated, the feature with the highest G is selected, and then the parent node is split into left and right children.
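A Spark-free sketch of this selection step, with plain Python lists standing in for the RDD and features assumed binarized to 0/1 values; the function names are illustrative, not the project's:

```python
import math
from collections import Counter

def entropy(rows):
    # rows: list of (label, features) pairs standing in for the RDD
    counts = Counter(label for label, _ in rows)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(rows, n_features):
    # Try every feature, compute its information gain, keep the best.
    parent = entropy(rows)
    best = None
    for f in range(n_features):
        left = [r for r in rows if r[1][f] == 0]
        right = [r for r in rows if r[1][f] == 1]
        if not left or not right:
            continue  # split sends everything to one side: no gain
        n = len(rows)
        gain = parent - (len(left) / n) * entropy(left) \
                      - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, f, left, right)
    return best  # (gain, feature index, left rows, right rows)
```

In the actual implementation these per-feature evaluations run over Spark RDDs, but the selection logic is the same argmax over G.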

## A Big Data Approach to Decision Trees - Big Data and Parallelism

Typical Big Data tasks involve massive amounts of data that cannot be cached in a single node's memory, or for which sequential processing is unfeasible because of the processing time. These operations apply along the samples dimension, so they must be computed in parallel to have any practical utility. However, there are also jobs done along the features dimension, which is usually small, making it feasible to process them without parallelism.

## A Big Data Approach to Decision Trees - Cost Function

costFunction(RDD, metric = 'Entropy'):
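The excerpt cuts off at the signature. As a hedged, Spark-free sketch of what an entropy cost function over (class, features) records might look like (a plain list stands in for the RDD):

```python
import math
from collections import Counter

def cost_function(rows, metric='Entropy'):
    # rows: list of (label, features) pairs standing in for the RDD.
    # Counts the class frequencies, then applies the entropy formula.
    counts = Counter(label for label, _ in rows)
    n = sum(counts.values())
    if metric == 'Entropy':
        return -sum((c / n) * math.log2(c / n) for c in counts.values())
    raise ValueError('unsupported metric: %s' % metric)
```

A perfectly mixed node (half of each of two classes) scores 1.0 bit; a pure node scores 0.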

## A Big Data Approach to Decision Trees - Splitting at The Best Feature

SplitByFeat(binRDD, feat_index):
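Again as an illustrative sketch with a plain list in place of the RDD, and assuming binarized features as the `binRDD` name suggests:

```python
def split_by_feat(rows, feat_index):
    # rows: list of (label, features) pairs standing in for binRDD;
    # features are assumed binarized, so each value is 0 or 1.
    left = [r for r in rows if r[1][feat_index] == 0]
    right = [r for r in rows if r[1][feat_index] == 1]
    return left, right
```

In Spark this would be two `filter` operations on the RDD rather than list comprehensions.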

## A Big Data Approach to Decision Trees - Auxilary Functions

meanRDD(RDD): Returns the mean value of every feature by applying a (class, [features]) to ([features], 1) map, reducing with addition, and computing the division [sum of each feature]/n.
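The described map/reduce pipeline can be mimicked without Spark; this sketch uses `functools.reduce` on a plain list standing in for the RDD:

```python
from functools import reduce

def mean_rdd(rows):
    # rows: list of (class, [features]) pairs standing in for the RDD.
    # Step 1: map (class, [features]) -> ([features], 1)
    mapped = [(feats, 1) for _, feats in rows]
    # Step 2: reduce by element-wise addition of features and counts
    sums, n = reduce(
        lambda a, b: ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1]),
        mapped)
    # Step 3: divide each feature sum by the number of samples
    return [s / n for s in sums]
```

The Spark version would use `RDD.map` and `RDD.reduce` with the same pairing trick to carry the count alongside the running sums.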

## A Big Data Approach to Decision Trees - Building the Tree

Building the tree itself does not require parallel computing because it basically consists of updating tree-node properties, controlling the processing queue (the list of tree nodes to be processed), and maintaining the model (a dictionary of the nodes in the tree).
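A minimal sketch of that driver loop, with a dict for the model and a deque for the processing queue; the split choice is hard-coded to feature 0 purely for illustration, whereas the real algorithm would pick the highest-gain feature:

```python
from collections import deque

def build_tree(rows, max_depth=3):
    # model: dictionary of nodes keyed by node id
    # queue: tree nodes still to be processed, as (id, rows, depth)
    model = {}
    queue = deque([(0, rows, 0)])
    next_id = 1
    while queue:
        node_id, node_rows, depth = queue.popleft()
        labels = [label for label, _ in node_rows]
        if depth >= max_depth or len(set(labels)) == 1:
            # leaf: store the majority class
            model[node_id] = {'leaf': True,
                              'label': max(set(labels), key=labels.count)}
            continue
        # Hard-coded split on feature 0 (illustrative only; the real
        # algorithm selects the feature with the highest information gain).
        left = [r for r in node_rows if r[1][0] == 0]
        right = [r for r in node_rows if r[1][0] == 1]
        if not left or not right:
            model[node_id] = {'leaf': True,
                              'label': max(set(labels), key=labels.count)}
            continue
        model[node_id] = {'leaf': False, 'feat': 0,
                          'left': next_id, 'right': next_id + 1}
        queue.append((next_id, left, depth + 1))
        queue.append((next_id + 1, right, depth + 1))
        next_id += 2
    return model
```

Only the per-node split evaluation touches the full dataset and needs Spark; this bookkeeping loop runs happily on the driver.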
