This work implements a decision tree from scratch to predict labels for the SUSY dataset from the UCI Machine Learning Repository, using Python 3 and Apache Spark but no machine learning libraries.
Decision trees are among the most popular predictive algorithms because they are easy for humans to understand: simple if-then rules are enough to define the whole model[1]. They are greedy, search-based algorithms from the supervised learning family that use a divide-and-conquer strategy to solve complex problems. Combining the solutions of the sub-problems builds a connected acyclic graph, which is where the name tree comes from. These models can be applied to regression problems, where they are called regression trees; applied to classification problems, as in this project, they are called decision trees[2].
The main task of a decision tree algorithm is to evaluate how good a split of a node by the values of a given attribute is, and then to recursively perform the division for the best option until the leaves are reached. The Information Gain (G) is the metric for choosing the best option and is given by the equation below for a set (S) in relation to an attribute (A).
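In its standard form, with $S_v$ denoting the subset of $S$ for which attribute $A$ takes the value $v$, and $p_i$ the proportion of samples of class $i$ in $S$:

$$ G(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v), \qquad \mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i. $$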
In the decision tree algorithm, the information gain for each feature is calculated, the feature with the highest G is selected, and the parent node is then split into left and right children[4].
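As a minimal pure-Python sketch of this selection step (the function names are mine, and features are assumed to be already binned to 0/1, which the post does not spell out at this point):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feat_index):
    """G(S, A): parent entropy minus the weighted entropy of the children
    obtained by splitting on the (binary) feature at feat_index."""
    left, right = [], []
    for label, feats in rows:
        (left if feats[feat_index] == 0 else right).append(label)
    n = len(rows)
    parent = entropy([label for label, _ in rows])
    children = sum(len(side) / n * entropy(side) for side in (left, right) if side)
    return parent - children

def best_feature(rows, n_features):
    """Index of the feature with the highest information gain."""
    return max(range(n_features), key=lambda i: information_gain(rows, i))
```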
Typical Big Data tasks involve massive amounts of data that cannot be cached in the memory of a single node, or for which sequential processing is unfeasible due to the processing time. Such operations are applied along the samples dimension, so they must be computed in parallel to have practical utility. There are, however, also jobs done along the features dimension, which is usually small, making it feasible to process them without parallelism.
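The contrast can be made concrete with a small PySpark sketch (the file path, parsing, and SparkContext setup are assumptions; in the SUSY dataset each line holds the class label followed by 18 features):

```python
from pyspark import SparkContext

sc = SparkContext(appName="susy-decision-tree")

# Samples dimension: millions of SUSY rows, parsed and processed in parallel.
rdd = (sc.textFile("SUSY.csv")
         .map(lambda line: [float(x) for x in line.split(",")])
         .map(lambda cols: (cols[0], cols[1:])))  # (class, [features])

# A parallel reduce over the samples dimension: per-class record counts.
counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b).collect()

# Features dimension: only 18 columns, so a plain serial loop on the
# driver is cheap enough.
for feat_index in range(18):
    ...  # evaluate candidate splits for this feature, one at a time
```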
costFunction(RDD, metric = 'Entropy')
: Computes the impurity of the class labels in RDD using the chosen metric (entropy by default).
SplitByFeat(binRDD, feat_index)
: Splits the binned RDD into left and right children RDDs according to the values of the feature at feat_index.
meanRDD(RDD)
: Returns the mean value of every feature by applying a (class, [features]) to ([features], 1) map, reducing by addition, and performing the division [sum of each feature]/n.
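Following that description, a sketch of meanRDD could look like the following (the element-wise reduce is my reading of "reducing by addition"; a (class, [features]) PySpark RDD is assumed):

```python
def meanRDD(rdd):
    """Mean of every feature in a (class, [features]) RDD."""
    # Map (class, [features]) to ([features], 1), reduce with element-wise
    # addition of the feature vectors and addition of the counts ...
    sums, n = (rdd
               .map(lambda kv: (kv[1], 1))
               .reduce(lambda a, b: ([x + y for x, y in zip(a[0], b[0])],
                                     a[1] + b[1])))
    # ... then do the division [sum of each feature] / n.
    return [s / n for s in sums]
```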
Building the tree itself does not require parallel computing, because it mainly consists of updating tree node properties, controlling the processing queue (the list of tree nodes still to be processed), and maintaining the model (a dictionary of the nodes in the tree).
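That driver-side bookkeeping might be sketched as below (a hypothetical best_feature helper and a simplified SplitByFeat signature are assumed; the binary-heap node numbering is my choice, not the post's):

```python
from collections import deque

def build_tree(rdd, max_depth=5):
    """Serial driver loop: only node properties, a queue, and a dict."""
    model = {0: {"data": rdd, "depth": 0, "leaf": False}}   # dictionary of nodes
    queue = deque([0])                                      # nodes still to process
    while queue:
        node_id = queue.popleft()
        node = model[node_id]
        if node["depth"] >= max_depth:
            node["leaf"] = True
            continue
        feat_index = best_feature(node["data"])        # highest information gain
        left, right = SplitByFeat(node["data"], feat_index)
        node["feature"] = feat_index
        for side, child in ((1, left), (2, right)):
            child_id = 2 * node_id + side              # binary-heap numbering
            model[child_id] = {"data": child, "depth": node["depth"] + 1,
                               "leaf": False}
            queue.append(child_id)
        node["data"] = None                            # drop parent RDD reference
    return model
```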
[1] L. M. O. da Silva, Uma Aplicação de Árvores de Decisão, Redes Neurais e KNN Para a Identificação de Modelos ARMA Não-Sazonais e Sazonais, Ph.D. thesis, Pontifícia Universidade Católica do Rio de Janeiro, 2005.
In this personal project I implemented a simple temperature control system to cool my aquarium down with the world-famous Raspberry Pi 3.