A Big Data Approach to Decision Trees


This work implements a decision tree from scratch to predict labels for the SUSY dataset from the UCI Machine Learning Repository, using Python 3 and Apache Spark but no machine learning libraries.

The focus is on illustrating how decision trees work using Big Data concepts in an accessible way, not necessarily on achieving the best possible performance. If you believe something can be improved, comments and suggestions are very welcome by e-mail.


  1. Introduction to Decision Trees
  2. Big Data and Parallelism
  3. Building the Tree
  4. Implementation
  5. References