A Big Data Approach to Decision Trees - Big Data and Parallelism



Typical Big Data tasks involve datasets too large to fit in the memory of a single node, or workloads for which sequential processing takes too long to be practical for the application. These operations run along the sample dimension, so they must be computed in parallel to be of practical use. However, some jobs run along the feature dimension, which is usually small, so they can be processed without parallelism.

This project handles slices of the original dataset with up to 5 million samples, but every slice has only 18 features, so not every function had to be implemented in parallel.
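To make the distinction concrete, here is a minimal sketch in Python. It is not the project's code: the function names `partial_sums` and `feature_means` are hypothetical, and a thread pool stands in for whatever parallel engine the project actually uses (in CPython, threads will not speed up this pure-Python arithmetic; they only show the structure). The work along the sample dimension is split into chunks and processed concurrently, while the short loop over features runs sequentially.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Sample dimension: per-feature sums for one chunk of rows."""
    sums = [0.0] * len(chunk[0])
    for row in chunk:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, len(chunk)


def feature_means(samples, n_chunks=4):
    """Mean of every feature; assumes `samples` is a non-empty list of equal-length rows."""
    # Split the samples into roughly equal chunks (ceiling division).
    size = max(1, -(-len(samples) // n_chunks))
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]

    # Parallel part: one task per chunk of samples.
    # The thread pool is only a stand-in for a real parallel engine.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_sums, chunks))

    # Sequential part: the feature dimension is small (18 in this project),
    # so a plain loop over features is cheap.
    total = sum(count for _sums, count in partials)
    n_features = len(samples[0])
    return [sum(sums[j] for sums, _count in partials) / total
            for j in range(n_features)]
```

For example, `feature_means([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], n_chunks=2)` returns `[3.0, 4.0]`. The same shape applies at scale: the cost of the parallel step grows with the number of samples, while the final loop depends only on the number of features.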

The rest of this section briefly explains the main functions that were made parallel.