A Big Data Approach to Decision Trees - Splitting at The Best Feature

SplitByFeat(binRDD, feat_index):

Receives an RDD of (class, [features]) samples and the index of a feature, then filters the dataset on the value of that feature. Samples for which the feature is True are returned as a new RDD called left, and samples for which it is False as the right RDD.

def SplitByFeat(binRDD, feat_index):
    
    """
        Input: 
            binRDD     : distributed database in format (class, [features]) with features binarized
            feat_index : index of the feature to split binRDD
        
        Output:
           leftRDD     : distributed database in format (class, [features]) with samples where chosen feature is TRUE
           rightRDD    : distributed database in format (class, [features]) with samples where chosen feature is FALSE
    """
    
    # Left child -> the given feature is TRUE
    leftRDD  = binRDD.filter(lambda sample : sample[1][feat_index])

    # Right child -> the given feature is FALSE
    rightRDD = binRDD.filter(lambda sample : not sample[1][feat_index])

    return leftRDD, rightRDD    
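
To make the split concrete without a Spark cluster, here is a minimal sketch that applies the same two predicates to a plain Python list. The helper split_by_feat_local and the toy samples are assumptions made for illustration only; they are not part of the original code, and a list comprehension is eager whereas RDD.filter is lazy.

```python
# Toy stand-in for binRDD: a plain list of (class, [features]) tuples.
samples = [
    (1, [True,  False]),
    (0, [True,  True]),
    (1, [False, True]),
    (0, [False, False]),
]

def split_by_feat_local(samples, feat_index):
    """Plain-Python analogue of SplitByFeat (hypothetical helper, illustration only)."""
    # Same predicates as the two RDD filters, applied eagerly to a list
    left  = [s for s in samples if s[1][feat_index]]
    right = [s for s in samples if not s[1][feat_index]]
    return left, right

left, right = split_by_feat_local(samples, 0)
print(left)   # [(1, [True, False]), (0, [True, True])]
print(right)  # [(1, [False, True]), (0, [False, False])]
```

Every sample lands in exactly one of the two children, so the sizes of left and right add up to the size of the input. With Spark available, the same data could be wrapped with sc.parallelize(samples) (sc being an assumed SparkContext) and passed to SplitByFeat, and collecting both results should give the same samples.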

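This function is called by evalAndSplit(binRDD), which is not itself parallelized: it loops over the features on the driver, splits the node on each one, computes the information gain (G) of every candidate split, and returns the left and right RDDs together with the index of the best split feature and its gain.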

def evalAndSplit(binRDD):

    """
        Input: 
            binRDD     : distributed database in format (class, [features]) with features binarized
        
        Output:
           leftRDD     : distributed database in format (class, [features]) with samples where best option feature is TRUE
           rightRDD    : distributed database in format (class, [features]) with samples where best option feature is FALSE
           bestOption  : index of the feature with maximum information gain
           maxIG       : information gain for the bestOption feature
    """

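    # Number of features, read from the first sample (first() triggers a Spark job)
    n_features = len(binRDD.first()[1])

    # Information gain of splitting on each feature, evaluated one feature at a time on the driver
    candidatesIG = [infoGain(binRDD, SplitByFeat(binRDD, i)) for i in range(n_features)]

    # Best feature = highest information gain (ties resolve to the lowest index)
    maxIG      = max(candidatesIG)
    bestOption = candidatesIG.index(maxIG)

    # Redo the split on the winning feature; filter is lazy, so only the definition is rebuilt here
    left, right = SplitByFeat(binRDD, bestOption)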
    
    return left, right, bestOption, maxIG
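
infoGain is not shown in this section, so the following is only a sketch of the overall selection logic on plain Python lists, assuming an entropy-based information gain. The names entropy, info_gain_local and eval_and_split_local are hypothetical and are not the author's implementation; info_gain_local simply mirrors the call shape used above, taking the parent node and a (left, right) pair.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain_local(parent, children):
    """Parent entropy minus the size-weighted entropy of the children (assumed definition)."""
    parent_labels = [cls for cls, _ in parent]
    weighted = sum(
        (len(child) / len(parent_labels)) * entropy([cls for cls, _ in child])
        for child in children
    )
    return entropy(parent_labels) - weighted

def split_local(samples, i):
    """Split a list of (class, [features]) samples on feature i."""
    left  = [s for s in samples if s[1][i]]
    right = [s for s in samples if not s[1][i]]
    return left, right

def eval_and_split_local(samples):
    """List-based analogue of evalAndSplit: try every feature, keep the best one."""
    n_features = len(samples[0][1])
    gains = [info_gain_local(samples, split_local(samples, i)) for i in range(n_features)]
    max_ig = max(gains)
    best = gains.index(max_ig)
    left, right = split_local(samples, best)
    return left, right, best, max_ig

# Feature 0 matches the class exactly; feature 1 carries no information.
samples = [
    (1, [True,  True]),
    (1, [True,  False]),
    (0, [False, True]),
    (0, [False, False]),
]
left, right, best, max_ig = eval_and_split_local(samples)
print(best, max_ig)  # 0 1.0
```

In this toy dataset the parent has one bit of entropy and splitting on feature 0 yields two pure children, so the gain is the full bit; splitting on feature 1 leaves both children as mixed as the parent, for a gain of zero.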