# A Big Data Approach to Decision Trees - Cost Function


`costFunction(RDD, metric='Entropy')`

This function receives an RDD of (class, [features]) tuples and returns its entropy for a binary class. It maps the RDD to (class, 1) tuples and then reduces by element-wise addition. Since the class labels are {0, 1}, the first element of the reduced tuple is the number of class-1 samples and the second is the total number of samples in the dataset.

```python
from math import log

def costFunction(RDD, metric='Entropy'):
    """
    Input:
        RDD    : distributed dataset in format (class, [features])
        metric : metric to be returned {'Gini' or 'Entropy'}

    Output:
        tuple  : (metric, input RDD size)
    """
    if RDD.count() > 0:
        # Count class-1 samples and total samples in a single reduce.
        labelOne, n = ( RDD
                        .map(lambda sample: (sample[0], 1))
                        .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])) )

        p1 = labelOne / n
        p0 = 1 - p1

        if metric == 'Gini':
            metric = 1 - (p0**2 + p1**2)
        elif metric == 'Entropy':
            # lim x*log(x), x --> 0, is 0
            if p0 == 0 or p1 == 0:
                metric = 0
            else:
                metric = -(p0 * log(p0, 2) + p1 * log(p1, 2))

        return (metric, n)
    else:
        return (0, 0)
```
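To see the counting-and-entropy logic without a Spark cluster, here is a minimal plain-Python sketch of the same computation over a list of {0, 1} labels. `entropy_of_labels` is a hypothetical name introduced here for illustration; it mirrors costFunction's map/reduce step with an ordinary loop.

```python
from math import log

def entropy_of_labels(labels):
    # Mirror of costFunction's counting step: each sample contributes
    # (label, 1), so summing gives the class-1 count and the total count.
    label_one, n = 0, 0
    for label in labels:
        label_one += label
        n += 1
    if n == 0:
        return (0, 0)
    p1 = label_one / n
    p0 = 1 - p1
    if p0 == 0 or p1 == 0:  # lim x*log(x) as x -> 0 is 0
        return (0, n)
    return (-(p0 * log(p0, 2) + p1 * log(p1, 2)), n)

print(entropy_of_labels([0, 0, 1, 1]))  # balanced classes -> (1.0, 4)
```

A perfectly balanced node has entropy 1 bit; a pure node (all one class) has entropy 0, which is the special case the limit comment handles.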


With those values, the probability of class 1 (p1) is found, the probability of class 0 (p0) is (1 - p1), and the entropy can then be calculated. This function is called by the non-parallel function infoGain(parentRDD, childsList), which computes the information gain from the entropy of a parent node and a list of its children.

```python
def infoGain(parentRDD, childsList):
    """
    Input:
        parentRDD  : parent node distributed dataset in format (class, [features])
        childsList : list of distributed datasets in format (class, [features]),
                     one per child node

    Output:
        real number representing the information gain
    """
    parentEntropy, parentN = costFunction(parentRDD, 'Entropy')

    childsCosts = []
    childsEntropy = 0

    # Entropy and size of each child node.
    for child in childsList:
        childsCosts.append(costFunction(child, 'Entropy'))

    # Average of the children's entropies, weighted by each child's
    # share of the parent's samples.
    for child in childsCosts:
        childsEntropy += child[0] * (child[1] / parentN)

    return parentEntropy - childsEntropy
```
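The same weighting can be checked with a small worked example in plain Python, again without Spark. `entropy` and `info_gain` are illustrative names introduced here, not part of the code above, but they apply the identical formula: parent entropy minus the size-weighted mean of the children's entropies.

```python
from math import log

def entropy(labels):
    # Binary entropy of a list of {0, 1} labels (0 for an empty or pure node).
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1 - p1
    if p0 == 0 or p1 == 0:
        return 0.0
    return -(p0 * log(p0, 2) + p1 * log(p1, 2))

def info_gain(parent, children):
    # Parent entropy minus the size-weighted average of the children's
    # entropies, the same weighting used in infoGain above.
    n = len(parent)
    weighted = sum(entropy(c) * (len(c) / n) for c in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty: gain equals the parent entropy.
print(info_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # -> 1.0
```

A split that leaves each child as mixed as the parent yields a gain of 0, which is why the tree-growing procedure prefers splits that maximize this quantity.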