A Big Data Approach to Decision Trees - Cost Function

1 minute read


costFunction(RDD, metric = 'Entropy'):

This function receives a (class, [features]) RDD and returns its entropy based on a binary class. It maps the RDD to (class, 1) like tuples then reduces it by adding samples. As class values are {0,1}, its sum gives the the number of samples of class 1 and the second element of the tuple is number of samples in the dataset.

def costFunction(RDD, metric = 'Entropy'):
            RDD    : distributed database in format (class, [features])
            metric : metric to be returned {Gini or Entropia}
            tuple  : (metric, input RDD size) 
    if RDD.count() > 0:

        labelOne, n = ( RDD
                        .map(lambda sample : (sample[0],1))
                        .reduce(lambda x,y : (x[0]+y[0],x[1]+y[1])) )

        p1 = labelOne/n
        p0 = 1 - p1

        if metric == 'Gini':
            metric = 1 - (p0**2 + p1**2)
        elif metric == 'Entropy':
        #lim x.log(x), x-->0 = 0
            if p0 == 0 or p1 == 0:
                metric = 0
                metric = (-1)*(p0*log(p0,2) + p1*log(p1,2))

        return (metric, n)   

        return (0, 0) 

With those values the probability of class 1 (p1) is found, the probability of class 0 (p0) is (1 - p1) then entropy can be calculated. This function is called by the non-parallel function infoGain(parentRDD, childsList) which calculates the information gain based on the entropies of a parent node and a list of its children.

def infoGain(parentRDD, childsList):
            parentRDD  : parent node distributed database in format (class, [features])
            childsList : list of distributed database in format (class, [features]) for childres nodes
            número real que representa o ganho de informação
    parentEntropy, parentN = costFunction(parentRDD,'Entropy')
    childsCosts = []
    childsEntropy = 0
    for child in childsList:
        childsCosts.append(costFunction(child, 'Entropy'))
    for child in childsCosts:
        childsEntropy += child[0]*(child[1]/parentN)
    return parentEntropy - childsEntropy