# A Big Data Approach to Decision Trees - Cost Function


`costFunction(RDD, metric='Entropy')`

This function receives an RDD of (class, [features]) tuples and returns its entropy for a binary class. It maps the RDD to (class, 1) tuples and then reduces by element-wise addition. Since the class labels are {0, 1}, the first element of the reduced tuple is the number of class-1 samples and the second is the total number of samples in the dataset.

```python
from math import log

def costFunction(RDD, metric='Entropy'):
    """
    Input:
        RDD    : distributed dataset in format (class, [features])
        metric : metric to be returned {'Gini' or 'Entropy'}

    Output:
        tuple  : (metric, input RDD size)
    """
    if RDD.count() > 0:
        # Count class-1 samples and total samples in a single reduce.
        labelOne, n = ( RDD
                        .map(lambda sample: (sample[0], 1))
                        .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])) )

        p1 = labelOne / n
        p0 = 1 - p1

        if metric == 'Gini':
            metric = 1 - (p0**2 + p1**2)
        elif metric == 'Entropy':
            # lim x*log(x), x --> 0, is 0
            if p0 == 0 or p1 == 0:
                metric = 0
            else:
                metric = -(p0 * log(p0, 2) + p1 * log(p1, 2))

        return (metric, n)
    else:
        return (0, 0)
```
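To see the counting-and-entropy logic without a Spark cluster, here is a minimal plain-Python sketch of the same computation over a list of {0, 1} labels. `entropy_of_labels` is a hypothetical name introduced here for illustration; it mirrors costFunction's map/reduce step with an ordinary loop.

```python
from math import log

def entropy_of_labels(labels):
    # Mirror of costFunction's counting step: each sample contributes
    # (label, 1), so summing gives the class-1 count and the total count.
    label_one, n = 0, 0
    for label in labels:
        label_one += label
        n += 1
    if n == 0:
        return (0, 0)
    p1 = label_one / n
    p0 = 1 - p1
    if p0 == 0 or p1 == 0:  # lim x*log(x) as x -> 0 is 0
        return (0, n)
    return (-(p0 * log(p0, 2) + p1 * log(p1, 2)), n)

print(entropy_of_labels([0, 0, 1, 1]))  # balanced classes -> (1.0, 4)
```

A perfectly balanced node has entropy 1 bit; a pure node (all one class) has entropy 0, which is the special case the limit comment handles.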


With those values, the probability of class 1 (p1) is found, the probability of class 0 (p0) is (1 - p1), and the entropy can then be calculated. This function is called by the non-parallel function infoGain(parentRDD, childsList), which computes the information gain from the entropy of a parent node and a list of its children.

```python
def infoGain(parentRDD, childsList):
    """
    Input:
        parentRDD  : parent node distributed dataset in format (class, [features])
        childsList : list of distributed datasets in format (class, [features]),
                     one per child node

    Output:
        real number representing the information gain
    """
    parentEntropy, parentN = costFunction(parentRDD, 'Entropy')

    childsCosts = []
    childsEntropy = 0

    # Entropy and size of each child node.
    for child in childsList:
        childsCosts.append(costFunction(child, 'Entropy'))

    # Average of the children's entropies, weighted by each child's
    # share of the parent's samples.
    for child in childsCosts:
        childsEntropy += child[0] * (child[1] / parentN)

    return parentEntropy - childsEntropy
```
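The same weighting can be checked with a small worked example in plain Python, again without Spark. `entropy` and `info_gain` are illustrative names introduced here, not part of the code above, but they apply the identical formula: parent entropy minus the size-weighted mean of the children's entropies.

```python
from math import log

def entropy(labels):
    # Binary entropy of a list of {0, 1} labels (0 for an empty or pure node).
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1 - p1
    if p0 == 0 or p1 == 0:
        return 0.0
    return -(p0 * log(p0, 2) + p1 * log(p1, 2))

def info_gain(parent, children):
    # Parent entropy minus the size-weighted average of the children's
    # entropies, the same weighting used in infoGain above.
    n = len(parent)
    weighted = sum(entropy(c) * (len(c) / n) for c in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty: gain equals the parent entropy.
print(info_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # -> 1.0
```

A split that leaves each child as mixed as the parent yields a gain of 0, which is why the tree-growing procedure prefers splits that maximize this quantity.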