A Big Data Approach to Decision Trees - Cost Function
costFunction(RDD, metric = 'Entropy')
This function receives an RDD of (class, [features]) pairs and returns its entropy for a binary class. It maps the RDD to (class, 1) tuples and then reduces it by element-wise addition. Since the class values are {0, 1}, the sum of the first elements gives the number of class-1 samples, and the sum of the second elements gives the total number of samples in the dataset.
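Before the Spark version, the same counting trick can be sketched in plain Python over a made-up sample list (no cluster needed; the data here is purely illustrative):

from functools import reduce

# Made-up (class, [features]) samples; classes are 0 or 1
samples = [(1, [0.2]), (0, [0.7]), (1, [0.5]), (1, [0.9])]

# Map each sample to (class, 1), then add both fields pairwise
pairs = [(label, 1) for label, _ in samples]
labelOne, n = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs)
print(labelOne, n)  # 3 4 -> three class-1 samples out of four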
from math import log

def costFunction(RDD, metric = 'Entropy'):
    """
    Input:
        RDD : distributed database in format (class, [features])
        metric : metric to be returned {Gini or Entropy}
    Output:
        tuple : (metric, input RDD size)
    """
    if RDD.count() > 0:
        # Map each sample to (class, 1); reducing by element-wise addition
        # yields (number of class-1 samples, total number of samples).
        labelOne, n = ( RDD
                        .map(lambda sample : (sample[0], 1))
                        .reduce(lambda x, y : (x[0] + y[0], x[1] + y[1])) )
        p1 = labelOne / n
        p0 = 1 - p1
        if metric == 'Gini':
            metric = 1 - (p0**2 + p1**2)
        elif metric == 'Entropy':
            # lim x*log(x) as x -> 0 is 0, so a pure node has zero entropy
            if p0 == 0 or p1 == 0:
                metric = 0
            else:
                metric = (-1) * (p0 * log(p0, 2) + p1 * log(p1, 2))
        return (metric, n)
    else:
        return (0, 0)
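As a quick sanity check, here is a minimal usage sketch; the toy data and the SparkContext named sc are assumptions for illustration, not part of the original post:

# Assumes a live SparkContext named sc; toy data is hypothetical
data = sc.parallelize([(1, [2.0]), (1, [3.5]), (0, [1.0]), (1, [4.2])])
print(costFunction(data))          # (≈0.811, 4): entropy of 3 ones, 1 zero
print(costFunction(data, 'Gini'))  # (0.375, 4)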
Inside costFunction, the probability of class 1 is p1 = labelOne/n, the probability of class 0 is p0 = 1 - p1, and the entropy follows as -(p0*log2(p0) + p1*log2(p1)). This function is called by the non-parallel function infoGain(parentRDD, childsList), which computes the information gain from the entropy of a parent node and the weighted entropies of its children.
def infoGain(parentRDD, childsList):
    """
    Input:
        parentRDD : parent node distributed database in format (class, [features])
        childsList : list of distributed databases in format (class, [features]) for the child nodes
    Output:
        real number representing the information gain
    """
    parentEntropy, parentN = costFunction(parentRDD, 'Entropy')
    if parentN == 0:
        # Empty parent node: no samples, no information to gain
        return 0
    childsCosts = []
    childsEntropy = 0
    # Entropy of each child, weighted by its share of the parent's samples
    for child in childsList:
        childsCosts.append(costFunction(child, 'Entropy'))
    for child in childsCosts:
        childsEntropy += child[0] * (child[1] / parentN)
    return parentEntropy - childsEntropy
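To illustrate, a hypothetical split of the same toy RDD from the earlier sketch on its single feature at threshold 2.5 would give:

# Hypothetical split on the first feature at threshold 2.5
left = data.filter(lambda s: s[1][0] < 2.5)    # (1, [2.0]), (0, [1.0]) -> entropy 1.0
right = data.filter(lambda s: s[1][0] >= 2.5)  # (1, [3.5]), (1, [4.2]) -> entropy 0.0
print(infoGain(data, [left, right]))  # ≈0.811 - (0.5*1.0 + 0.5*0.0) ≈ 0.311

Here the right child is pure, so all remaining entropy sits in the left child, and the split recovers about 0.311 bits of information.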