A Big Data Approach to Decision Trees - Splitting at The Best Feature
SplitByFeat(binRDD, feat_index): receives an RDD of (class, [features]) pairs and the index of a feature, then filters the dataset on the value of that feature. Samples for which the feature is True are returned as a new RDD called left, and samples for which it is False are returned as the right RDD.
def SplitByFeat(binRDD, feat_index):
    """
    Input:
        binRDD : distributed dataset in format (class, [features]) with features binarized
        feat_index : index of the feature used to split binRDD
    Output:
        leftRDD : distributed dataset in format (class, [features]) with samples where the chosen feature is TRUE
        rightRDD : distributed dataset in format (class, [features]) with samples where the chosen feature is FALSE
    """
    # Left child -> the given feature is TRUE
    leftRDD = binRDD.filter(lambda sample: sample[1][feat_index])
    # Right child -> the given feature is FALSE
    rightRDD = binRDD.filter(lambda sample: not sample[1][feat_index])
    return leftRDD, rightRDD
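To make the left/right convention concrete, here is a minimal sketch that applies the same predicate to a plain Python list instead of an RDD. The name split_by_feat_local and the toy samples are hypothetical and exist only for illustration; they are not part of the original code.

```python
# Hypothetical local stand-in for SplitByFeat: plain lists instead of RDDs.
def split_by_feat_local(samples, feat_index):
    """Split (class, [features]) samples on one binarized feature."""
    # Left child: the chosen feature is True
    left = [s for s in samples if s[1][feat_index]]
    # Right child: the chosen feature is False
    right = [s for s in samples if not s[1][feat_index]]
    return left, right


# Toy data invented for this example: (class, [feature_0, feature_1])
toy = [
    (1, [True, False]),
    (0, [False, False]),
    (1, [True, True]),
    (0, [False, True]),
]

left, right = split_by_feat_local(toy, 0)
print(left)   # [(1, [True, False]), (1, [True, True])]
print(right)  # [(0, [False, False]), (0, [False, True])]
```

Every sample lands in exactly one of the two children, so the sizes of left and right always add up to the size of the input.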
This function is called by the non-parallel function evalAndSplit(binRDD), which splits the node on every feature in turn, computes the information gain G of each candidate split, and returns left, right, the index of the best split feature, and its G.
def evalAndSplit(binRDD):
    """
    Input:
        binRDD : distributed dataset in format (class, [features]) with features binarized
    Output:
        leftRDD : distributed dataset in format (class, [features]) with samples where the best feature is TRUE
        rightRDD : distributed dataset in format (class, [features]) with samples where the best feature is FALSE
        bestOption : index of the feature with maximum information gain
        maxIG : information gain for the bestOption feature
    """
    # Number of features, read from the first sample
    numFeatures = len(binRDD.first()[1])
    # Information gain of the split on each feature (infoGain is defined elsewhere)
    candidatesIG = []
    for i in range(numFeatures):
        candidatesIG.append(infoGain(binRDD, SplitByFeat(binRDD, i)))
    # Pick the feature with the highest gain (first one wins on ties)
    maxIG = max(candidatesIG)
    bestOption = candidatesIG.index(maxIG)
    # Redo the split on the winning feature and return both children
    left, right = SplitByFeat(binRDD, bestOption)
    return left, right, bestOption, maxIG
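The selection logic can be checked on a small in-memory dataset. The sketch below mirrors evalAndSplit with plain lists and assumes that information gain is the parent's Shannon entropy minus the size-weighted entropy of the two children. The infoGain used above is not shown in this section and may be defined differently, so entropy, info_gain_local, split_by_feat_local and eval_and_split_local are all hypothetical names.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the class labels of (class, [features]) samples."""
    n = len(samples)
    if n == 0:
        return 0.0
    counts = Counter(label for label, _ in samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def info_gain_local(parent, children):
    """Assumed definition: parent entropy minus size-weighted child entropy."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def split_by_feat_local(samples, feat_index):
    """List-based stand-in for SplitByFeat."""
    left = [s for s in samples if s[1][feat_index]]
    right = [s for s in samples if not s[1][feat_index]]
    return left, right


def eval_and_split_local(samples):
    """List-based stand-in for evalAndSplit: try every feature, keep the best."""
    n_features = len(samples[0][1])
    gains = [
        info_gain_local(samples, split_by_feat_local(samples, i))
        for i in range(n_features)
    ]
    max_ig = max(gains)
    best = gains.index(max_ig)  # first index wins on ties
    left, right = split_by_feat_local(samples, best)
    return left, right, best, max_ig


# Toy data invented for this example: feature 0 separates the classes, feature 1 does not.
toy = [
    (1, [True, False]),
    (0, [False, False]),
    (1, [True, True]),
    (0, [False, True]),
]

left, right, best, max_ig = eval_and_split_local(toy)
print(best, max_ig)  # 0 1.0
```

On this toy data, feature 0 yields two pure children (gain of 1 bit) while feature 1 leaves both children evenly mixed (gain of 0), so index 0 is selected.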