The main task for decision tree algorithms is to evaluate how good splitting a node by the values of a given attribute is then, recursively perform the division for the best option until leafs are reached. The Information Gain (G) is the metric for choosing the best option and, is given by equation below for a Set (S) in relation to an Attribute (A),
where Sv is each subgroup of S generated after splitting it by values of A, wv is the proportion between the sizes of Sv and S. Finally, the Entropy (I) is given by equation below,
where P(value_i) is the probability of each possible value of the class.
Textually, entropy represents how mixed a set is in relation to the class, in the binary case if almost all the instances are from only one class the entropy approaches zero however, if the set is almost half/half the metric approaches one .
Furthermore, the information gain is the difference between the entropy of the parent node and the weighted average of children entropies representing how good splitting at that attribute is for evaluating the dataset.