Q: why information gain ratio?
A: First let's see what it represents. The numerator is the information we learn about the label. The denominator however, represents the information we learn about the attribute (feature). Or in other words, the information necessary to specify the feature value. The idea is that features with very many possible states have an unfair advantage because in the extreme case, every data example will fall in a separate feature bin and we seemingly learn a lot about the label. So by dividing by the number of bits that we are learning about the feature we compensate in the right direction: features with high entropy will be divided by a larger number and will hence be penalized more.
However....ultimately, it is a big hack. Although the penalty is in the correct direction, the size of the effect cannot be derived in any principled way. Assume we have an infinite number of data-cases. We should then not have to worry about the entropy of the feature itself: we really care about how much we learn about the label. The issue comes into play for small data-set sizes. The reason is that for a small number of examples it is much easier to drop into separate feature bins if the number of bins is large, and so it looks like we are learning about the label while in reality it was just a random fluctuation. So, it is a clean case of overfitting. As we know, overfitting is a function of the number of examples in the training set. But the gain ration does not depend on the number of training examples! Conclusion: it's a hack and you need to be careful with it.
It would be much better to think of this problem in the context of overfitting directly. We not only want to ask ourselves which of the features reveals more about the label (if we could extrapolate to an infinite number of data-cases) but we need to ask ourselves if adding a feature is a good idea at all (i.e. would lead to overfitting). For this one can try to compute the probability that the feature does not reveal any information at all (i.e. an equal number of examples of both classes in each feature bin when the number of examples is taken to be very large). The observed information gain must then be a small sample effect. Note that this effect is much stronger for features with many states because it will be much easier to explain the computed information gain by a random fluctuation. Let's assume for a moment that we can compute:
P: the probability that the observed information gain is real (i.e. not a random fluke).
I would propose the following measure to be a more reasonable test between features:
Expected Information-Gain = P * Information-Gain
Note that P=P(n) is a function of the number of data-cases and for n=infinity we have P=1: any observed information is real. However when n is small P will be close to 0 and the expected information gain will be small. If we simply require that the expected information gain is larger than a some threshold (which we perhaps choose by cross-validation) then we have a principled measure to pick the next best feature (or not not pick any feature at all).