Tuesday, April 20, 2010

More On Chi-Square Tests for Decision Trees

There were some additional questions about using Chi-square tests for decision trees. I have found this excellent tutorial on the web: http://www.ics.uci.edu/~welling/teaching/273ASpring10/recitation4_decision_tree.pdf.

Some important points to keep in mind:
In class we only looked at an attribute with two values, but when there are more values it is easiest to simply compute the chi-square statistic over all values of the label Y and all values of the attribute F. The statistic is then a double sum: one over the Y-values and one over the possible F-values. It is important to correct for this by selecting the right number of degrees of freedom for the chi-square test, which is now given by dof=(|F|-1)x(|Y|-1). For |Y|=2 and |F|=2, as before, we have dof=1. However, if a feature has many more F-values, it gets more degrees of freedom in the chi-square test and is hence automatically penalized more for having many choices.

So, after you compute chi^2 using this double sum, you first check which feature has the smallest p-value. Then, for that feature, you ask whether the null hypothesis is rejected (are the observed counts significantly different from what can be expected from random fluctuations around the expected counts?). One usually rejects the null hypothesis for p<0.05. When you cannot reject the null hypothesis, you do not add any feature to the tree and stop splitting at that node.
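
As a concrete illustration, here is a minimal Python sketch of this test for a single candidate feature (numpy and scipy are assumed on my part, since the tutorial gives no code, and the function name is my own):

    import numpy as np
    from scipy.stats import chi2

    def chi_square_p_value(feature_values, labels):
        """Chi-square test of independence between a feature F and the label Y.

        The statistic is a double sum over all F-values and all Y-values,
        tested with dof = (|F|-1) x (|Y|-1) degrees of freedom.
        """
        feature_values = np.asarray(feature_values)
        labels = np.asarray(labels)
        f_vals = np.unique(feature_values)
        y_vals = np.unique(labels)
        n = len(labels)

        stat = 0.0
        for f in f_vals:
            n_f = np.sum(feature_values == f)      # total count for this F-value
            for y in y_vals:
                n_y = np.sum(labels == y)          # total count for this Y-value
                observed = np.sum((feature_values == f) & (labels == y))
                expected = n_f * n_y / n           # expected count under independence
                stat += (observed - expected) ** 2 / expected

        dof = (len(f_vals) - 1) * (len(y_vals) - 1)
        return chi2.sf(stat, dof)                  # p-value: P(chi^2 >= stat)

At a node one would call this for every candidate feature, split on the feature with the smallest p-value when p<0.05, and otherwise stop growing the tree at that node.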

Tuesday, April 13, 2010

Information Gain Ratio

Q: Why information gain ratio?

A: First let's see what it represents. The numerator is the information we learn about the label. The denominator, however, represents the information we learn about the attribute (feature), or in other words, the information necessary to specify the feature value. The idea is that features with very many possible states have an unfair advantage: in the extreme case, every data example falls into a separate feature bin and we seemingly learn a lot about the label. So by dividing by the number of bits that we learn about the feature, we compensate in the right direction: features with high entropy are divided by a larger number and are hence penalized more.
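
To make the two quantities concrete, here is a minimal Python sketch of the gain ratio (numpy assumed; the function names are my own):

    import numpy as np

    def entropy(values):
        """Entropy, in bits, of the empirical distribution of `values`."""
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(feature_values, labels):
        """Bits learned about the label Y by observing the feature F."""
        feature_values, labels = np.asarray(feature_values), np.asarray(labels)
        n = len(labels)
        h_cond = 0.0                               # conditional entropy H(Y|F)
        for f in np.unique(feature_values):
            mask = feature_values == f
            h_cond += mask.sum() / n * entropy(labels[mask])
        return entropy(labels) - h_cond            # numerator: H(Y) - H(Y|F)

    def gain_ratio(feature_values, labels):
        """Information gain about Y divided by the entropy of F itself."""
        h_f = entropy(feature_values)              # denominator: bits to specify F
        return information_gain(feature_values, labels) / h_f if h_f > 0 else 0.0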

However... ultimately, it is a big hack. Although the penalty is in the correct direction, the size of the effect cannot be derived in any principled way. Assume we have an infinite number of data-cases. We should then not have to worry about the entropy of the feature itself: we really care about how much we learn about the label. The issue comes into play for small data-set sizes. The reason is that with a small number of examples it is much easier for them to drop into separate feature bins when the number of bins is large, so it looks like we are learning about the label while in reality it was just a random fluctuation. So, it is a clean case of overfitting. As we know, overfitting is a function of the number of examples in the training set. But the gain ratio does not depend on the number of training examples! Conclusion: it's a hack and you need to be careful with it.
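
To see this last point concretely, here is a hypothetical check (reusing the gain_ratio sketch above, with made-up toy data): replicating every example ten times leaves every proportion, and hence the gain ratio, exactly unchanged, even though the danger of overfitting with 8 examples is very different from that with 80.

    import numpy as np

    # Hypothetical toy data: 8 examples with a binary feature and a binary label.
    f = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

    # Replicating every example 10x changes none of the proportions,
    # so the gain ratio is identical for n=8 and n=80:
    print(gain_ratio(f, y))
    print(gain_ratio(np.tile(f, 10), np.tile(y, 10)))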

It would be much better to think of this problem in the context of overfitting directly. We not only want to ask ourselves which of the features reveals more about the label (if we could extrapolate to an infinite number of data-cases), but also whether adding a feature is a good idea at all (i.e. whether it would lead to overfitting). For this, one can try to compute the probability that the feature does not reveal any information at all (i.e. that with a very large number of examples there would be an equal number of examples of both classes in each feature bin); the observed information gain must then be a small-sample effect. Note that this effect is much stronger for features with many states, because it is much easier to explain the computed information gain by a random fluctuation. Let's assume for a moment that we can compute:

P: the probability that the observed information gain is real (i.e. not a random fluke).

I would propose the following measure as a more reasonable test between features:

Expected Information-Gain = P * Information-Gain

Note that P=P(n) is a function of the number of data-cases, and for n=infinity we have P=1: any observed information gain is real. However, when n is small, P will be close to 0 and the expected information gain will be small. If we simply require that the expected information gain is larger than some threshold (which we perhaps choose by cross-validation), then we have a principled measure to pick the next best feature (or to not pick any feature at all).
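
The argument deliberately leaves open how to compute P. As one possible stand-in (my own, not something proposed above), here is a sketch that estimates P by a permutation test: shuffle the labels many times and count how often pure chance produces at least the observed information gain. It reuses information_gain from the earlier sketch.

    import numpy as np

    def expected_information_gain(feature_values, labels, n_shuffles=1000, seed=0):
        """P(n) * Information-Gain, with P estimated by a permutation test.

        The permutation estimate of P is a stand-in; the post leaves the
        computation of P open.
        """
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        observed = information_gain(feature_values, labels)
        flukes = 0
        for _ in range(n_shuffles):
            # A shuffle destroys any real F-Y relationship, so any gain
            # measured here is purely a small-sample fluctuation.
            if information_gain(feature_values, rng.permutation(labels)) >= observed:
                flukes += 1
        p_real = 1.0 - flukes / n_shuffles     # estimate of P: gain is not a fluke
        return p_real * observed

This estimate behaves as the argument requires: for large n a shuffle produces almost no gain, so P approaches 1, while for a few examples spread over many feature bins the shuffles often match the observed gain and P collapses toward 0.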

Thursday, April 1, 2010

Transductive Learning

Q: I've come across transductive learning while I was reading about the classification task in ML. At first I thought it was just another name for semi-supervised learning, but then some people claim that the key difference is that in transductive learning the point is not to construct a classification function f, but rather to "transfer" the information from labeled examples to unlabeled. Can you elaborate more on transductive learning please? Is it really different from semi-supervised learning? In which settings is it proper?

A: What I understand about transductive learning is that you are given beforehand the data-cases on which you are going to be tested. If we denote by Xtrn the training features and Ytrn the training labels, then supervised learning learns a function Y=F(X), assuming perhaps that the test cases will be sampled from the same distribution as the training examples: [Xtrn,Ytrn]~P and [Xtst,Ytst]~P. However, in transductive learning you know more: you are given Xtst explicitly. This extra information can be used to improve performance on just those Xtst. Semi-supervised learning uses both labeled and unlabeled training data to build a classifier for any test data. So the difference lies in what you will be tested on. For a more in-depth discussion please read: http://www.kyb.tuebingen.mpg.de/ssl-book/discussion.pdf