There were some additional questions about using Chi-square tests for decision trees. I have found this excellent tutorial on the web: http://www.ics.uci.edu/~welling/teaching/273ASpring10/recitation4_decision_tree.pdf.
Some important points to keep in mind:
In class we only looked at one value out of two for the attribute, but when there are more values it is easiest to simply compute the chi-square statistic over all values of the label Y and all values of the attribute F. The statistic is then a double sum, one over the Y-values and one over the possible F-values. It is important that you correct for this by selecting the right number of degrees of freedom for the chi-square test, which is now given by dof = (|F|-1) x (|Y|-1). For |Y|=2 and |F|=2, as before, we have dof=1. A feature with many more F-values, however, gets more degrees of freedom in the chi-square test and is therefore automatically penalized more for having many choices.

So, after you compute chi^2 using this double sum for every candidate feature, you first check which feature has the smallest p-value. For that feature you then ask whether the null hypothesis of independence is rejected (are the observed counts significantly different from what can be expected from random fluctuations around the expected counts?). One usually rejects the null hypothesis for p < 0.05. If the null hypothesis is NOT rejected, the feature is not significantly associated with the label, so you do not add it and you stop growing the tree at that node; if it is rejected, the split is statistically justified.
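As a small sketch of the procedure above (assuming Python with NumPy/SciPy, and entirely made-up contingency counts), the double sum, the dof correction, and the smallest-p-value selection could look like this:

```python
import numpy as np
from scipy.stats import chi2

def chi_square_p_value(counts):
    """Chi-square test of independence on an |F| x |Y| contingency table.

    counts[f, y] = number of training examples with feature value f and label y.
    """
    n = counts.sum()
    # Expected counts under the null hypothesis (feature independent of label).
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    # Double sum over F-values (rows) and Y-values (columns).
    stat = ((counts - expected) ** 2 / expected).sum()
    # Corrected degrees of freedom: (|F| - 1) x (|Y| - 1).
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return chi2.sf(stat, dof)  # p-value = P(chi^2_dof >= stat)

# Hypothetical tables for two candidate features (invented numbers).
tables = {
    "F1": np.array([[20, 30],            # |F1| = 2, |Y| = 2 -> dof = 1
                    [40, 10]]),
    "F2": np.array([[12, 13],            # |F2| = 3, |Y| = 2 -> dof = 2
                    [11, 14],
                    [27, 23]]),
}

p_values = {name: chi_square_p_value(t) for name, t in tables.items()}
best = min(p_values, key=p_values.get)   # feature with the smallest p-value
if p_values[best] < 0.05:
    print(f"split on {best} (null hypothesis rejected)")
else:
    print("stop: no feature is significantly associated with the label")
```

Note how F2 has three values and therefore gets dof=2: with its nearly uniform counts it is not significant, while F1's skewed counts give a very small p-value, so the tree would split on F1.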