Methods such as kNN and Parzen-Windowing are called nonparametric. There are also nonparametric *Bayesian* methods such as Gaussian processes and Dirichlet processes. Ironically, these methods do have parameters that can be tuned. However, they distinguish themselves from parametric methods in the sense that the complexity of their hypothesis space (roughly, the number of essentially different hypotheses) grows as the number of data instances grow. This means that these models can not be compressed into a fixed set of parameters.

What about kernel machines? On the one hand, an SVM with a linear kernel just entertains a linear separation boundary. This must therefore be parametric. The primal formulation makes this clear. On the other hand, if we use kernels based on an infinite set of features, then kernel machines can grow their complexity based on the number of data instances (the situation is slightly complicated due to the fact that kernel methods need regularization terms.) This can intuitively be seen from the fact that prediction in this case looks a lot like NN: f(t) = sum_{SV:i} a_i K(t,x_i) y_i with K(.,.) the kernel and y_i the label for train-case i and a_i is the weight for support vector i obtained from solving the dual SVM optimization problem.

Now what if we have a kernel based on M > N features but we grow N to be bigger than M? We start of with something that behaves non-parametrically but then as N > M the modeling capacity is reached and the model behaves parametrically. So perhaps the boundary between parametric and non-parametric methods is not all that clear-cut after all.

## Thursday, September 29, 2011

## Monday, May 24, 2010

### Q: How can I perform sphereing when the covariance matrix is degenerate?

In this case you can use an SVD. For instance, in matlab do the following:

[U,S,V] = svd(X). This will give you matrices U,S,V such that X=U*S*V'.

Now define Y=sqrt(N)*inv(S)*U'*X, then Y is sphered.

Proof:

Y*Y'/N =

sqrt(N)*inv(S)*U'*X*X'*U*inv(S)*sqrt(N)/N =

inv(S)*U'*U*S*V'*V*S*U"*U*inv(S)=

I !!

[U,S,V] = svd(X). This will give you matrices U,S,V such that X=U*S*V'.

Now define Y=sqrt(N)*inv(S)*U'*X, then Y is sphered.

Proof:

Y*Y'/N =

sqrt(N)*inv(S)*U'*X*X'*U*inv(S)*sqrt(N)/N =

inv(S)*U'*U*S*V'*V*S*U"*U*inv(S)=

I !!

## Tuesday, May 11, 2010

### How does PCA distinguish different classes?

Well, it doesn't. PCA simply projects the data into a lower dimensional space using a linear projection. It completely ignores the labels of the data-cases (if there are any) in doing so. Instead it finds the directions of highest variance. It is an example of unsupervised learning. Kernel-PCA is more powerful in the sense that it can deal with nonlinear projections. Even better are the more recent nonlinear embedding methods such as isomap, LLE, maximum variance unfolding etc.

If you want to project in such a way that data-cases of different classes are maximally separated in the projection then another technique is available: Fisher Linear Discriminant Analysis (FLDA). Also in this case, kernelised versions are available.

If you want to project in such a way that data-cases of different classes are maximally separated in the projection then another technique is available: Fisher Linear Discriminant Analysis (FLDA). Also in this case, kernelised versions are available.

## Monday, May 3, 2010

### How do you choose the number of clusters for k-means?

In k-means, one usually assumes a given number of clusters. However, it is obviously very important to have some idea about how many clusters the data supports. The trouble is that we cannot simply try k=1,2,3,... and find the value of k which minimizes the k-means cost. Reason: higher k-values will always result in lower costs (ignoring local minima for the moment). Thus we need to use an other evaluation criterion. Cross-validation is a little hard because we have no task such as classification and no probabilistic model to measure the probability of validation data.

However, we can use the compression criterion discussed in class. Too many clusters would require us to send too many parameters (the centroids of the clusters), but too few would mean we have to send all the data instead of corrections to the mean. So we expect some optimal value of k, depending on the data and in particular on the size of the dataset (more data supports higher values of k).

Still it's a little cumbersome to figure out exactly what the compression rate is for a particular k-means solution. As a shortcut one can derive this MDL criterion ("minimum description length"):

MDL = Cost + 1/2 * (# parameters) * log(N)

For k-means we have d*k parameters because we have d real values in d dimensions for each mean. Note that this term grows with k, while the k-means Cost is expected to fall with k, so there is some optimal value of k. Moreover, "Cost" scales linearly with N while the extra penalty term grows logarithmic in N. This means that bigger datasets will support higher values of k.

However, we can use the compression criterion discussed in class. Too many clusters would require us to send too many parameters (the centroids of the clusters), but too few would mean we have to send all the data instead of corrections to the mean. So we expect some optimal value of k, depending on the data and in particular on the size of the dataset (more data supports higher values of k).

Still it's a little cumbersome to figure out exactly what the compression rate is for a particular k-means solution. As a shortcut one can derive this MDL criterion ("minimum description length"):

MDL = Cost + 1/2 * (# parameters) * log(N)

For k-means we have d*k parameters because we have d real values in d dimensions for each mean. Note that this term grows with k, while the k-means Cost is expected to fall with k, so there is some optimal value of k. Moreover, "Cost" scales linearly with N while the extra penalty term grows logarithmic in N. This means that bigger datasets will support higher values of k.

## Tuesday, April 20, 2010

### More On Chi-Square Tests for Decision Trees

There were some additional questions about using Chi-square tests for decision trees. I have found this excellent tutorial on the web: http://www.ics.uci.edu/~welling/teaching/273ASpring10/recitation4_decision_tree.pdf.

Some important point to keep in mind:

In class we only looked at one value out of two for the attribute, but in case there are more values it is easiest to simply compute the chi-square statistic for all values of the label, Y, and for all values of the attribute F. So, the statistic is now a double sum, one over Y-values and another one over possible F-values. It is important that you correct for this by selecting the correct number of degrees of freedom for the chi-square test, which is now given by: dof=(|F|-1)x(|Y|-1). For Y=2 and F=2, as before, we have dof=1. However, if one feature has many more F-values, it will need more dofs in the chi-square test and hence it will be automatically more penalized for having many choices. So, after you compute chi^2 using this double sum, you first check which feature has the smallest p-value. Then for the feature with the smallest p-value you ask if the null hypothesis is rejected (are the observed counts significantly different than what can be expected from random fluctuations around the expected values?). One usually rejects the null hypothesis for p<0.05. When you reject, you do not add any feature to the tree.

Some important point to keep in mind:

In class we only looked at one value out of two for the attribute, but in case there are more values it is easiest to simply compute the chi-square statistic for all values of the label, Y, and for all values of the attribute F. So, the statistic is now a double sum, one over Y-values and another one over possible F-values. It is important that you correct for this by selecting the correct number of degrees of freedom for the chi-square test, which is now given by: dof=(|F|-1)x(|Y|-1). For Y=2 and F=2, as before, we have dof=1. However, if one feature has many more F-values, it will need more dofs in the chi-square test and hence it will be automatically more penalized for having many choices. So, after you compute chi^2 using this double sum, you first check which feature has the smallest p-value. Then for the feature with the smallest p-value you ask if the null hypothesis is rejected (are the observed counts significantly different than what can be expected from random fluctuations around the expected values?). One usually rejects the null hypothesis for p<0.05. When you reject, you do not add any feature to the tree.

## Tuesday, April 13, 2010

### Information Gain Ratio

Q: why information gain ratio?

A: First let's see what it represents. The numerator is the information we learn about the label. The denominator however, represents the information we learn about the attribute (feature). Or in other words, the information necessary to specify the feature value. The idea is that features with very many possible states have an unfair advantage because in the extreme case, every data example will fall in a separate feature bin and we seemingly learn a lot about the label. So by dividing by the number of bits that we are learning about the feature we compensate in the right direction: features with high entropy will be divided by a larger number and will hence be penalized more.

However....ultimately, it is a big hack. Although the penalty is in the correct direction, the size of the effect cannot be derived in any principled way. Assume we have an infinite number of data-cases. We should then not have to worry about the entropy of the feature itself: we really care about how much we learn about the label. The issue comes into play for small data-set sizes. The reason is that for a small number of examples it is much easier to drop into separate feature bins if the number of bins is large, and so it looks like we are learning about the label while in reality it was just a random fluctuation. So, it is a clean case of overfitting. As we know, overfitting is a function of the number of examples in the training set. But the gain ration does not depend on the number of training examples! Conclusion: it's a hack and you need to be careful with it.

It would be much better to think of this problem in the context of overfitting directly. We not only want to ask ourselves which of the features reveals more about the label (if we could extrapolate to an infinite number of data-cases) but we need to ask ourselves if adding a feature is a good idea at all (i.e. would lead to overfitting). For this one can try to compute the probability that the feature does not reveal any information at all (i.e. an equal number of examples of both classes in each feature bin when the number of examples is taken to be very large). The observed information gain must then be a small sample effect. Note that this effect is much stronger for features with many states because it will be much easier to explain the computed information gain by a random fluctuation. Let's assume for a moment that we can compute:

P: the probability that the observed information gain is real (i.e. not a random fluke).

I would propose the following measure to be a more reasonable test between features:

Expected Information-Gain = P * Information-Gain

Note that P=P(n) is a function of the number of data-cases and for n=infinity we have P=1: any observed information is real. However when n is small P will be close to 0 and the expected information gain will be small. If we simply require that the expected information gain is larger than a some threshold (which we perhaps choose by cross-validation) then we have a principled measure to pick the next best feature (or not not pick any feature at all).

A: First let's see what it represents. The numerator is the information we learn about the label. The denominator however, represents the information we learn about the attribute (feature). Or in other words, the information necessary to specify the feature value. The idea is that features with very many possible states have an unfair advantage because in the extreme case, every data example will fall in a separate feature bin and we seemingly learn a lot about the label. So by dividing by the number of bits that we are learning about the feature we compensate in the right direction: features with high entropy will be divided by a larger number and will hence be penalized more.

However....ultimately, it is a big hack. Although the penalty is in the correct direction, the size of the effect cannot be derived in any principled way. Assume we have an infinite number of data-cases. We should then not have to worry about the entropy of the feature itself: we really care about how much we learn about the label. The issue comes into play for small data-set sizes. The reason is that for a small number of examples it is much easier to drop into separate feature bins if the number of bins is large, and so it looks like we are learning about the label while in reality it was just a random fluctuation. So, it is a clean case of overfitting. As we know, overfitting is a function of the number of examples in the training set. But the gain ration does not depend on the number of training examples! Conclusion: it's a hack and you need to be careful with it.

It would be much better to think of this problem in the context of overfitting directly. We not only want to ask ourselves which of the features reveals more about the label (if we could extrapolate to an infinite number of data-cases) but we need to ask ourselves if adding a feature is a good idea at all (i.e. would lead to overfitting). For this one can try to compute the probability that the feature does not reveal any information at all (i.e. an equal number of examples of both classes in each feature bin when the number of examples is taken to be very large). The observed information gain must then be a small sample effect. Note that this effect is much stronger for features with many states because it will be much easier to explain the computed information gain by a random fluctuation. Let's assume for a moment that we can compute:

P: the probability that the observed information gain is real (i.e. not a random fluke).

I would propose the following measure to be a more reasonable test between features:

Expected Information-Gain = P * Information-Gain

Note that P=P(n) is a function of the number of data-cases and for n=infinity we have P=1: any observed information is real. However when n is small P will be close to 0 and the expected information gain will be small. If we simply require that the expected information gain is larger than a some threshold (which we perhaps choose by cross-validation) then we have a principled measure to pick the next best feature (or not not pick any feature at all).

## Thursday, April 1, 2010

### Transductive Learning

Q: I've come across transductive learning while I was reading about the classification task in ML. At first I thought it was just another name for semi-supervised learning, but then some people claim that the key difference is that in transductive learning the point is not to construct a classification function f, but rather to "transfer" the information from labeled examples to unlabeled. Can you elaborate more on transductive learning please? Is it really different from semi-supervised learning? In which settings is it proper?

A: What I understand about transductive learning is that you are given beforehand the data-cases on which you are going to be tested. If we denote by Xtrn the training features and Ytrn the training labels, then supervised learning learns a function Y=F(X) assuming perhaps that the test cases will be sampled from the same distribution as the training examples: [Xtst,Ytrn]~P, [Xtrn,Ytst]~P. However, in transductive learning you know more: you are given Xtst explicitly. This extra information can be used to improve performance on just those Xtst. Semi-supervised learning uses both labeled and unlabeled training data to build a classifier on any test data. So the difference lies in what you will be tested on. For a more in-depth discussion please read: http://www.kyb.tuebingen.mpg.de/ssl-book/discussion.pdf

A: What I understand about transductive learning is that you are given beforehand the data-cases on which you are going to be tested. If we denote by Xtrn the training features and Ytrn the training labels, then supervised learning learns a function Y=F(X) assuming perhaps that the test cases will be sampled from the same distribution as the training examples: [Xtst,Ytrn]~P, [Xtrn,Ytst]~P. However, in transductive learning you know more: you are given Xtst explicitly. This extra information can be used to improve performance on just those Xtst. Semi-supervised learning uses both labeled and unlabeled training data to build a classifier on any test data. So the difference lies in what you will be tested on. For a more in-depth discussion please read: http://www.kyb.tuebingen.mpg.de/ssl-book/discussion.pdf

Subscribe to:
Posts (Atom)