Monday, May 24, 2010

Q: How can I perform sphering when the covariance matrix is degenerate?

In this case you can use an SVD. Suppose the (centered) data-cases are stored as the columns of a d x N matrix X. For instance, in MATLAB do the following:
[U,S,V] = svd(X,'econ'). This will give you matrices U,S,V such that X=U*S*V', with S square and diagonal.

Now define Y=sqrt(N)*inv(S)*U'*X; then Y is sphered. When the covariance is degenerate, some of the singular values in S are zero, so instead of inv(S) you invert only the nonzero singular values and keep the corresponding columns of U; this spheres the data within the subspace it actually spans.

Proof:
Y*Y'/N
= sqrt(N)*inv(S)*U'*X*X'*U*inv(S)*sqrt(N)/N
= inv(S)*U'*U*S*V'*V*S*U'*U*inv(S)
= inv(S)*S*S*inv(S)
= I !!

(using U'*U = I and V'*V = I)
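
To make the recipe concrete, here is a minimal MATLAB sketch along the lines above. It assumes the data-cases are stored as the columns of a d x N matrix X; the variable names and the tolerance used to decide which singular values count as zero are illustrative:

% center the data (sphering is defined on zero-mean data)
X = bsxfun(@minus, X, mean(X, 2));
N = size(X, 2);
[U, S, V] = svd(X, 'econ');          % economy SVD so that S is square
s = diag(S);
r = sum(s > 1e-10 * max(s));         % number of non-negligible singular values
Ur = U(:, 1:r);                      % basis of the subspace the data actually spans
Sr = S(1:r, 1:r);
Y = sqrt(N) * (Sr \ (Ur' * X));      % sphered data: Y*Y'/N is the r x r identity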

Tuesday, May 11, 2010

How does PCA distinguish different classes?

Well, it doesn't. PCA simply projects the data into a lower dimensional space using a linear projection. It completely ignores the labels of the data-cases (if there are any) in doing so; instead it finds the directions of highest variance. It is an example of unsupervised learning. Kernel-PCA is more powerful in the sense that it can deal with nonlinear projections. Even better are the more recent nonlinear embedding methods such as isomap, LLE, maximum variance unfolding, etc.
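
For completeness, here is a small MATLAB sketch of plain PCA in the same notation as above (data-cases as the columns of a d x N matrix X; the target dimensionality p is an illustrative choice). Note that class labels never enter the computation:

Xc = bsxfun(@minus, X, mean(X, 2));   % center the data; labels are never used
[U, S, V] = svd(Xc, 'econ');          % columns of U are the directions of highest variance
p = 2;                                % illustrative target dimensionality
Z = U(:, 1:p)' * Xc;                  % p x N linear projection of the data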

If you want to project in such a way that data-cases of different classes are maximally separated in the projection, then another technique is available: Fisher Linear Discriminant Analysis (FLDA). Kernelised versions of FLDA are available as well.
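
As a hedged sketch of what two-class FLDA computes, assuming the data-cases of the two classes are stored as the columns of matrices X1 and X2 (these names are illustrative): it finds the direction w that maximally separates the class means relative to the within-class scatter.

m1 = mean(X1, 2);  m2 = mean(X2, 2);   % class means
D1 = bsxfun(@minus, X1, m1);
D2 = bsxfun(@minus, X2, m2);
Sw = D1*D1' + D2*D2';                  % within-class scatter matrix
w  = Sw \ (m1 - m2);                   % Fisher direction (defined up to scale)
proj1 = w' * X1;  proj2 = w' * X2;     % 1-D projections separating the two classes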

Monday, May 3, 2010

How do you choose the number of clusters for k-means?

In k-means, one usually assumes a given number of clusters. However, it is obviously very important to have some idea of how many clusters the data supports. The trouble is that we cannot simply try k=1,2,3,... and pick the value of k which minimizes the k-means cost. Reason: higher k-values will always result in lower costs (ignoring local minima for the moment). Thus we need to use another evaluation criterion. Cross-validation is difficult here because there is no supervised task such as classification, and no probabilistic model with which to measure the probability of held-out validation data.

However, we can use the compression criterion discussed in class. Too many clusters would require us to send too many parameters (the centroids of the clusters), while too few would mean that the corrections relative to the nearest centroid become so large that we are essentially sending the raw data. So we expect some optimal value of k, depending on the data and in particular on the size of the dataset (more data supports higher values of k).

Still, it's a little cumbersome to figure out exactly what the compression rate is for a particular k-means solution. As a shortcut one can use the following MDL ("minimum description length") criterion:

MDL = Cost + 1/2 * (# parameters) * log(N)

For k-means we have d*k parameters because each of the k means is specified by d real values (one per dimension). Note that this term grows with k, while the k-means Cost is expected to fall with k, so there is some optimal value of k. Moreover, "Cost" scales linearly with N while the penalty term grows only logarithmically in N. This means that bigger datasets will support higher values of k.
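
Putting this together, a minimal sketch of choosing k by this MDL criterion might look as follows in MATLAB. It assumes X is an N x d matrix with data-cases as rows, that the Statistics Toolbox kmeans is available, and that kmax is an illustrative upper bound on k:

[N, d] = size(X);
kmax = 10;
mdl = zeros(kmax, 1);
for k = 1:kmax
    [idx, C, sumd] = kmeans(X, k, 'Replicates', 5);   % restarts to reduce the effect of local minima
    cost = sum(sumd);                                 % k-means cost: total squared distance to assigned centroids
    mdl(k) = cost + 0.5 * (d * k) * log(N);           % MDL = Cost + 1/2 * (# parameters) * log(N)
end
[~, kbest] = min(mdl);                                % value of k that minimizes the MDL criterion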