In k-means, one usually assumes a given number of clusters. However, it is clearly important to have some idea of how many clusters the data supports. The trouble is that we cannot simply try k=1,2,3,... and pick the value of k that minimizes the k-means cost: higher values of k will always result in lower cost (ignoring local minima for the moment). Thus we need another evaluation criterion. Cross-validation is difficult here because we have no task such as classification and no probabilistic model with which to measure the probability of held-out validation data.
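We can see this effect directly. The following sketch (a minimal pure-Python Lloyd's algorithm on an assumed synthetic dataset of three Gaussian blobs; the function names and data are illustrative, not from any particular library) shows the k-means cost falling as k grows:

```python
import random

def kmeans_cost(points, k, iters=50, seed=0):
    """One run of Lloyd's algorithm; returns the final sum of squared
    distances from each point to its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

# synthetic 2-D data: three well-separated blobs of 50 points each
rng = random.Random(1)
data = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in [(0, 0), (4, 0), (0, 4)] for _ in range(50)]

# best of five random restarts for each k: cost keeps falling as k grows,
# even past the "true" number of clusters (3)
costs = [min(kmeans_cost(data, k, seed=s) for s in range(5)) for k in range(1, 7)]
```

Note that the cost still drops from k=3 to k=6 even though the data only has three clusters, which is exactly why minimizing cost alone cannot select k.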
However, we can use the compression criterion discussed in class. Too many clusters would require us to send too many parameters (the cluster centroids), while too few would mean sending the data points themselves rather than small corrections relative to the nearest mean. So we expect some optimal value of k, depending on the data and in particular on the size of the dataset (more data supports higher values of k).
Still, it is a little cumbersome to figure out exactly what the compression rate is for a particular k-means solution. As a shortcut, one can derive the following MDL ("minimum description length") criterion:
MDL = Cost + (1/2) * (# parameters) * log(N)
For k-means we have d*k parameters, since each of the k means is a point in d dimensions. Note that this penalty term grows with k, while the k-means Cost is expected to fall with k, so there is some optimal value of k. Moreover, Cost scales linearly with N while the penalty term grows only logarithmically in N. This is why bigger datasets support higher values of k.
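As a sketch of how this selection might look in practice, the code below computes the MDL score over a range of k on an assumed three-blob dataset and picks the k with the lowest score. Here "Cost" is taken to be the k-means sum-of-squared-errors, and the k-means++-style seeding, helper names, and data are all illustrative assumptions, not a prescribed implementation:

```python
import math
import random

def kmeans_cost(points, k, iters=50, seed=0):
    """Lloyd's algorithm with k-means++-style seeding; returns the final cost."""
    rng = random.Random(seed)
    # k-means++-style seeding: pick initial centroids spread out,
    # each new one with probability proportional to squared distance
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
        else:
            centroids.append(points[-1])  # guard against float round-off
    for _ in range(iters):
        # assignment step, then update step
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

# synthetic 2-D data: three blobs of 50 points each
rng = random.Random(1)
data = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in [(0, 0), (4, 0), (0, 4)] for _ in range(50)]
N, d = len(data), 2

def mdl_score(k, restarts=5):
    # MDL = Cost + (1/2) * (d*k parameters) * log(N)
    cost = min(kmeans_cost(data, k, seed=s) for s in range(restarts))
    return cost + 0.5 * d * k * math.log(N)

mdls = {k: mdl_score(k) for k in range(1, 7)}
best_k = min(mdls, key=mdls.get)
```

For this dataset the cost term dominates at small k (too few centroids) and the penalty term dominates at large k, so the minimum lands at an intermediate value of k, as the argument above predicts.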