I’m refreshing my data mining skills and thought it could be fun to do the Digit Recognizer competition over at Kaggle. Each datapoint in that dataset is an image of a hand-drawn digit from zero to nine, and our task is to determine which digit each image represents. The problem is that every pixel is a feature, which makes for a huge number of them. Machine learning algorithms really don’t like large feature spaces due to the risk of overfitting. I was going to apply some sort of feature reduction technique, and I remembered SVD being awesome for this type of job.
To my surprise the resources I found online didn’t really bring the task down to earth. I spent some time “getting it” again and thought it could make a worthy blog post. Remember, this isn’t meant to be rigorous in any way. It’s just an attempt at making a practical example. If you’re interested in understanding how SVD works I’d suggest any of the other brilliant resources out there.
I’ll be assuming some familiarity with matrix notation and R code. If you need a brush up on matrix notation I’ve written a post about it before.
Alright, let’s get started. We’ve got a dataset \(X \in \mathbb{R}^{m\times n}\) and labels \(\vec{y} \in \{0,1,…,9\}^{m}\) where \(m\) is the number of datapoints and \(n\) is the number of features. The datapoints are typically denoted as lower case \(\vec{x} \in \mathbb{R}^{n}\). For the database nuts, you can think of \(X\) as a table of \(m\) rows and \(n\) columns where \(\vec{x}\) is a row. Each datapoint \(\vec{x}\) has a corresponding label \(y\) taking a value from 0 to 9. These values can be found in \(\vec{y}\). What we want to do is make a good classifier \(f\) which learns the patterns of \(X\) such that when we give it a new datapoint \(\vec{z} \in \mathbb{R}^{n}\) with unknown label, it can estimate (i.e., guess) the value \(\hat{y} \in \{0,1,…,9\}\).
What we’ve got then is a classifier which does the following:
\(f(\vec{z}; X) = \hat{y}\)
Problem: There are too many features. We need to reduce \(n\) somehow. Singular value decomposition (SVD) can help us out. You can think of what it does as transforming the features of our datapoints into a new feature set ordered by decreasing importance. This last point is important because it means we can truncate the less important features. In fact, this is how many lossy image compression algorithms work: you remove data while retaining as much of the information as possible. What you get from performing SVD on your dataset is two matrices \(U \in \mathbb{R}^{m\times n}\) and \(V \in \mathbb{R}^{n\times n}\) in addition to a vector \(\vec{d} \in \mathbb{R}^{n}\). Typically this last vector gets represented as a diagonal matrix \(D = \mathrm{diag}(\vec{d}) \in \mathbb{R}^{n\times n}\), meaning each element of \(\vec{d}\) sits on the top-left to bottom-right diagonal of \(D\) and the rest of the entries are zero. These three matrices can be losslessly reverted to your initial dataset \(X\) using the following relationship:
\(X = UDV^\mathrm{T}\).
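To make this concrete, here’s a quick sanity check in R. The matrix below is just a random toy example of my own (not the digit data), but it shows how svd() hands you \(U\), \(\vec{d}\) and \(V\), and that multiplying them back together recovers the original matrix:

# Minimal sketch on a toy matrix: decompose, then verify lossless reconstruction.
set.seed(42)                        # For reproducibility of this illustration
X <- matrix(rnorm(20), nrow = 5)    # A toy 5x4 "dataset"
s <- svd(X)                         # s$u is U, s$d is the vector d, s$v is V
D <- diag(s$d)                      # Build the diagonal matrix D from d
all.equal(X, s$u %*% D %*% t(s$v))  # TRUE: X == U D V^T up to rounding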
What is cool about this is that we can remove columns from the right of \(U\) and \(V\), as well as the bottom-right column-row pairs of \(D\), and still maintain a good approximation of \(X\). One important caveat: \(X\) must be centered, i.e., the average of each feature must be zero. You do this by subtracting the average feature value from each respective element of each datapoint. I’m not going to represent this in matrix notation, but you can see how to go about doing it in the R code further below.

(Image: the matrices \(U\), \(D\) and \(V^\mathrm{T}\) with the retained parts in gray; a modified version of one found at semanticquery.com.)
In the above image, the gray part is what we retain. Also, don’t get confused: \(V\) is transposed, so the part of it which is retained looks like rows in the image.
Alright, let’s give these smaller matrices names:
\(\hat{X} = U_rD_rV_r^\mathrm{T}\)
Here \(\hat{X} \in \mathbb{R}^{m\times n}\) is an approximate version of our original dataset \(X\), while the subscript \(r\) indicates that the matrix has “reduced” size as described above. What we have now is \(U_r \in \mathbb{R}^{m\times k}\), \(D_r = \mathrm{diag}(d_1,\dots,d_k) \in \mathbb{R}^{k\times k}\) and \(V_r \in \mathbb{R}^{n\times k}\) where \(k \in \{1,\dots,n\}\), i.e., we are planning to retain \(k\) of the original \(n\) features.
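To see the truncation in action, here’s a small sketch continuing from the toy matrix X and the object s <- svd(X) above. It builds the rank-\(k\) approximation \(\hat{X}\) for every possible \(k\) and prints how far it is from \(X\); the error shrinks as \(k\) grows and is essentially zero at full rank:

# Rank-k approximation X_hat = U_r D_r V_r^T for each possible k.
for (k in 1:length(s$d)) {
  U_r <- s$u[, 1:k, drop = FALSE]
  D_r <- diag(s$d[1:k], nrow = k)   # nrow is needed so k = 1 still gives a matrix
  V_r <- s$v[, 1:k, drop = FALSE]
  X_hat <- U_r %*% D_r %*% t(V_r)
  cat("k =", k, " Frobenius error:", norm(X - X_hat, type = "F"), "\n")
}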
Alright, all of this seems fine and dandy, but where is the new dataset we are going to plug into our classifier? I asked myself the same question. Without fail, every post I found skipped this part, until I found this answer at stats.stackexchange.com. The answer is to use \(U_rD_r\), called the principal components or “scores”. What we end up with is
\(X_r = U_rD_r\)
Again, I use the subscript \(r\) to indicate a “reduced” matrix; in this case we now have \(X_r \in \mathbb{R}^{m\times k}\). That’s it! Case closed? No, not entirely. The following won’t work:
\(f(\vec{z}; X_r) = \hat{y}\)
This is because your unlabeled \(\vec{z} \in \mathbb{R}^{n}\) must have the same number of features as \(X_r\), and it very much doesn’t: \(X_r\) has \(k\) features, while our unlabeled datapoint \(\vec{z}\) still has \(n\) features. We need to transform the unlabeled datapoints we want to classify. Let’s say we have a bunch (\(l\)) of unlabeled datapoints in a dataset \(Z \in \mathbb{R}^{l\times n}\) (centered using the feature averages from \(X\)); then we can transform it using the \(V_r\) matrix as follows:
\(Z_r = ZV_r\)
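Why does multiplying by \(V_r\) land \(Z\) in the same space as \(X_r\)? Because the columns of \(V\) are orthonormal, we have \(XV_r = UDV^\mathrm{T}V_r = U_rD_r = X_r\), so projecting with \(V_r\) gives exactly the scores we used for the training set. You can convince yourself in R (continuing from the toy sketches above):

# Sanity check: projecting X onto V_r reproduces the scores U_r D_r.
k   <- 2
U_r <- s$u[, 1:k]
D_r <- diag(s$d[1:k])
V_r <- s$v[, 1:k]
all.equal(X %*% V_r, U_r %*% D_r)  # TRUE up to floating point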
If you’re streaming datapoints and need to do predictions one-by-one you can do the same as follows:
\(\vec{z}_r^\mathrm{T} = \vec{z}^\mathrm{T}V_r\)
or the equivalent
\(\vec{z}_r = V_r^\mathrm{T}\vec{z}\)
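In R that single-datapoint version could look like this (a sketch reusing V_r from the check above; remember the new point must be centered with the training-set feature averages first):

# Transform one unlabeled datapoint at a time.
z   <- X[1, ]                    # Pretend this is a new, already centered datapoint
z_r <- as.vector(t(V_r) %*% z)   # k reduced features; same as drop(z %*% V_r)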
Thus, we have what we need to do predictions with the feature-reduced dataset:
\(f(\vec{z}_r; X_r) = \hat{y}\)
I did say we were doing this in R, and I’ll stick to this promise. The following outlines how to do it on dummy data.
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }

X_all_size = 50
n = 6             # Original number of features
k = 2             # Reduced number of features
split_ratio = 0.8

# Generate some dummy data
X_all <- hilbert(X_all_size)[, 1:n]

# Split data into training and test sets
idx <- sample(seq_len(nrow(X_all)), size = floor(split_ratio * nrow(X_all)))
X <- X_all[idx, ]  # Training set
Z <- X_all[-idx, ] # Test set

# Center the features using the average of features in our training dataset X.
x_bar <- colMeans(X)                        # Average of each feature
X <- sweep(X, MARGIN = 2, x_bar, FUN = "-") # Centering features of X using averages from X.
Z <- sweep(Z, MARGIN = 2, x_bar, FUN = "-") # Centering features of Z using averages from X.
# It's important to use the feature averages of X and not of the test set Z.
# You don't have access to Z when doing one-by-one predictions.

# Perform singular value decomposition
s <- svd(X)

# Reduce number of features from n=6 to k=2
U_r <- s$u[, 1:k]
D_r <- diag(s$d[1:k])
V_r <- s$v[, 1:k]

# Transform to a feature-reduced version of the training set X and test set Z
X_r <- U_r %*% D_r
Z_r <- Z %*% V_r

# Then comes use of the new feature set for Z and X.
# Y_hat = f(Z_r; X_r) # Outputs predictions

# Finally, you can transform back to approximations of the original training set X and test set Z
X_hat <- X_r %*% t(V_r) # == U_r %*% D_r %*% t(V_r)
Z_hat <- Z_r %*% t(V_r)
# This last part isn't necessary.
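And just to show what the \(f(\vec{z}_r; X_r)\) step might look like: below is a sketch using k-nearest neighbours from the class package purely as a stand-in classifier, with a made-up label vector y for the dummy rows (in the Digit Recognizer case this would be the actual digit labels):

library(class)  # Provides knn()

# Made-up labels for the dummy training rows, just so the call is runnable.
y <- factor(sample(0:9, nrow(X_r), replace = TRUE))

# Classify the reduced test set Z_r using the reduced training set X_r.
# Note: the k argument here is the number of neighbours, unrelated to our k features.
Y_hat <- knn(train = X_r, test = Z_r, cl = y, k = 5)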
Later I will follow up with a post on how to do it using the Digit Recognizer dataset from Kaggle.