Large-Scale Supervised Sparse Principal Component Analysis

Sparse PCA is a variant of classical PCA that assumes sparsity in the feature space. It has several advantages: the resulting components are easy to interpret, and the method works for very high-dimensional data. Its main drawback is that it is computationally expensive.

Contents

  • 1 Introduction
  • 2 Primal problem
  • 3 Block Coordinate Ascent Algorithm
  • 4 Numerical examples
  • 5 Conclusion
  • 6 Discussions

Introduction

Sparse PCA is a variant of classical PCA that assumes sparsity in the feature space. Its advantages are that the resulting components are easy to interpret and that the method works for very high-dimensional data; its main drawback is computational cost. Many algorithms have been proposed to solve the sparse PCA problem, and the authors introduce a fast block coordinate ascent algorithm with much better computational complexity.

1 Drawbacks of existing techniques

Existing techniques include ad-hoc methods (e.g., factor rotation techniques and simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, and the generalized power method. These methods are based on non-convex optimization and do not guarantee a global optimum.

A semi-definite relaxation method called DSPCA guarantees convergence to a global optimum and outperforms the algorithms above; however, it is computationally expensive.

2 Contribution of this paper

This paper solves DSPCA in a computationally cheaper way, making it suitable for large-scale data sets. It applies a block coordinate ascent algorithm with computational complexity O(\hat{n}^3), where \hat{n} is the intrinsic dimension of the data. Since \hat{n} can be very small compared to the dimension n of the data, the algorithm is computationally cheap.

Primal problem

The sparse PCA problem can be formulated as \max_x \; x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2 = 1, where \lambda \geq 0 is a penalty parameter controlling sparsity.

Setting Z = xx^T, this is equivalent to \max_Z \; Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, \; Tr(Z) = 1, \; Rank(Z) = 1.
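The lifting Z = xx^T behind this reformulation is easy to check numerically. The sketch below (plain NumPy, with an arbitrary covariance-like \Sigma chosen only for illustration) verifies that for a unit-norm x the two objectives agree term by term: Tr(\Sigma Z) = x^T \Sigma x, \sqrt{\| Z \|_0} = \| x \|_0, and Tr(Z) = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = rng.standard_normal((5, 5))
Sigma = Sigma @ Sigma.T          # a covariance-like PSD matrix (illustrative)

x = np.array([0.6, 0.0, 0.8, 0.0, 0.0])  # unit-norm, 2 non-zero entries
Z = np.outer(x, x)                        # rank-one lifting, Tr(Z) = ||x||_2^2 = 1

obj_x = x @ Sigma @ x                     # x^T Sigma x
obj_Z = np.trace(Sigma @ Z)               # Tr(Sigma Z)
card_x = np.count_nonzero(x)              # ||x||_0
card_Z = np.sqrt(np.count_nonzero(Z))     # sqrt(||Z||_0): the 2x2 non-zero block has 4 entries

print(obj_x, obj_Z, card_x, card_Z, np.trace(Z))
```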

Replacing \sqrt{\| Z \|_0} with \| Z \|_1 and dropping the rank constraint gives a relaxation of the original non-convex problem (by the Cauchy–Schwarz inequality, \| Z \|_1 \leq \sqrt{\| Z \|_0} \, \| Z \|_F, and \| Z \|_F = \| x \|_2^2 = 1 when Z = xx^T):

\phi = \max_Z \; Tr(\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0, \; Tr(Z) = 1 \qquad (1).

This relaxation turns the original non-convex problem into a convex one, which can be solved to global optimality.

The paper relies on the following theorem:

Theorem (2.1). Let \Sigma = A^T A, where A = (a_1, a_2, \ldots, a_n) \in {\mathbb R}^{m \times n}. Then \phi = \max_{\| \xi \|_2 = 1} \sum_{i=1}^{n} ((a_i^T \xi)^2 - \lambda)_+. An optimal non-zero pattern corresponds to the indices i with (a_i^T \xi)^2 > \lambda at optimum.
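A quick sanity check of this variational form: for \lambda = 0 the sum reduces to \sum_i (a_i^T \xi)^2 = \xi^T \Sigma \xi, whose maximum over unit \xi is the largest eigenvalue of \Sigma. The sketch below (my own illustration, not from the paper) brute-forces the maximization over unit vectors in {\mathbb R}^2 on an angular grid and compares against the eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 4))   # columns a_i; xi lives in R^2 so a grid search is feasible
Sigma = A.T @ A
lam = 0.0                          # lambda = 0: the form should recover the top eigenvalue

thetas = np.linspace(0.0, np.pi, 100_001)          # directions in R^2 (sign of xi is irrelevant)
xis = np.stack([np.cos(thetas), np.sin(thetas)])   # unit vectors, shape (2, grid size)
scores = np.maximum((A.T @ xis) ** 2 - lam, 0.0).sum(axis=0)  # sum_i ((a_i^T xi)^2 - lam)_+
psi = scores.max()

lam_max = np.linalg.eigvalsh(Sigma).max()
print(psi, lam_max)  # the two values should agree up to grid resolution
```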

An important observation is that the i-th feature is absent at optimum if (a_i^T \xi)^2 \leq \lambda for every \xi with \| \xi \|_2 = 1. Since (a_i^T \xi)^2 \leq \| a_i \|_2^2 by the Cauchy–Schwarz inequality, any feature i with \Sigma_{ii} = a_i^T a_i \leq \lambda can be safely removed.
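This safe feature elimination rule is the source of the small intrinsic dimension \hat{n} and is simple to implement. Below is a minimal sketch (the function name and data are my own, not the paper's code) that drops every feature whose variance \Sigma_{ii} is at most \lambda, without ever forming the full covariance matrix.

```python
import numpy as np

def safe_feature_elimination(A, lam):
    """Drop feature i whenever Sigma_ii = a_i^T a_i <= lam.

    By Cauchy-Schwarz, (a_i^T xi)^2 <= ||a_i||_2^2 for every unit xi,
    so such features can never enter the optimal support.
    """
    sigma_diag = np.sum(A * A, axis=0)      # diag of Sigma = A^T A, computed column-wise
    keep = np.flatnonzero(sigma_diag > lam)  # surviving feature indices
    return A[:, keep], keep

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 8))
A[:, [2, 5]] *= 0.01                         # two artificially low-variance features
A_red, keep = safe_feature_elimination(A, lam=1.0)
print(keep)                                  # features 2 and 5 are eliminated
```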

Block Coordinate Ascent Algorithm

The authors apply a row-by-row algorithm designed for problems of the form \min_X \; f(X) - \beta \log \det X : L \leq X \leq U, \; X \succ 0.

Problem (1) can be written as \frac{1}{2} \phi^2 = \max_X \; Tr(\Sigma X) - \lambda \| X \|_1 - \frac{1}{2} (Tr X)^2 : X \succeq 0 \qquad (2).

In order to apply the row-by-row algorithm, we add a term \beta \log \det X to (2), where \beta > 0 is a penalty parameter.

That is, we address the problem \max_X \; Tr(\Sigma X) - \lambda \| X \|_1 - \frac{1}{2} (Tr X)^2 + \beta \log \det X : X \succeq 0 \qquad (3).

By partitioning X into one diagonal element x, the corresponding column y, and the remaining block Y (with (\sigma, s) the matching partition of \Sigma and t = Tr(Y)), we obtain the sub-problem:

\phi = \max_{x,y} \; 2(y^T s - \lambda \| y \|_1) + (\sigma - \lambda) x - \frac{1}{2} (t + x)^2 + \beta \log (x - y^T Y^{\dagger} y) : y \in R(Y) \qquad (4).

Taking the dual of (4), the sub-problem simplifies to

\phi' = \min_{u,z} \; \frac{1}{\beta z} u^T Y u - \beta \log z + \frac{1}{2} (c + \beta z)^2 : z > 0, \; \| u - s \|_\infty \leq \lambda.

Since \beta is very small and we want to avoid large values of z, we change variables to r = \beta z; the optimization problem then becomes

\phi' - \beta \log \beta = \min_{u,r} \; \frac{1}{r} u^T Y u - \beta \log r + \frac{1}{2} (c + r)^2 : r > 0, \; \| u - s \|_\infty \leq \lambda \qquad (5).

We can solve sub-problem (5) by first solving the box-constrained QP

R^2 := \min_u \; u^T Y u : \| u - s \|_\infty \leq \lambda

and then setting r by solving

\min_{r>0} \; \frac{R^2}{r} - \beta \log r + \frac{1}{2} (c + r)^2.

Once the sub-problem is solved, we recover the primal variables by setting y = \frac{1}{r} Y u, and for the diagonal element x we have x = c + r = \sigma - \lambda - t + r.
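The two-step scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the box-constrained QP is solved here by projected gradient descent and the scalar problem in r by bisection on its (monotone) derivative, both of which are my own choices of solver; Y, s, c, \lambda, \beta are the quantities from the partition above.

```python
import numpy as np

def solve_subproblem(Y, s, c, lam, beta, iters=500):
    """Sketch of sub-problem (5): box-constrained QP in u, then a 1-D problem in r."""
    # Step 1: R^2 = min_u u^T Y u  s.t. ||u - s||_inf <= lam,
    # via projected gradient descent (Y is PSD, so the problem is convex).
    L = max(np.linalg.eigvalsh(Y).max(), 1e-12)   # curvature bound for the step size
    u = s.copy()
    for _ in range(iters):
        u = u - (1.0 / L) * (Y @ u)               # gradient step (the factor 2 of grad=2Yu is folded in)
        u = np.clip(u, s - lam, s + lam)          # project onto the box ||u - s||_inf <= lam
    R2 = u @ Y @ u
    # Step 2: min_{r>0} R^2/r - beta*log r + (1/2)(c + r)^2.
    # Its derivative g(r) = -R^2/r^2 - beta/r + (c + r) is strictly increasing,
    # so the unique stationary point can be bracketed and found by bisection.
    g = lambda r: -R2 / r**2 - beta / r + (c + r)
    lo, hi = 1e-12, 1.0
    while g(hi) < 0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    return u, r

# Tiny usage example with hand-checkable numbers (Y = I makes the QP a projection).
Y = np.eye(3)
s = np.array([2.0, -1.0, 0.5])
u, r = solve_subproblem(Y, s, c=1.0, lam=0.5, beta=0.01)
y = Y @ u / r            # primal recovery y = (1/r) Y u, as in the text
print(u, r)
```

With Y = I the QP minimizer is simply each s_i clipped toward zero by \lambda, which makes the output easy to verify by hand.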

Here is the algorithm:

[Figure: pseudocode of the block coordinate ascent algorithm]

Convergence and complexity

1. The algorithm is guaranteed to converge to the global optimizer.

2. The complexity of the algorithm is O(K \hat{n}^3), where K is the number of sweeps through the columns (fixed, typically K = 5) and \hat{n} is the intrinsic dimension of the data points.

Numerical examples

The algorithm is applied to two large text data sets. The NYTimes news-article data set contains 300,000 articles and a dictionary of 102,660 unique words (1 GB file size), and the PubMed data set has 8,200,000 abstracts with 141,043 unique words (7.8 GB file size). Both are too large for classical PCA.

The authors set the target cardinality of each principal component to 5. They report that it takes the algorithm only about 20 seconds to search for a range of \lambda yielding this target cardinality. After safe feature elimination, the block coordinate ascent algorithm works on a covariance matrix of order at most n = 500 (instead of 102,660) for the NYTimes data and n = 1000 (instead of 141,043) for the PubMed data.

The top 5 sparse principal components for the two data sets are shown in the following tables. The authors claim that the sparse principal components unambiguously identify, and perfectly correspond to, the topics used by The New York Times on its website.

[Tables: top 5 sparse principal components for the NYTimes and PubMed data sets]

Conclusion

1. The proposed algorithm can be applied to very large-scale, real-life data sets;

2. The method works well when the target cardinality of the result is small;

3. The method converges fast in practice.

Discussions

The paper uses the semi-definite formulation of the sparse PCA problem, which it then solves using block coordinate ascent. The advantage of this formulation is that, although computationally expensive, it converges to the global optimum. However, multiple relaxations are needed to reach the formulation itself.

The block coordinate ascent algorithm is efficient and computationally tractable. However, the main source of tractability is the simple pre-processing step that eliminates features with low variance (which does not change the optimal solution).

One weakness of this algorithm is that it is not easy to extend to kernel sparse PCA, due to the constraint Tr(U^T U) = 1, where Z = U^T U; in kernel PCA the constraint is Tr(U^T K_1 U) = 1, where K_1 is the Gram matrix.

Another weakness is the assumption that the number of non-zero entries of each principal component is small. This usually holds in text applications, but may not in other applications.
