Principal Component Analysis

Haifeng Cao, Wei Ji, Zhuang Liu, Yanglin Zhou

Abstract

Dimensionality reduction is an efficient way to address the curse of dimensionality, and there are many linear and nonlinear methods for it. In this paper, we focus on the main linear method, principal component analysis. Principal component analysis (PCA) is probably the oldest and best known of the techniques of multivariate analysis. The main idea is to find an orthogonal basis that represents the original data while keeping as much of its information as possible. We discuss how and why PCA works, and then give a simple simulation of face recognition via PCA.

1. Introduction

Nowadays, data comes in many formats, such as signals, images, or videos, and it usually has high dimensionality. In order to deal with such data efficiently, we need to reduce its dimensionality. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality. In some cases, data analysis such as regression or classification can be done more accurately in the reduced space than in the original space. Because dimensionality reduction is so useful, a variety of methods have been proposed, which can be divided into linear and nonlinear methods. The most famous and widely used linear method is PCA, which can be traced back to 1901. In contrast to the traditional linear methods, nonlinear methods have the ability to deal with complex nonlinear data. These methods have some advantages when handling real-world data, but most of them find their theoretical and algorithmic roots in PCA. In this paper, we give a detailed account of how and why PCA works and of the relationship between PCA and SVD. The remainder of this paper proceeds as follows. In Section 2, we introduce the concept of dimensionality reduction and the main linear and nonlinear methods for it. Section 3 is devoted to the details of PCA. In Section 4, we give a small simulation of face recognition via PCA.

2. Dimensionality reduction

The problem of dimensionality reduction can be defined as follows. Suppose we have a dataset consisting of n data vectors, represented by a D × n matrix X in which each column corresponds to a data vector of dimensionality D. Further, suppose the dataset has intrinsic dimensionality d (where d < D, and usually d ≪ D). Dimensionality reduction methods transform the dataset X of dimensionality D into a new dataset Y of dimensionality d, while retaining the geometry of the data as much as possible. These methods can be divided into linear methods and nonlinear methods.

2.1. Linear methods

Linear methods transform the original data in a linear way. The two best-known linear methods are PCA and LDA (Linear Discriminant Analysis). LDA finds a linear combination of features that characterizes or separates two or more classes of objects or events. PCA is discussed in detail later.

2.2. Nonlinear methods

In contrast to linear methods, nonlinear methods are more complex. They can be subdivided into three main types: (1) methods that attempt to preserve global properties of the original data in the low-dimensional representation, such as MDS and Kernel PCA; (2) methods that attempt to preserve local properties of the original data in the low-dimensional representation, such as LLE and Laplacian Eigenmaps; and (3) methods that perform a global alignment of a mixture of linear models, such as LLC.
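As a concrete illustration of this taxonomy, the following minimal sketch reduces a D × n dataset to d dimensions with one linear method (PCA) and two nonlinear ones (Kernel PCA and LLE). It assumes scikit-learn is available and uses randomly generated data purely for illustration; note that scikit-learn expects samples as rows, i.e. the transpose of the D × n convention used in this paper.

    # Minimal sketch (assumes NumPy and scikit-learn are installed).
    # scikit-learn expects samples as rows, so we pass X.T for our D x n matrix X.
    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import LocallyLinearEmbedding

    D, n, d = 50, 200, 5
    X = np.random.randn(D, n)                  # columns are data vectors

    Y_lin = PCA(n_components=d).fit_transform(X.T).T                        # linear
    Y_glb = KernelPCA(n_components=d, kernel="rbf").fit_transform(X.T).T    # global nonlinear
    Y_loc = LocallyLinearEmbedding(n_components=d,
                                   n_neighbors=10).fit_transform(X.T).T     # local nonlinear

    print(Y_lin.shape, Y_glb.shape, Y_loc.shape)   # each is (d, n)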

3. Principal Component Analysis

In this section, we discuss how and why PCA works. As mentioned before, the main idea of PCA is to find an orthogonal basis that represents the original data while keeping as much of its information as possible. The hope is that this new basis will filter out the noise and reveal hidden structure. From the typical procedure of PCA, we will also see that the SVD provides a better way to obtain the transform matrix.

3.1. Formulation

Is there another basis, a linear combination of the original basis, that best re-expresses our data set? This is the question PCA asks, and it can be formulated as

    P X = Y.

Here, X is the original data set, an m × n matrix in which each column is a single sample of the data, and Y is a new representation of the data set X. P can be interpreted in several ways: (1) P is a matrix that transforms X into Y; (2) geometrically, P is a rotation and a stretch that transforms X into Y; (3) the rows of P, {p_1, ..., p_m}, are a set of new basis vectors for expressing the columns of X.
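To make these readings of P concrete, here is a small illustrative sketch (the 2-D rotation matrix is just an example choice, not part of the original text): its rows form an orthonormal basis, and applying it to X expresses every column of X in that new basis.

    # Illustrative sketch: P as a change of basis (here a 2-D rotation by 30 degrees).
    import numpy as np

    theta = np.pi / 6
    P = np.array([[ np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])   # rows are the new basis vectors

    X = np.random.randn(2, 10)     # m = 2 dimensions, n = 10 samples (columns)
    Y = P @ X                      # each column of Y: that sample's coordinates in the new basis

    # P is orthonormal (P P^T = I), so this particular P is a pure rotation, with no stretch.
    print(np.allclose(P @ P.T, np.eye(2)))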

3.2. How to choose P

From probability and statistics, we know that the covariance measures the degree of linear relationship between two variables. The covariance between A = {a_1, ..., a_n} and B = {b_1, ..., b_n}, both with zero mean, is

    σ_AB^2 = (1/n) Σ_i a_i b_i.

From this we get the covariance matrix C_X of the original data X:

    C_X = (1/n) X X^T.

It should be mentioned that each row of X has been given zero mean in order to simplify the formulation of the covariance matrix. C_X is a square symmetric m × m matrix. Each diagonal term is the variance of a particular dimension; large values correspond to interesting structure. The off-diagonal terms are the covariances between different dimensions; large magnitudes correspond to high noise and redundancy. Our goal is to minimize the redundancy and maximize the signal. That is, after transforming X into Y, the covariance matrix of Y (denoted C_Y) should be a diagonal matrix, and each successive dimension of Y should be rank-ordered by variance. We now show where the transform matrix P comes from. With

    Y = P X,    C_X = (1/n) X X^T,

we have

    C_Y = (1/n) Y Y^T = (1/n) (P X)(P X)^T = P ((1/n) X X^T) P^T,

and therefore

    C_Y = P C_X P^T.

Since C_X is a symmetric matrix, it can be diagonalized by a matrix of its orthonormal eigenvectors. Suppose that matrix is U; then C_X = U D U^T, where D is a diagonal matrix with diag(D) = {λ_1, ..., λ_m} and U U^T = I. If we choose P ≡ U^T, a magic result happens:

    C_Y = P C_X P^T = U^T (U D U^T) U = D,

which is exactly what we want. A careful reader might have noticed that P is also m × m here, so Y has the same dimensionality as X. How, then, can it reduce the dimensionality? We will show this next.
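The following sketch carries out exactly this recipe in NumPy (an illustrative implementation, not the authors' original code; the matrix sizes are arbitrary): center the rows of X, form C_X, take the eigenvectors of C_X as the rows of P, and check that C_Y comes out diagonal with the eigenvalues on its diagonal.

    # Sketch of the choice P = U^T (illustrative; random data, arbitrary sizes).
    import numpy as np

    m, n = 5, 200
    X = np.random.randn(m, n)
    X = X - X.mean(axis=1, keepdims=True)      # give each row zero mean

    C_X = X @ X.T / n                          # covariance matrix, m x m
    lam, U = np.linalg.eigh(C_X)               # columns of U: orthonormal eigenvectors
    lam, U = lam[::-1], U[:, ::-1]             # sort by decreasing eigenvalue (variance)

    P = U.T                                    # rows of P are the eigenvectors
    Y = P @ X
    C_Y = Y @ Y.T / n

    print(np.allclose(C_Y, np.diag(lam), atol=1e-10))   # C_Y is diagonal, entries lambda_i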

3.3. Dimensionality reduction with PCA

In the previous subsection we obtained the transform matrix P, which transforms the original data set X into a new data set Y such that C_Y is diagonalized and each successive dimension of Y is rank-ordered by variance. As mentioned, the variances reflect the importance of each dimension, so we can keep only the k eigenvectors corresponding to the k largest variances, that is, P = [u_1, ..., u_k]^T. In general, the number of principal components can be chosen as

    k̂ = min_k {k : λ_k < τ}

for some threshold τ. To summarize, we have an original data set represented by an m × n matrix X whose mean data vector is X̄. We apply PCA to the centered data X (each column) − X̄ and obtain the transform matrix P of size k × m and a new representation Y of size k × n. We can then store just P, Y and X̄ to keep most of the information of the original data; that is how the dimensionality is reduced. The original data is reconstructed (for each column) as P^T Y + X̄.
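Here is a short sketch of this storage-and-reconstruction scheme (illustrative only; the sizes m, n and the rank k are arbitrary choices): keep the top k eigenvectors, store P, Y and the mean, and rebuild an approximation of X.

    # Sketch: reduce to k dimensions, then reconstruct (illustrative sizes, random data).
    import numpy as np

    m, n, k = 20, 100, 5
    X = np.random.randn(m, n)

    X_bar = X.mean(axis=1, keepdims=True)      # mean data vector
    Xc = X - X_bar                             # centered data (each column minus the mean)

    lam, U = np.linalg.eigh(Xc @ Xc.T / n)
    order = np.argsort(lam)[::-1]              # indices of decreasing variance
    P = U[:, order[:k]].T                      # keep the k leading eigenvectors, P is k x m

    Y = P @ Xc                                 # reduced representation, k x n
    X_rec = P.T @ Y + X_bar                    # reconstruction, m x n

    print(np.linalg.norm(X - X_rec) / np.linalg.norm(X))   # relative reconstruction error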

3.4. The relationship between SVD and PCA

As is well known, the SVD (singular value decomposition) is one of the most important techniques in matrix decomposition, widely used in signal processing and statistics. The traditional way of finding the transform matrix P in PCA is to compute the eigenvectors of the covariance matrix. This can equally be done via the SVD, since the centered data can be decomposed as X = U Σ V^T, and the columns of U are precisely the eigenvectors of the covariance matrix (because X X^T = U Σ^2 U^T).
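A quick numerical check of this equivalence (an illustrative sketch on random data): the left singular vectors of the centered X match the eigenvectors of C_X up to sign, and the squared singular values divided by n match the eigenvalues.

    # Sketch: PCA via the covariance eigendecomposition vs. via the SVD of the centered data.
    import numpy as np

    m, n = 6, 300
    X = np.random.randn(m, n)
    Xc = X - X.mean(axis=1, keepdims=True)

    lam, U_eig = np.linalg.eigh(Xc @ Xc.T / n)
    lam, U_eig = lam[::-1], U_eig[:, ::-1]              # descending variance

    U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    print(np.allclose(lam, s**2 / n))                   # eigenvalues = singular values^2 / n
    # Corresponding columns agree up to sign, so compare absolute inner products with 1.
    print(np.allclose(np.abs(np.sum(U_eig * U_svd, axis=0)), 1.0))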


We have also done some research and found that, numerically, working with the SVD makes much better sense than forming the covariance matrix in the first place, since forming X X^T can cause a loss of precision. A standard example is a matrix of the form

    [ 1  1  1 ]
    [ ε  0  0 ]
    [ 0  ε  0 ]
    [ 0  0  ε ]

where ε denotes a positive number so small that 1 + ε^2 rounds to 1 in floating-point arithmetic: forming the cross product then destroys the information carried by ε, while the SVD applied to the matrix directly still recovers the small singular values.
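A minimal numerical illustration of this effect (the value ε = 1e-9 is an arbitrary illustrative choice; its square, 1e-18, is below double-precision machine epsilon):

    # Sketch: loss of precision when forming the cross product, versus a direct SVD.
    import numpy as np

    eps = 1e-9                     # eps**2 = 1e-18 is below double-precision resolution
    X = np.array([[1.0, 1.0, 1.0],
                  [eps, 0.0, 0.0],
                  [0.0, eps, 0.0],
                  [0.0, 0.0, eps]])

    G = X.T @ X                    # the 1 + eps**2 entries round to exactly 1
    evals = np.linalg.eigvalsh(G)  # two of the three eigenvalues collapse toward 0
    svals = np.linalg.svd(X, compute_uv=False)

    print(np.sqrt(np.abs(evals)))  # the two small values come out wrong (rounding noise, not eps)
    print(svals)                   # ~[1.732..., 1e-9, 1e-9]: the SVD keeps them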

3.5. Limits of PCA

Although PCA is a powerful method for reducing dimensionality, it rests on several assumptions. (1) Linearity: PCA is a linear transform of the original data set. (2) Large variances have important structure: this means the data is assumed to have a high SNR. (3) The principal components are orthogonal: this is what makes PCA solvable with linear-algebra decomposition techniques.

4. A simple simulation and the division of labour

4.1. Simulation

From the previous section, we know the procedure of PCA, which can be summarized as follows:

• Organize the data as an m × n matrix, where m is the dimensionality and n is the number of samples.
• Subtract off the mean for each dimension.
• Compute the SVD, or the eigenvectors of the covariance matrix.

Our simulation is face recognition using PCA. The data is a small subset of the Yale B dataset, which contains photos of ten individuals under various illumination conditions; we only use images of the first three individuals under ten different illumination conditions. The simulation has two parts. In the first part, we apply PCA with k = 2 to all 10 images of individual 1 (see Figure 1) and show the mean face and the first two eigenfaces (see Figure 2). We also see that the first two eigenfaces mainly encode how the appearance of the face varies along the horizontal and vertical lighting directions, respectively (see Figure 3).

In the second part, we divide all the images into two sets: a Training Set (images 1-5 of individuals 1 to 3) and a Test Set (images 6-10 of individuals 1 to 3). We apply PCA to the Training Set, project the Test Set onto the face subspace given by PCA, and then classify these faces with a 1-nearest-neighbor rule. The details are shown in the figures below, and a code sketch of this pipeline is given after the figure captions.

Figure 1. Original faces of person 1 under different illumination conditions.
Figure 2. Mean face and the first two eigenfaces.
Figure 3. Variation of the face images along the two eigenfaces given by PCA: (a) along the first eigenface, (b) along the second eigenface. Each row plots µ + y_i · u_i for y_i ranging from −σ_i to σ_i, i = 1, 2.
Figure 4. The first ten eigenfaces of the Training Set, k = 10.

Figure 5. The first ten singular values of the corresponding eigenfaces.


Figure 6. Variation of the accuracy rate with different numbers of components.
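The following sketch outlines the second part of the simulation (an illustrative reimplementation, not the authors' original code; the image size is a placeholder and random arrays stand in for the Yale B images): train PCA on the training images, project both sets onto the first k eigenfaces, and classify by 1-nearest-neighbor.

    # Illustrative sketch of the eigenface + 1-NN experiment (placeholder data and sizes).
    import numpy as np

    rng = np.random.default_rng(0)
    h, w, k = 32, 32, 10                       # placeholder image size, number of components
    n_train, n_test = 15, 15                   # 3 individuals x 5 images each
    labels_train = np.repeat([0, 1, 2], 5)
    labels_test = np.repeat([0, 1, 2], 5)

    X_train = rng.standard_normal((h * w, n_train))   # columns are vectorized images
    X_test = rng.standard_normal((h * w, n_test))

    # PCA on the Training Set: mean face + first k eigenfaces via SVD of the centered data.
    mean_face = X_train.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(X_train - mean_face, full_matrices=False)
    P = U[:, :k].T                                    # k x (h*w), rows are eigenfaces

    Y_train = P @ (X_train - mean_face)               # k x n_train
    Y_test = P @ (X_test - mean_face)                 # k x n_test

    # 1-nearest-neighbor classification in the k-dimensional face subspace.
    dists = np.linalg.norm(Y_test[:, :, None] - Y_train[:, None, :], axis=0)
    pred = labels_train[np.argmin(dists, axis=1)]
    print("accuracy:", np.mean(pred == labels_test))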

4.2. Labour

It is hard to describe in detail what each of us has done, so here we only give a rough division of labour.

• Do the background research and thoroughly understand the PCA method and the knowledge related to PCA. Done by Wei Ji and Yanglin Zhou.
• Come up with the questions, the summary, and the code. Done by Haifeng Cao and Zhuang Liu.

References

[1] L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality reduction: A comparative review, 2008.
[2] A.M. Martínez and A.C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[3] R.E. Madsen, L.K. Hansen, and O. Winther. Singular value decomposition and principal component analysis, 2004.
[4] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis: Estimation and segmentation of hybrid models, 2006.
[5] J. Shlens. A tutorial on principal component analysis, 2009.
[6] L.I. Smith. A tutorial on principal components analysis, 2002.


