Python中的主成分分析（PCA）

我有一个（26424 x 144）数组，我想用Python执行PCA。然而，在networking上没有特别的地方可以解释如何完成这个任务（有些网站只是根据自己的需要来做PCA–没有一个通用的方法可以find）。任何有帮助的人都会做得很好。

你可以在matplotlib模块中find一个PCA函数：

 import numpy as np from matplotlib.mlab import PCA data = np.array(np.random.randint(10,size=(10,3))) results = PCA(data)

结果将存储PCA的各种参数。它来自matplotlib的mlab部分，它是MATLAB语法的兼容层

编辑：在博客nextgenetics我发现了如何执行和显示PCP与matplotlib mlab模块的一个奇妙的演示，玩得开心，并检查该博客！

即使已经接受了另一个答案，我也发布了答案。接受的答案依赖于不推荐的函数 ; 此外，这个弃用的函数是基于奇异值分解 （SVD），这是（虽然完全有效）是计算PCA的两种一般技术的更多的存储器和处理器密集型。由于OP中的数据数组的大小，在这里特别重要。使用基于协方差的PCA，计算stream程中使用的数组仅为144 x 144 ，而不是26424 x 144 （原始数据数组的数组）。

以下是使用SciPy的linalg模块的PCA的简单工作实现。因为这个实现首先计算协方差matrix，然后在这个数组上执行所有的后续计算，所以它使用的内存比基于SVD的PCAless得多。

（ NumPy中的linalg模块也可以在下面的代码中使用，除了导入语句以外，代码如下：）。

这个PCA实施的两个关键步骤是：

计算协方差matrix ; 和
取这个covmatrix的eivenvectors ＆ eigenvalues

在下面的函数中，参数dims_rescaled_data指向重新缩放的数据matrix中所需的维数; 这个参数的默认值只有两个维度，但是下面的代码不限于两个，但是它可以是小于原始数据数组的列数的任何值。

 def PCA(data, dims_rescaled_data=2): """ returns: data transformed in 2 dims/columns + regenerated original data pass in: data as 2D NumPy array """ import numpy as NP from scipy import linalg as LA m, n = data.shape # mean center the data data -= data.mean(axis=0) # calculate the covariance matrix R = NP.cov(data, rowvar=False) # calculate eigenvectors & eigenvalues of the covariance matrix # use 'eigh' rather than 'eig' since R is symmetric, # the performance gain is substantial evals, evecs = LA.eigh(R) # sort eigenvalue in decreasing order idx = NP.argsort(evals)[::-1] evecs = evecs[:,idx] # sort eigenvectors according to same index evals = evals[idx] # select the first n eigenvectors (n is desired dimension # of rescaled data array, or dims_rescaled_data) evecs = evecs[:, :dims_rescaled_data] # carry out the transformation on the data using eigenvectors # and return the re-scaled data, eigenvalues, and eigenvectors return NP.dot(evecs.T, data.T).T, evals, evecs def test_PCA(data, dims_rescaled_data=2): ''' test by attempting to recover original data array from the eigenvectors of its covariance matrix & comparing that 'recovered' array with the original data ''' _ , _ , eigenvectors = PCA(data, dim_rescaled_data=2) data_recovered = NP.dot(eigenvectors, m).T data_recovered += data_recovered.mean(axis=0) assert NP.allclose(data, data_recovered) def plot_pca(data): from matplotlib import pyplot as MPL clr1 = '#2026B2' fig = MPL.figure() ax1 = fig.add_subplot(111) data_resc, data_orig = PCA(data) ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1) MPL.show() >>> # iris, probably the most widely used reference data set in ML >>> df = "~/iris.csv" >>> data = NP.loadtxt(df, delimiter=',') >>> # remove class labels >>> data = data[:,:-1] >>> plot_pca(data)

下面的图是虹膜数据上的这个PCAfunction的直观表示。正如你所看到的，2D变换将I类与II类和III类完全分开（但不是III类的II类，实际上需要另一个维度）。

在这里输入图像说明

另一个使用numpy的Python PCA。与@doug相同的想法，但没有运行。

 from numpy import array, dot, mean, std, empty, argsort from numpy.linalg import eigh, solve from numpy.random import randn from matplotlib.pyplot import subplots, show def cov(data): """ Covariance matrix note: specifically for mean-centered data note: numpy's `cov` uses N-1 as normalization """ return dot(XT, X) / X.shape[0] # N = data.shape[1] # C = empty((N, N)) # for j in range(N): # C[j, j] = mean(data[:, j] * data[:, j]) # for k in range(j + 1, N): # C[j, k] = C[k, j] = mean(data[:, j] * data[:, k]) # return C def pca(data, pc_count = None): """ Principal component analysis using eigenvalues note: this mean-centers and auto-scales the data (in-place) """ data -= mean(data, 0) data /= std(data, 0) C = cov(data) E, V = eigh(C) key = argsort(E)[::-1][:pc_count] E, V = E[key], V[:, key] U = dot(data, V) # used to be dot(VT, data.T).T return U, E, V """ test data """ data = array([randn(8) for k in range(150)]) data[:50, 2:4] += 5 data[50:, 2:5] += 5 """ visualize """ trans = pca(data, 3)[0] fig, (ax1, ax2) = subplots(1, 2) ax1.scatter(data[:50, 0], data[:50, 1], c = 'r') ax1.scatter(data[50:, 0], data[50:, 1], c = 'b') ax2.scatter(trans[:50, 0], trans[:50, 1], c = 'r') ax2.scatter(trans[50:, 0], trans[50:, 1], c = 'b') show()

同样的东西产生的更短

 from sklearn.decomposition import PCA def pca2(data, pc_count = None): return PCA(n_components = 4).fit_transform(data)

据我所知，使用特征值（第一种方法）更适合高维数据和更less的样本，而使用奇异值分解更好，如果您有更多的样本比维度。

这是一个numpy的工作。

下面是一个教程，演示如何使用numpy的内置模块（如mean,cov,double,cumsum,dot,linalg,array,rank来完成pincipal组件分析。

http://glowingpython.blogspot.sg/2011/07/principal-component-analysis-with-numpy.html

请注意， scipy也有一个很长的解释 – https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105

库matplotlib.mlab.PCA （在这个答案中使用）不被弃用。所以对于通过Google来到这里的所有人来说，我会发布一个完整的工作示例，用Python 2.7进行testing。

 from matplotlib.mlab import PCA import numpy data = numpy.array( [[3,2,5], [-2,1,6], [-1,0,4], [4,3,4], [10,-5,-6]] ) pca = PCA(data)

现在在`pca.Y'中是以主成分基向量表示的原始数据matrix。有关PCA对象的更多细节可以在这里find。

 >>> pca.Y array([[ 0.67629162, -0.49384752, 0.14489202], [ 1.26314784, 0.60164795, 0.02858026], [ 0.64937611, 0.69057287, -0.06833576], [ 0.60697227, -0.90088738, -0.11194732], [-3.19578784, 0.10251408, 0.00681079]])

您可以使用matplotlib.pyplot来绘制这些数据，只是为了说服自己，PCA会产生“好的”结果。名单只是用来注释我们的五个向量。

 import matplotlib.pyplot names = [ "A", "B", "C", "D", "E" ] matplotlib.pyplot.scatter(pca.Y[:,0], pca.Y[:,1]) for label, x, y in zip(names, pca.Y[:,0], pca.Y[:,1]): matplotlib.pyplot.annotate( label, xy=(x, y), xytext=(-2, 2), textcoords='offset points', ha='right', va='bottom' ) matplotlib.pyplot.show()

看看我们的原始vector，我们将看到数据[0]（“A”）和数据[3]（“D”）与数据[1]（“B”）和数据[2] C”）。这反映在我们的PCA转换数据的二维图中。

PCA结果图

为了def plot_pca(data):将工作，有必要更换线

 data_resc, data_orig = PCA(data) ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)

与线

 newData, data_resc, data_orig = PCA(data) ax1.plot(newData[:, 0], newData[:, 1], '.', mfc=clr1, mec=clr1)

Python中的主成分分析（PCA）

保存朴素贝叶斯训练分类器在NLTK

公开发布的垃圾邮件filter培训集

改变训练的图像训练neural network

build议用户可以购买什么，如果他已经有东西在购物车

Numpy 1热arrays

在C＃中的机器学习库

我如何find与Python的Wally？

推荐用于R中非常大的数据集处理和机器学习的软件包

任何人推荐一个有条件的随机领域的好教程

如何将数据分成3组（火车，validation和testing）？