PCA, SVD以及代码示例

本文是对PCA和SVD学习的整理笔记，为了避免很多重复内容的工作，我会在介绍概念的时候引用其他童鞋的工作和内容，具体来源我会标记在参考资料中。

一.PCA (Principle component analysis)

PCA（主成分分析）通过线性变换将原始数据变换为一组各维度线性无关的表示，可用于提取数据的主要特征分量，常用于高维数据的降维。

为什么需要降维？以下图为例，图c中的点x y 呈现明显线性相关，假如以数据其实以数据点分布的方向的直线上的投影（一维）已经能够很好的描述这组数据特点了。明显的，将数据维度降低：1能够降低数据计算量 2压缩数据重构 3.部分情况下甚至能够改善数据特征。

　　那么如何在降维时尽量保留源数据的特征，PCA就是一种。关于如何理解，PCA,通常可以用两种方式进行理解：一是让降维后的数据分布尽量分散能够保留信息（方差尽量大）二是降维导致的信息损失尽量小。关于第一种理解方式，大家可以参考这里，细致而清晰。第二种方法通常需要简单的公式推导，利用拉格朗日乘子将带约束的优化转化为无约束优化后求导，有兴趣的童鞋可以参考这里.

上面两篇文章关于两个不同方向解释PCA，那么这里就直接写出PCA的降维方法，假设原数据为X：

　　　　设有m条n个特征的数据。

　　　　1）将原始数据按列组成n行m列矩阵X，即每一列代表一组数据

　　　　2）将X的每一行（代表一个属性字段）进行零均值化，即减去这一行的均值

　　　　3）求出协方差矩阵[C=frac{1}{m}XX^{T}]

　　　　4）求出协方差矩阵的特征值及对应的特征向量(对[XX^{T}]进行特征分解)

　　　　5）将特征向量按对应特征值大小从上到下按行排列成矩阵，取前k行组成矩阵P

　　　　6） $Y = P X$

$Y = P X$

二 SVD(Singular value decomposition)

$Y = P X$

$Y = P X$

$Y = P X$

$Y = P X$

　也就是说，奇异值分解可以说是包含了特征分解！来看Wikipedia的解释：

在矩阵M的奇异值分解中

$M = USigma V^*, \,$

V的列（columns）组成一套对 $M\,$

U的列（columns）组成一套对 $M\,$

Σ对角线上的元素是奇异值，可视为是在输入与输出间进行的标量的"膨胀控制"。这些是 $MM^*\,$

这里的*标识转置T。看到其中U就是MM*的特征向量了，那么也就是说利用奇异值分解也可以做PCA了，而且还不用求[XX^{T}]！

不仅如此，单独观看奇异值分解的式子，我们也可以利用主成分的思想，利用奇异值分解的公式对高维数据进行压缩，具体看下面的代码。

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt


def decide_k(s, ratio):
    sum_tmp = 0
    sum_s = np.sum(s)
    k = 0
    for i in s:
        k += 1
        sum_tmp += i
        if (sum_tmp / sum_s) >= ratio:
            print("reduce dims is:", k)
            return k

    if k >= s.shape:
        raise ValueError('input dim could not less than compress dims')


def svd_refactor(x, ratio=0.90):  # compress to a k dims data

    before = x.shape[0] * x.shape[1]
    print("before compress:", before)

    # after svd, save cu cv and cs ,then we could use them to refactor picture
    mean_ = np.mean(x, axis=1, keepdims=True)
    x = x - mean_
    u, s, v = np.linalg.svd(x)
    k = decide_k(s, ratio)
    c_u = u[:, :k]
    c_v = v[:k, :]
    c_s = s[0:k]

    after = c_u.shape[0] * c_u.shape[1] + c_v.shape[0] * c_v.shape[1] + c_s.shape[0]
    print("after compress:", after)
    print("ratio", after / before)

    # refactor
    s_s = np.diag(c_s)
    return np.dot(c_u, np.dot(s_s, c_v))


def pca_refactor(x, ratio=0.90):  # compress to a k dims data

    before = x.shape[0] * x.shape[1]
    print("before pca:", before)

    # after svd, save cu cv and cs ,then we could use them to refactor picture
    mean_ = np.mean(x, axis=1, keepdims=True)
    x = x - mean_
    u, s, v = np.linalg.svd(x)
    k = decide_k(s, ratio)
    c_u = u[:, :k]

    eig_vec = c_u.transpose()
    pca_result = np.dot(eig_vec, x)

    after = c_u.shape[0] * c_u.shape[1] + pca_result.shape[0] * pca_result.shape[1]
    print("after pca:", after)
    print("ratio", after / before)

    # since U*U=I
    return np.dot(c_u, pca_result)


if __name__ == '__main__':
    img_file = Image.open('test.jpg').convert('L')  # convert picture to gray
    img_array = np.array(img_file)
    print(img_array.shape)

    img_array = pca_refactor(img_array)

    plt.figure("beauty")
    plt.imshow(img_array, cmap=plt.cm.gray)
    plt.axis('off')
    plt.show()

其中关于如何选择降低维度到多少维的decide_k函数，采用了贡献率。就是指当剩余特征值和的比例小于一定百分比（0.05）的时候舍弃他们。

Reference:

李航《统计学习方法》

Relationship between SVD and PCA. How to use SVD to perform PCA?

PCA的数学原理

机器学习中的数学(5)-强大的矩阵奇异值分解(SVD)及其应用

机器学习中的SVD和PCA.知乎

奇异值的物理意义是什么？

矩阵的奇异值与特征值有什么相似之处与区别之处

从PCA和SVD的关系拾遗