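1. Feature extraction from categorical variables

Take city as a feature: its values are a set of discrete city labels. We represent this kind of feature with a binary (one-hot) encoding, where the column for the matching city is 1 and every other column is 0.

Suppose there are three cities: Beijing (北京), Tianjin (天津), and Shanghai (上海). We can extract the features with scikit-learn's DictVectorizer: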

from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
instances = [{'city': '北京'},{'city': '天津'}, {'city': '上海'}]
print(onehot_encoder.fit_transform(instances).toarray())

Encoded result:

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
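
To see where the column order comes from, here is a minimal plain-Python sketch of the same idea. It is not scikit-learn's implementation: the helper name one_hot is made up for illustration. It relies on DictVectorizer's default behavior of naming features "key=value" and sorting them.

```python
# Plain-Python sketch of one-hot encoding (illustrative, not scikit-learn's code).
instances = [{'city': '北京'}, {'city': '天津'}, {'city': '上海'}]

# Build "key=value" feature names and sort them to fix the column order.
feature_names = sorted({f"{k}={v}" for d in instances for k, v in d.items()})
column = {name: i for i, name in enumerate(feature_names)}

def one_hot(instance):
    """Return a row with 1.0 in the column of each key=value pair, 0.0 elsewhere."""
    row = [0.0] * len(feature_names)
    for k, v in instance.items():
        row[column[f"{k}={v}"]] = 1.0
    return row

encoded = [one_hot(d) for d in instances]
print(feature_names)
for row in encoded:
    print(row)
```

The sorted names put 上海 in column 0, 北京 in column 1, and 天津 in column 2, which is why the first row (北京) has its 1 in the middle column.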

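2. Text feature extraction

Text features mostly come in two forms: whether a word is present or not, and the word's TF-IDF weight.

The first case uses the bag-of-words representation: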

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']
vectorizer = CountVectorizer()
# Learn the vocabulary and print the document-term count matrix
print(vectorizer.fit_transform(corpus).todense())
# Print the mapping from each word to its column index
print(vectorizer.vocabulary_)

Encoded result:

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{u'duke': 1, u'basketball': 0, u'lost': 4, u'played': 5, u'game': 2, u'unc': 7, u'in': 3, u'the': 6}

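Each entry in the matrix is the number of times a vocabulary word occurs in the document. Here no word occurs more than once, so 1 means the word is present and 0 means it is absent. The numbers in the vocabulary are the column indices of the words, assigned in alphabetical order.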
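
The counting itself can be sketched in a few lines of plain Python. This is an approximation for illustration, not the library's code: the helper names tokenize and count_vector are made up, and the regular expression mirrors CountVectorizer's default of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Alphabetically sorted vocabulary: word -> column index.
all_words = sorted({t for doc in corpus for t in tokenize(doc)})
vocabulary = {word: i for i, word in enumerate(all_words)}

def count_vector(text):
    """Return the per-word occurrence counts for one document."""
    row = [0] * len(vocabulary)
    for word, n in Counter(tokenize(text)).items():
        row[vocabulary[word]] = n
    return row

matrix = [count_vector(doc) for doc in corpus]
print(matrix)
print(vocabulary)
```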

The second case uses TF-IDF to express how important a word is:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['The dog ate a sandwich and I ate a sandwich', 'The wizard transfigured a sandwich' ]
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Result:

[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]
{u'sandwich': 2, u'wizard': 4, u'dog': 1, u'transfigured': 3, u'ate': 0}

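The highest value belongs to "ate" in the first sentence: it occurs twice there and appears in only one of the two documents, so its IDF is high. "sandwich" also occurs twice in that sentence, but because it appears in both documents its IDF is lower, and so is its score.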

The lowest values, naturally, belong to words that do not appear in the sentence at all.
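
The numbers above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The two-word STOP_WORDS set is a stand-in for the much longer built-in English list, and the helper names are made up for illustration.

```python
import math
import re
from collections import Counter

corpus = ['The dog ate a sandwich and I ate a sandwich',
          'The wizard transfigured a sandwich']
# Assumption: tiny stand-in stop list; the built-in 'english' list is far longer.
STOP_WORDS = {'the', 'and'}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [Counter(tokenize(d)) for d in corpus]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = number of docs containing w.
idf = {w: math.log((1 + n_docs) / (1 + sum(w in d for d in docs))) + 1
       for w in vocab}

def tfidf_row(doc):
    """Term count times IDF for each vocabulary word, scaled to unit L2 norm."""
    raw = [doc[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

tfidf = [tfidf_row(d) for d in docs]
print(vocab)
for row in tfidf:
    print([round(x, 4) for x in row])
```

In the first row, "ate" and "sandwich" both have a count of 2, but "sandwich" has an IDF of exactly 1 because it appears in every document, while "ate" has an IDF of about 1.405.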

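3. Data standardization

Standardization rescales the data so that each feature (column) has zero mean and unit variance. For example, standardize the following matrix: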

from sklearn import preprocessing
import numpy as np
X = np.array([[0., 0., 5., 13., 9., 1.], [0., 0., 13., 15., 10., 15.], [0., 3., 15., 2., 0., 11.]])
print(preprocessing.scale(X))

Output:

[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]
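
The same result can be worked out column by column: subtract the column mean, then divide by the column's population standard deviation (dividing by n, not n - 1). The sketch below uses only the standard library. The helper name standardize is made up for illustration, and the fallback to 1.0 for a constant column is an assumption chosen to match the all-zero first column in the output above.

```python
from statistics import fmean, pstdev

X = [[0., 0., 5., 13., 9., 1.],
     [0., 0., 13., 15., 10., 15.],
     [0., 3., 15., 2., 0., 11.]]

def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # Population std (divide by n); use 1.0 for constant columns to avoid 0/0.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

X_scaled = standardize(X)
for row in X_scaled:
    print([round(x, 8) for x in row])
```

For the second column, [0, 0, 3], the mean is 1 and the population standard deviation is the square root of 2, which gives -0.70710678 for the two zeros and 1.41421356 for the 3.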