Machine Learning with sklearn (14): Feature Engineering (5), Feature Encoding (2), Feature Hashing (2)

Feature hashing (in effect, a dimensionality-reduction technique)

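The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick". Instead of building a hash table of the features encountered during training, as the other vectorizers do, FeatureHasher applies a hash function to each feature to determine its column index in the sample matrix directly. The result is higher speed and lower memory usage, at the expense of inspectability: the hasher does not remember what the input features looked like, and it has no inverse_transform method.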
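The contrast with a vocabulary-based vectorizer can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the two sample dictionaries and n_features=8 are made up for the example.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Illustrative samples (not from any real dataset).
samples = [{"dog": 1, "cat": 2}, {"dog": 2, "run": 5}]

# DictVectorizer learns a vocabulary (feature name -> column) during fit.
dict_vectorizer = DictVectorizer()
X_dict = dict_vectorizer.fit_transform(samples)
print(dict_vectorizer.vocabulary_)  # the mapping that is kept in memory
print(X_dict.shape)                 # one column per distinct feature name

# FeatureHasher is stateless: no fit step is needed and no mapping is stored.
hasher = FeatureHasher(n_features=8)
X_hash = hasher.transform(samples)
print(X_hash.shape)                 # the width is fixed up front by n_features

# The price is inspectability: hashed columns cannot be mapped back to names.
print(hasattr(dict_vectorizer, "inverse_transform"))
print(hasattr(hasher, "inverse_transform"))
```

Because the hasher stores nothing, feature names that were never seen before can be transformed at any time, which is what makes it attractive for streaming data.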

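Since the hash function may cause collisions between (unrelated) features, a signed hash function is used, and the sign of the hash value determines the sign of the value stored in the output matrix for that feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero. This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes (n_features < 10000). For large hash table sizes it can be disabled, so that the output can be passed to estimators such as sklearn.naive_bayes.MultinomialNB or to the sklearn.feature_selection.chi2 feature selector, both of which expect non-negative inputs.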
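The effect of the signed hash can be illustrated with a small pure-Python sketch. It is not scikit-learn's implementation: zlib.crc32 is used here as a stand-in for MurmurHash3, and the helper names signed_hash32 and hash_row are invented for the example.

```python
import zlib

def signed_hash32(name: str) -> int:
    """Stand-in signed 32-bit hash built on CRC32 (scikit-learn uses MurmurHash3)."""
    h = zlib.crc32(name.encode("utf-8"))       # unsigned, 0 .. 2**32 - 1
    return h - 2**32 if h >= 2**31 else h      # reinterpret as signed 32-bit

def hash_row(features, n_features, alternate_sign=True):
    """Hash one {name: value} sample into a dense row of length n_features."""
    row = [0.0] * n_features
    for name, value in features.items():
        h = signed_hash32(name)
        if alternate_sign and h < 0:
            value = -value                     # the hash sign sets the value sign
        row[abs(h) % n_features] += value
    return row

# 1000 features squeezed into 4 columns, so collisions are unavoidable.
sample = {"feat%d" % i: 1.0 for i in range(1000)}
signed = hash_row(sample, n_features=4)
unsigned = hash_row(sample, n_features=4, alternate_sign=False)
print(signed)    # colliding values partly cancel each other
print(unsigned)  # colliding values pile up; every entry is non-negative
```

With the sign disabled, the row stays non-negative, which is the property that estimators such as MultinomialNB rely on.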

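FeatureHasher accepts mappings (such as Python's dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. A mapping is treated as a list of (feature, value) pairs, while a single string has an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a feature occurs multiple times in one sample, the associated values are summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output of FeatureHasher is always a scipy.sparse matrix in CSR format.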
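The three input formats and the summing of repeated features can be checked with a short sketch like the one below. It assumes scikit-learn and NumPy are installed; n = 16 is an arbitrary width picked for the example.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

n = 16  # arbitrary width for this illustration

# The same sample expressed in the three accepted input formats.
as_dict = FeatureHasher(n_features=n, input_type="dict").transform(
    [{"feat1": 1, "feat2": 1, "feat3": 1}])
as_pair = FeatureHasher(n_features=n, input_type="pair").transform(
    [[("feat1", 1), ("feat2", 1), ("feat3", 1)]])
as_string = FeatureHasher(n_features=n, input_type="string").transform(
    [["feat1", "feat2", "feat3"]])

# All three should produce the same hashed row.
print(np.array_equal(as_dict.toarray(), as_pair.toarray()))
print(np.array_equal(as_pair.toarray(), as_string.toarray()))

# A feature repeated within one sample has its values summed.
repeated = FeatureHasher(n_features=n, input_type="pair").transform(
    [[("feat", 2), ("feat", 3.5)]])
print(abs(repeated).sum())  # magnitude 5.5; the sign depends on the hash
print(repeated.format)      # the output is a CSR sparse matrix
```

Note that with input_type="string" each sample is itself an iterable of strings, so a single sample is written as a list inside the outer list.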

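Feature hashing can be used in document classification, but unlike text.CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; for a combined tokenizer/hasher, see "Vectorizing a large text corpus with the hashing trick" in the scikit-learn user guide.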
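The difference can be sketched as follows, assuming scikit-learn is installed. The two documents and the naive lower-case/whitespace tokenizer are choices made for this illustration only; HashingVectorizer is the combined tokenizer/hasher that scikit-learn provides for text.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["The quick brown fox", "jumps over the lazy dog"]

# FeatureHasher does no tokenization, so the caller splits the text first.
# str.lower().split() is a deliberately naive tokenizer used for this sketch.
tokenized = [doc.lower().split() for doc in docs]
X_manual = FeatureHasher(n_features=2**10, input_type="string").transform(tokenized)

# HashingVectorizer tokenizes and hashes in one step.
X_auto = HashingVectorizer(n_features=2**10).transform(docs)

print(X_manual.shape, X_auto.shape)
```

The two matrices have the same shape but generally different values, since HashingVectorizer applies its own tokenization and normalization defaults.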

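As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. A Python generator function can be used to extract the features: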

def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

Then the raw_X to be passed to FeatureHasher.transform can be constructed as follows:

raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)

and passed to a hasher:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)

to obtain a scipy.sparse matrix X.

Note the use of a generator comprehension, which introduces laziness into the feature extraction: tokens are only processed on demand by the hasher.
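The snippets above leave pos_tagger and corpus undefined. A self-contained version might look like the sketch below, where TOY_TAGS, the toy pos_tagger, and the five-token corpus are hypothetical stand-ins, and n_features=2**10 is chosen only to keep the example small.

```python
from sklearn.feature_extraction import FeatureHasher

def token_features(token, part_of_speech):
    """Yield string features for one (token, part_of_speech) pair."""
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

# Hypothetical stand-ins: a real project would use an actual tagger and corpus.
TOY_TAGS = {"the": "DET", "nasa": "PROPN", "launched": "VERB", "3": "NUM"}

def pos_tagger(token):
    """Toy lookup tagger; unknown tokens default to NOUN."""
    return TOY_TAGS.get(token.lower(), "NOUN")

corpus = ["The", "NASA", "launched", "3", "rockets"]

# Generator of generators: nothing is computed until the hasher iterates.
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform(raw_X)
print(X.shape)  # one row per token
```

Since raw_X is a generator, it is consumed by transform and has to be rebuilt before it can be hashed a second time.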

Implementation details

FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), the maximum number of features currently supported is $2^{31} - 1$.

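The original formulation of the hashing trick by Weinberger et al. used two separate hash functions, $h$ and $\xi$, to determine the column index and the sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.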
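The single-hash scheme can be approximated by hand with sklearn.utils.murmurhash3_32. The sketch below rests on an assumption about the implementation (seed 0, column taken as abs(h) % n_features, sign taken from h >= 0), so it compares its result against FeatureHasher rather than asserting that this is the library's code; manual_hash_row is a name invented for the example.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.utils import murmurhash3_32

def manual_hash_row(features, n_features, alternate_sign=True):
    """Hash one {name: value} sample using a single signed 32-bit hash."""
    row = np.zeros(n_features)
    for name, value in features.items():
        h = murmurhash3_32(name, seed=0)          # signed 32-bit integer
        # Assumed scheme: one hash supplies both the sign and the column.
        sign = -1 if (alternate_sign and h < 0) else 1
        row[abs(h) % n_features] += sign * value
    return row

sample = {"dog": 1, "cat": 2, "elephant": 4}
manual = manual_hash_row(sample, n_features=10)
reference = FeatureHasher(n_features=10).transform([sample]).toarray()[0]
print(manual)
print(reference)
```

If the two printed rows agree, the assumption holds for the installed scikit-learn version; the internal details are not part of the public API and may change.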

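Since a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.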
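The unevenness can be seen by counting how many hash values land in each column. The sketch below uses an 8-bit hash space (256 values) as a small stand-in for the real 32-bit space; column_counts is a helper written for this illustration.

```python
from collections import Counter

def column_counts(n_hash_values, n_features):
    """Count how many hash values in range(n_hash_values) map to each column."""
    counts = Counter(h % n_features for h in range(n_hash_values))
    return [counts[c] for c in range(n_features)]

# An 8-bit hash space (256 values) stands in for the real 32-bit one.
print(column_counts(256, 16))  # power of two: every column gets 16 hash values
print(column_counts(256, 10))  # otherwise: columns 0-5 get 26, columns 6-9 get 25
```

In the real 32-bit space the per-column counts still differ by at most one, so for small n_features the imbalance is tiny; a power of two removes it entirely.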

References:

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hashing for large scale multitask learning. Proc. ICML.

MurmurHash3.

class sklearn.feature_extraction.FeatureHasher(n_features=1048576, *, input_type='dict', dtype=<class 'numpy.float64'>, alternate_sign=True)

Implements feature hashing, aka the hashing trick.

This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3.

Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done. Feature values must be (finite) numbers.

This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices.

Read more in the User Guide.

New in version 0.13.

Parameters

n_features (int, default=2**20): The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
input_type ({"dict", "pair", "string"}, default="dict"): Either "dict" (the default) to accept dictionaries over (feature_name, value); "pair" to accept pairs of (feature_name, value); or "string" to accept single strings. feature_name should be a string, while value should be a number. In the case of "string", a value of 1 is implied. The feature_name is hashed to find the appropriate column for the feature. The value's sign might be flipped in the output (but see alternate_sign, below).
dtype (numpy dtype, default=np.float64): The type of feature values. Passed to scipy.sparse matrix constructors as the dtype argument. Do not set this to bool, np.bool_ or any unsigned integer type.
alternate_sign (bool, default=True): When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.

Changed in version 0.19: alternate_sign replaces the now deprecated non_negative parameter.

Examples

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
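The class description mentions large-scale (online) learning as the intended use. A minimal sketch of that pattern is shown below, assuming scikit-learn is installed; the token batches, the labels, and the choice of SGDClassifier are hypothetical and serve only to show the hasher feeding partial_fit.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**8, input_type="string")
clf = SGDClassifier(random_state=0)

# Hypothetical mini-batches of (tokens, label) pairs arriving as a stream.
batches = [
    [(["cheap", "pills", "buy"], 1), (["meeting", "at", "noon"], 0)],
    [(["buy", "now", "cheap"], 1), (["lunch", "meeting", "today"], 0)],
]

for batch in batches:
    tokens, labels = zip(*batch)
    X = hasher.transform(tokens)  # stateless, so unseen tokens are not a problem
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(hasher.transform([["cheap", "pills"]]))
print(pred)
```

Because the output width is fixed by n_features, the classifier's coefficient vector never has to grow as new tokens appear in later batches.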