机器学习sklearn（九）：特征工程（二）特征离散化（二）特征二值化

特征二值化是将数值特征用阈值过滤得到布尔值的过程。这对于下游的概率型模型是有用的，它们假设输入数据是多值伯努利分布(Bernoulli distribution) 。例如这个示例 sklearn.neural_network.BernoulliRBM 。

即使归一化计数(又名术语频率)和TF-IDF值特征在实践中表现稍好一些，文本处理团队也常常使用二值化特征值(这可能会简化概率估计)。

相比于 Normalizer ，实用程序类 Binarizer 也被用于 sklearn.pipeline.Pipeline 的早期步骤中。因为每个样本被当做是独立于其他样本的，所以 fit 方法是无用的:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
 [ 1.,  0.,  0.],
 [ 0.,  1.,  0.]])

也可以为二值化器赋一个阈值:

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0.,  0.,  1.],
 [ 1.,  0.,  0.],
 [ 0.,  0.,  0.]])

相比于 StandardScaler 和 Normalizer 类的情况，预处理模块提供了一个相似的函数 binarize ，以便不需要转换接口时使用。

稀疏输入

binarize 以及 Binarizer 接收来自scipy.sparse的密集类数组数据以及稀疏矩阵作为输入。

对于稀疏输入，数据被转化为压缩的稀疏行形式 (参见 scipy.sparse.csr_matrix )。为了避免不必要的内存复制，推荐在上游选择CSR表示。

class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True)

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.

It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

Read more in the User Guide.

Parameters

thresholdfloat, default=0.0: Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
copybool, default=True: set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

Methods

`fit`(X[, y])	Do nothing and return the estimator unchanged.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X[, copy])	Binarize each element of X.

Examples

>>> from sklearn.preprocessing import Binarizer
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

机器学习sklearn（九）： 特征工程（二）特征离散化（二）特征二值化

机器学习sklearn（九）：特征工程（二）特征离散化（二）特征二值化