Column Transformer with Mixed Types -- a scikit-learn example

Column Transformer with Mixed Types

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

      Using ColumnTransformer, apply different preprocessing and feature extraction pipelines to different subsets of features.

      This tool is very handy for working with heterogeneous datasets.

      For example, scale the numeric features and one-hot encode the categorical ones.

This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.

      For numeric data, first impute missing values using the median, then apply standard scaling.

      For categorical data, first impute missing values with a new 'missing' category, then one-hot encode.

In this example, the numeric data is standard-scaled after median-imputation, while the categorical data is one-hot encoded after imputing missing values with a new category ('missing').

    

      Columns can be dispatched to particular preprocessors in different ways: by column name or by column data type.

In addition, we show two different ways to dispatch the columns to the particular pre-processor: by column names and by column data types.

      Finally, the preprocessing pipeline is integrated into a full prediction pipeline, together with a simple classification model.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline using Pipeline, together with a simple classification model.

Use ColumnTransformer by selecting columns by name

  (1) For the numeric data, build a transformation pipeline, numeric_transformer, consisting of an imputer (SimpleImputer) and a scaler (StandardScaler). This sub-pipeline is later integrated into the ColumnTransformer.

  (2) For the categorical data, build a transformer, categorical_transformer.

      Then integrate (1) and (2) into a ColumnTransformer, with columns selected by name, forming the column transformer named preprocessor.

      Finally, combine preprocessor and a LogisticRegression model into the final pipeline.

We will train our classifier with the following features:

Numeric Features:

  • age: float;

  • fare: float.

Categorical Features:

  • embarked: categories encoded as strings {'C', 'S', 'Q'};

  • sex: categories encoded as strings {'female', 'male'};

  • pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could either be treated as a categorical or numeric feature.

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the Titanic dataset (this happens earlier in the original example).
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

HTML representation of Pipeline

    View the pipeline.

When the Pipeline is printed out in a Jupyter notebook, an HTML representation of the estimator is displayed as follows:

from sklearn import set_config

set_config(display='diagram')
clf
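Outside a notebook, the same diagram can be saved to a file with sklearn.utils.estimator_html_repr (available in recent scikit-learn versions). A minimal sketch; the file name is arbitrary:

from sklearn.utils import estimator_html_repr

# Write the HTML diagram of the pipeline to a file that can be
# opened in any browser.
with open('pipeline_diagram.html', 'w') as f:
    f.write(estimator_html_repr(clf))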

StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

     Numeric features come in different units with different meanings, so their value ranges vary: some are very large, some very small.

    Without normalization, features with a large range of variation become the dominant influence on the model, while the influence of features with smaller ranges may be drowned out.

    Normalizing all features to a similar range balances the influence of each feature.

Standardize features by removing the mean and scaling to unit variance

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
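As a quick toy illustration of this point (not part of the original example; the numbers are arbitrary), a feature measured in the tens of thousands swamps one measured between 0 and 1 until both are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two features on very different scales: one around 50,000, one in [0, 1].
X = np.column_stack([rng.normal(50000, 10000, size=100),
                     rng.uniform(0, 1, size=100)])
print(X.var(axis=0))        # roughly [1e8, 0.08]: the first feature dominates
print(StandardScaler().fit_transform(X).var(axis=0))  # [1. 1.]: balanced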

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
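A minimal sketch of that sparse use case, reusing the toy data from the example below; with_mean=False skips centering, so zero entries stay zero:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X_sparse = csr_matrix([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
scaler = StandardScaler(with_mean=False)      # centering would densify the data
X_scaled = scaler.fit_transform(X_sparse)
print(scaler.scale_)        # per-feature standard deviation: [0.5 0.5]
print(X_scaled.toarray())   # values divided by the std; zeros preserved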

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
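Since the fitted mean and standard deviation are stored on the scaler, the transformation can also be undone with inverse_transform (continuing the example above):

>>> print(scaler.inverse_transform([[3., 3.]]))
[[2. 2.]]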

Use ColumnTransformer by selecting columns by data type

      When all columns of a given data type share the same processing steps, a column selector can pick out the columns of that type and route them to the corresponding transformer or preprocessing sub-pipeline.

When dealing with a cleaned dataset, the preprocessing can be automatic by using the data types of the column to decide whether to treat a column as a numerical or categorical feature. sklearn.compose.make_column_selector gives this possibility. First, let’s only select a subset of columns to simplify our example.

subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
X_train, X_test = X_train[subset_feature], X_test[subset_feature]

Then, we introspect the information regarding each column data type.

X_train.info()

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   float64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(3)
memory usage: 35.0 KB

We can observe that the embarked and sex columns were tagged as category columns when loading the data with fetch_openml. Therefore, we can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numeric_transformer.

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="category")),
    ('cat', categorical_transformer, selector(dtype_include="category"))
])
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])


clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
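After fitting, the ColumnTransformer resolves the selectors to concrete column lists, which can be inspected through its transformers_ attribute (a quick sanity check, not part of the original example):

# Each entry is a (name, fitted_transformer, columns) tuple; the selector
# callables have been resolved to the actual column names.
print(clf.named_steps['preprocessor'].transformers_)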

Using the prediction pipeline in a grid search

    Use grid search to determine the parameters of the pipeline steps, including the numeric imputation strategy and the classifier's hyperparameters.

Grid search can also be performed on the different preprocessing steps defined in the ColumnTransformer object, together with the classifier’s hyperparameters as part of the Pipeline. We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using GridSearchCV.

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search
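The double-underscore keys follow Pipeline's step__substep__param naming convention; rather than guessing them, the valid names can be listed from the pipeline itself, for example:

# get_params() exposes every tunable parameter, nested with '__'.
print([k for k in clf.get_params()
       if k.endswith('strategy') or k == 'classifier__C'])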

      View the parameters of the best result.

Calling ‘fit’ triggers the cross-validated search for the best hyper-parameters combination:

grid_search.fit(X_train, y_train)

print(f"Best params:")
print(grid_search.best_params_)

Out:

Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'mean'}

The internal cross-validation score obtained with those parameters is:

print(f"Internal CV score: {grid_search.best_score_:.3f}")

Out:

Internal CV score: 0.784

     View the five highest-scoring parameter combinations from the whole search.

We can also introspect the top grid search results as a pandas dataframe:

import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[["mean_test_score", "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_classifier__C"
            ]].head(5)
   mean_test_score  std_test_score param_preprocessor__num__imputer__strategy param_classifier__C
0         0.784167        0.035824                                        mean                 0.1
2         0.780366        0.032722                                        mean                   1
1         0.780348        0.037245                                      median                 0.1
4         0.779414        0.033105                                        mean                  10
6         0.779414        0.033105                                        mean                 100


The best hyper-parameters have been used to re-fit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyper-parameter tuning.

print(("best logistic regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

Out:

best logistic regression from grid search: 0.794