[sklearn] 官方例程-Imputing missing values before building an estimator 随机填充缺失值

官方链接:http://scikit-learn.org/dev/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py

该例程是为了说明对缺失值的随即填充训练出的estimator表现优于直接删掉有缺失字段值的estimator

例程代码及附加注释如下:

---------------------------------------------

import numpy as np

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score

# 设定随机数种子 rng = np.random.RandomState(0) # 载入数据 波士顿房价 dataset = load_boston() X_full, y_full = dataset.data, dataset.target n_samples = X_full.shape[0] n_features = X_full.shape[1] # Estimate the score on the entire dataset, with no missing values
# 随机森林--回归 random_state-随机种子 n_estimator 森林里树的数目 estimator = RandomForestRegressor(random_state=0, n_estimators=100)
# 交叉验证分类器的准确率 score = cross_val_score(estimator, X_full, y_full).mean() print("Score with the entire dataset = %.2f" % score) # Add missing values in 75% of the lines missing_rate = 0.75 n_missing_samples = int(np.floor(n_samples * missing_rate))
# hstack 把两个数组拼接起来-行数需要一致 missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=np.bool), np.ones(n_missing_samples, dtype=np.bool)))
# 打乱随机数组顺序
rng.shuffle(missing_samples) missing_features = rng.randint(0, n_features, n_missing_samples) # Estimate the score without the lines containing missing values X_filtered = X_full[~missing_samples, :] y_filtered = y_full[~missing_samples] estimator = RandomForestRegressor(random_state=0, n_estimators=100) score = cross_val_score(estimator, X_filtered, y_filtered).mean() print("Score without the samples containing missing values = %.2f" % score) # Estimate the score after imputation of the missing values X_missing = X_full.copy() X_missing[np.where(missing_samples)[0], missing_features] = 0 y_missing = y_full.copy() estimator = Pipeline([("imputer", Imputer(missing_values=0, strategy="mean", axis=0)), ("forest", RandomForestRegressor(random_state=0, n_estimators=100))]) score = cross_val_score(estimator, X_missing, y_missing).mean() print("Score after imputation of the missing values = %.2f" % score)

---------------------------------------------------
补充:
A. numpy.where()用法:

原文地址:https://www.cnblogs.com/Arborday/p/8327426.html