Normalization - Standardization - Centering

It is easy to confuse these three transformations.

Feature standardization

Why

Features with large numeric values dominate the computation. When we intend the features to be equally weighted, their differing value ranges distort the result, so we need to normalize them.

Example

time    distance    weight
1.2     5000        80
1.6     6000        90
1.0     3000        50

For example, if we consider time, distance, and weight to be equally important, feature analysis will clearly show that distance has by far the largest influence on the result.
We therefore normalize the values into the 0~1 range.

Min-max normalization

$x_{\text{new}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

# Min-max normalization: rescale values to the [0, 1] range
cle <- function(df) {
    df_new <- (df - min(df)) / (max(df) - min(df))
    return(df_new)
}
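
As a minimal sketch of the same idea in Python (scikit-learn's MinMaxScaler is not mentioned in the original post, it is just a convenient equivalent; the data is the example table above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: time, distance, weight (the example table above)
X = np.array([[1.2, 5000.0, 80.0],
              [1.6, 6000.0, 90.0],
              [1.0, 3000.0, 50.0]])

# Each column is rescaled independently: (x - min) / (max - min)
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
# Every column now ranges from 0 to 1, so distance no longer dominates
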

Mean-variance (z-score) standardization

$x_{\text{scale}} = \frac{x - x_{\text{mean}}}{s}$

# Mean-variance (z-score) standardization: zero mean, unit standard deviation
cle <- function(df) {
    df_new <- (df - mean(df)) / sd(df)   # R uses sd(), not std()
    return(df_new)
}

Python's scikit-learn provides a StandardScaler class that applies mean-variance standardization directly to NumPy arrays.
For reference, see the examples below.
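
A minimal sketch of applying StandardScaler directly to a NumPy array (the data is again the example table above); each column is transformed to mean 0 and standard deviation 1:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: time, distance, weight (the example table above)
X = np.array([[1.2, 5000.0, 80.0],
              [1.6, 6000.0, 90.0],
              [1.0, 3000.0, 50.0]])

# fit_transform subtracts each column's mean and divides by its standard deviation
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # 1 for every column

Note that StandardScaler divides by the population standard deviation, while R's sd() uses the sample standard deviation, so the two differ slightly on small samples.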

Standardization

scale

# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets (X and y, the feature matrix and labels, are assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

<script.py> output:
    Accuracy with Scaling: 0.7700680272108843
    Accuracy without Scaling: 0.6979591836734694

Clearly, the model trained on standardized data achieves higher accuracy: k-NN is distance-based, so without scaling the features with larger ranges dominate the distance computation.

Original post: https://www.cnblogs.com/gaowenxingxing/p/12295207.html