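Machine learning with sklearn (24): Model evaluation (4): Quantifying the quality of predictions (1): The scoring parameter: defining model evaluation rules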

There are 3 different APIs for evaluating the quality of a model's predictions:

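Estimator score method: Estimators have a score method that provides a default evaluation criterion for the problem they are designed to solve. It is not discussed on this page, but in the documentation of each estimator.
Scoring parameter: Model-evaluation tools that use cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section on the scoring parameter: defining model evaluation rules.
Metric functions: The metrics module implements functions that assess prediction error for specific purposes. These metrics are detailed in the sections on classification metrics, multilabel ranking metrics, regression metrics and clustering metrics.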

Finally, dummy estimators are useful to get a baseline value of these metrics for random predictions.
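A minimal sketch of such a baseline, assuming the iris dataset and an SVC as the model under test (both are illustrative choices, not prescribed by the text). `DummyClassifier` with `strategy="stratified"` draws random predictions from the class distribution, so its cross-validated accuracy gives a floor that a real model should beat.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Baseline: random predictions that follow the training class distribution.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline_scores = cross_val_score(baseline, X, y, scoring="accuracy", cv=5)

# Model under test (an assumption made for this illustration).
model = SVC()
model_scores = cross_val_score(model, X, y, scoring="accuracy", cv=5)

print("baseline accuracy:", baseline_scores.mean())
print("SVC accuracy:     ", model_scores.mean())
```

Comparing the two means shows how much of the model's score comes from actually learning the data rather than from the class balance alone.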

See also: For "pairwise" metrics, which are computed between samples rather than between estimators or predictions, see the section on pairwise metrics, affinities and kernels.

1. The `scoring` parameter: defining model evaluation rules

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls which metric they apply to the estimators being evaluated.

1.1. Common cases: predefined values

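For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Metrics that measure the distance between the model and the data, such as metrics.mean_squared_error, are therefore available as neg_mean_squared_error, which returns the negated value of the metric.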

Scoring	Function	Comment
Classification
‘accuracy’	`metrics.accuracy_score`
‘average_precision’	`metrics.average_precision_score`
‘f1’	`metrics.f1_score`	for binary targets
‘f1_micro’	`metrics.f1_score`	micro-averaged
‘f1_macro’	`metrics.f1_score`	macro-averaged
‘f1_weighted’	`metrics.f1_score`	weighted average
‘f1_samples’	`metrics.f1_score`	by multilabel sample
‘neg_log_loss’	`metrics.log_loss`	requires `predict_proba` support
‘precision’ etc.	`metrics.precision_score`	suffixes apply as with ‘f1’
‘recall’ etc.	`metrics.recall_score`	suffixes apply as with ‘f1’
‘roc_auc’	`metrics.roc_auc_score`
Clustering
‘adjusted_mutual_info_score’	`metrics.adjusted_mutual_info_score`
‘adjusted_rand_score’	`metrics.adjusted_rand_score`
‘completeness_score’	`metrics.completeness_score`
‘fowlkes_mallows_score’	`metrics.fowlkes_mallows_score`
‘homogeneity_score’	`metrics.homogeneity_score`
‘mutual_info_score’	`metrics.mutual_info_score`
‘normalized_mutual_info_score’	`metrics.normalized_mutual_info_score`
‘v_measure_score’	`metrics.v_measure_score`
Regression
‘explained_variance’	`metrics.explained_variance_score`
‘neg_mean_absolute_error’	`metrics.mean_absolute_error`
‘neg_mean_squared_error’	`metrics.mean_squared_error`
‘neg_mean_squared_log_error’	`metrics.mean_squared_log_error`
‘neg_median_absolute_error’	`metrics.median_absolute_error`
‘r2’	`metrics.r2_score`

Usage examples:

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_val_score(clf, X, y, scoring='neg_log_loss')
array([-0.07..., -0.16..., -0.06...])
>>> model = svm.SVC()
>>> cross_val_score(model, X, y, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']

Note

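The values listed by the ValueError exception correspond to the functions measuring prediction accuracy described in the following sections. The scorer objects for those functions are stored in the dictionary sklearn.metrics.SCORERS. Newer scikit-learn releases appear to have dropped this dictionary in favour of sklearn.metrics.get_scorer_names(), which returns the valid names, so check the version you have installed.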
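As a hedged illustration of how a predefined scoring string maps to a scorer object, the sketch below looks up 'neg_mean_squared_error' with `sklearn.metrics.get_scorer` and checks that calling the scorer gives the negated value of `mean_squared_error`. The synthetic data and the `LinearRegression` model are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Small synthetic regression problem (illustrative data only).
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.randn(50)

reg = LinearRegression().fit(X, y)

# Look up the scorer object behind the predefined string.
scorer = get_scorer("neg_mean_squared_error")
score = scorer(reg, X, y)  # scorers are called as (estimator, X, y)

# The underlying metric function takes (y_true, y_pred) instead.
mse = mean_squared_error(y, reg.predict(X))

print(score, -mse)
```

The two printed numbers agree up to floating point error, and the score is never positive, which is the "higher is better" convention applied to an error metric.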

1.2. Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions that measure a prediction error given the ground truth and the prediction:

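Functions ending with _score return a value to maximize: the higher, the better.
Functions ending with _error or _loss return a value to minimize: the lower, the better. When converting one of these into a scorer object with make_scorer, set the greater_is_better parameter to False (it defaults to True; see the parameter description below).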
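The two naming conventions above can be sketched as follows. This is a minimal example under assumed data: it wraps `r2_score` (a `_score` function) with the default setting and `mean_absolute_error` (an `_error` function) with `greater_is_better=False`, then compares a fitted `LinearRegression` against a `DummyRegressor` baseline. The models and data are illustrative choices.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score

# Nearly linear synthetic data (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(60, 1)
y = 3.0 * X[:, 0] + 0.05 * rng.randn(60)

good = LinearRegression().fit(X, y)
baseline = DummyRegressor(strategy="mean").fit(X, y)

# A _score function: higher is already better, so the default is fine.
r2_scorer = make_scorer(r2_score)
# An _error function: the scorer negates it so that higher is better.
neg_mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

for name, scorer in [("r2", r2_scorer), ("neg_mae", neg_mae_scorer)]:
    print(name, scorer(good, X, y), scorer(baseline, X, y))
```

With both scorers the better model gets the larger value, so tools such as GridSearchCV can always pick the maximum.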

Metrics available for various machine learning tasks are detailed in the sections below.

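Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, as fbeta_score does. In such cases you need to generate an appropriate scoring object. The simplest way to generate a callable object for scoring is make_scorer, which converts a metric function into a callable that can be used for model evaluation.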

One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter of the fbeta_score function:

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
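The grid above is only constructed, not fitted. A hedged, self-contained sketch of running it end to end follows. It assumes a synthetic binary dataset from `make_classification`, since `fbeta_score` with its default settings expects binary targets, and raises `max_iter` to help `LinearSVC` converge; both are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Binary toy data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_classes=2, random_state=0)

# F-beta with beta=2 weights recall more heavily than precision.
ftwo_scorer = make_scorer(fbeta_score, beta=2)

grid = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid={"C": [1, 10]},
    scoring=ftwo_scorer,
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

`best_score_` is the mean cross-validated F2 score of the selected value of C.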

A second use case is to build a completely custom scorer object from a simple Python function using make_scorer, which can take several parameters:

The Python function you want to use (my_custom_loss_func in the example below).
Whether the Python function returns a score (greater_is_better=True, the default) or a loss (greater_is_better=False). If it returns a loss, the scorer object negates the output of the Python function, conforming to the cross-validation convention that scorers return higher values for better models.
For classification metrics only: whether the Python function you provide requires continuous decision certainties (needs_threshold=True). The default value is False.
Any additional parameters, such as beta or labels in f1_score.

Here is an example of building custom scorers and of using the greater_is_better parameter:

>>> import numpy as np
>>> def my_custom_loss_func(y_true, y_pred):
...     diff = np.abs(y_true - y_pred).max()
...     return np.log1p(diff)
...
>>> # score will negate the return value of my_custom_loss_func,
>>> # which will be np.log(2), 0.693, given the values for X
>>> # and y defined below.
>>> score = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> X = [[1], [1]]
>>> y = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(X, y)
>>> my_custom_loss_func(y, clf.predict(X))
0.69...
>>> score(clf, X, y)
-0.69...

1.3. Implementing your own scoring object

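You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the make_scorer factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules: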

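It can be called with the parameters (estimator, X, y), where estimator is the model to be evaluated, X is the validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).
It returns a floating point number that quantifies the prediction quality of estimator on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns a loss, that value should be negated.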
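The two rules above can be sketched with a plain function. In this minimal example the scorer name `neg_max_abs_error`, the `Ridge` model and the synthetic data are all assumptions made for illustration. The function takes `(estimator, X, y)`, computes the worst absolute error, and negates it so that higher is better.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def neg_max_abs_error(estimator, X, y):
    """Hypothetical scorer: negated worst-case absolute error on (X, y)."""
    y_pred = estimator.predict(X)
    worst = np.max(np.abs(np.asarray(y) - y_pred))
    # The error is a loss, so negate it to follow the scorer convention.
    return -float(worst)


X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# A callable that follows the protocol can be passed directly as scoring.
scores = cross_val_score(Ridge(), X, y, scoring=neg_max_abs_error, cv=3)
print(scores)
```

Because the function receives the fitted estimator itself, it can also call methods other than predict, which make_scorer does not directly allow.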

Note: Using custom scorers in functions where n_jobs > 1

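Defining your custom scoring function alongside the calling function should work out of the box with the default joblib backend (loky), but importing it from another module is a more robust approach that works independently of the joblib backend.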

For instance, to use n_jobs greater than 1 in the example below, the custom_scoring_function function is saved in a user-created module (custom_scorer_module.py) and imported:
>>> from custom_scorer_module import custom_scoring_function
>>> cross_val_score(model,
...                 X_train,
...                 y_train,
...                 scoring=make_scorer(custom_scoring_function, greater_is_better=False),
...                 cv=5,
...                 n_jobs=-1)
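The text does not show what custom_scorer_module.py contains. As a purely hypothetical sketch, it could hold a loss function with the `(y_true, y_pred)` signature that make_scorer expects. Root mean squared error is used here only as a stand-in, chosen because it is a loss and therefore matches `greater_is_better=False` in the call above.

```python
# custom_scorer_module.py (hypothetical contents; the real module is not shown)
import numpy as np


def custom_scoring_function(y_true, y_pred):
    """Root mean squared error: a loss, so lower values are better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Keeping the function in an importable module means worker processes can locate it by name, whichever joblib backend is in use.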

1.4. Using multiple metric evaluation

Scikit-learn also permits evaluation of multiple metrics in GridSearchCV, RandomizedSearchCV and cross_validate.

There are two ways to specify multiple scoring metrics for the scoring parameter:

As an iterable of string metrics:

>>> scoring = ['accuracy', 'precision']

As a dict mapping the scorer name to the scoring function:

>>> from sklearn.metrics import accuracy_score
>>> from sklearn.metrics import make_scorer
>>> scoring = {'accuracy': make_scorer(accuracy_score),
...            'prec': 'precision'}

Note that the dict values can either be scorer functions or one of the predefined metric strings.

Currently, only scorer functions that return a single score can be passed inside the dict. Scorer functions that return multiple values are not permitted and require a wrapper that returns a single metric:

>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import confusion_matrix
>>> # A sample toy binary classification dataset
>>> X, y = datasets.make_classification(n_classes=2, random_state=0)
>>> # Named linear_svc so it does not shadow the sklearn.svm module imported earlier
>>> linear_svc = LinearSVC(random_state=0)
>>> def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
>>> def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
>>> def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
>>> def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
>>> scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
...            'fp': make_scorer(fp), 'fn': make_scorer(fn)}
>>> # cross_validate clones and fits the estimator on each split itself
>>> cv_results = cross_validate(linear_svc, X, y,
...                             scoring=scoring, cv=5)
>>> # Getting the test set true positive scores
>>> print(cv_results['test_tp'])  
[10  9  8  7  8]
>>> # Getting the test set false negative scores
>>> print(cv_results['test_fn'])  
[0 1 2 3 2]
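The list form of the scoring parameter mentioned earlier in this section can be sketched the same way. This minimal example assumes a synthetic binary dataset and a `LogisticRegression` model, both chosen for illustration. Each metric name in the list yields a `test_<name>` entry in the result dict.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Toy binary classification data (illustrative only).
X, y = make_classification(n_classes=2, random_state=0)

cv_results = cross_validate(
    LogisticRegression(),
    X,
    y,
    scoring=["accuracy", "precision"],  # list of predefined metric strings
    cv=5,
)

# Result keys include fit_time, score_time and one test_<metric> per name.
print(sorted(cv_results))
print(cv_results["test_accuracy"].mean(), cv_results["test_precision"].mean())
```

Each `test_<metric>` entry is an array with one score per cross-validation split.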