Article translation: Chapter 7, Sections 10-12

10 Measuring prediction performance using ROCR

A receiver operating characteristic (ROC) curve illustrates the performance of a binary classifier by plotting the true positive rate against the false positive rate at different cut points. The plot is most commonly used to calculate the area under the curve (AUC), which summarizes the performance of a classification model in a single number. In this recipe, we will demonstrate how to draw an ROC curve and calculate the AUC to measure the performance of a classification model.
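To make the definition concrete, the following is a minimal base R sketch of how the points of an ROC curve and the AUC can be computed by hand. The scores and labels are made-up toy values, not the churn data, and `roc_points` and `auc_trapezoid` are hypothetical helper names introduced only for this illustration; the recipe itself relies on the ROCR package instead.

```r
# Toy classifier scores and true labels (hypothetical data, not the churn dataset)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)  # 1 = positive, 0 = negative

# Hypothetical helper: true and false positive rates at each cut point.
# An observation is predicted positive when its score is >= the cutoff.
roc_points <- function(scores, labels) {
  cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 1) / sum(labels == 1))
  fpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 0) / sum(labels == 0))
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

# Hypothetical helper: area under the ROC points using the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(scores, labels)
auc <- auc_trapezoid(roc)
print(roc)
print(auc)  # 0.8125 for this toy data
```

Each row of `roc` corresponds to one cut point; lowering the cutoff moves the curve from (0, 0) toward (1, 1).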

Getting ready

In this recipe, we will continue using the telecom churn dataset as our example dataset.

How to do it...

Perform the following steps to plot an ROC curve and compute the AUC for an SVM classifier:

1. First, you should install and load the ROCR package:

> install.packages("ROCR")

> library(ROCR)

2. Train the svm model on the training dataset with probability set to TRUE:

> library(e1071)

> svmfit = svm(churn ~ ., data = trainset, probability = TRUE)

3. Make predictions on the testing dataset with the trained model, again with probability set to TRUE:

> pred = predict(svmfit, testset[, !names(testset) %in% c("churn")],

+                probability = TRUE)

4. Obtain the predicted probability of the yes label:

> pred.prob = attr(pred, "probabilities")

> pred.to.roc = pred.prob[, 2]

5. Use the prediction function to generate a prediction result:

> pred.rocr = prediction(pred.to.roc, testset$churn)

6. Use the performance function to obtain the performance measurement:

> perf.rocr = performance(pred.rocr, measure = "auc",

+                         x.measure = "cutoff")

> perf.tpr.rocr = performance(pred.rocr, "tpr","fpr")

7. Visualize the ROC curve using the plot function:

> plot(perf.tpr.rocr, colorize = TRUE,

+      main = paste("AUC:", (perf.rocr@y.values)))

Figure 6: The ROC curve for the SVM classifier

How it works...

In this recipe, we demonstrated how to generate an ROC curve to illustrate the performance of a binary classifier. First, we install and load the ROCR library. Then, we use svm, from the e1071 package, to train a classification model, and use the model to predict labels for the testing dataset. Next, we use the prediction function (from the ROCR package) to generate prediction results. We then use the performance function to obtain the true positive rate against the false positive rate. Finally, we use the plot function to draw the ROC curve, and add the AUC value to the title. In this example, the AUC value is 0.92, which indicates that the svm classifier performs well in classifying the telecom churn dataset.
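One way to read an AUC value such as the one above is as the fraction of (positive, negative) pairs that the classifier ranks in the right order, with ties counted as one half. The sketch below illustrates this reading in base R on made-up toy scores; `auc_pairwise` is a hypothetical helper name, and the numbers are not taken from the churn model.

```r
# Toy classifier scores and true labels (hypothetical data, not the churn dataset)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)  # 1 = positive, 0 = negative

# Hypothetical helper: AUC as the share of positive/negative pairs in which
# the positive observation receives the higher score (ties count 0.5)
auc_pairwise <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

auc <- auc_pairwise(scores, labels)
print(auc)  # 13 of 16 pairs are ordered correctly, so 0.8125
```

Under this reading, an AUC of 0.5 corresponds to random ranking and an AUC of 1 to a perfect ranking of positives above negatives.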

See also

For those interested in the concept and terminology of ROC, you can refer to

http://en.wikipedia.org/wiki/Receiver_operating_characteristic

11 Comparing an ROC curve using the caret package

In previous chapters, we introduced many classification methods; each method has its own advantages and disadvantages. However, to choose the best fitted model, you need to compare the performance measures generated from the different prediction models. To make the comparison easy, the caret package allows us to generate and compare the performance of models. In this recipe, we will use the functions provided by the caret package to compare models trained with different algorithms on the same dataset.

Getting ready

Here, we will continue to use the telecom dataset as our input data source.

How to do it...

Perform the following steps to generate an ROC curve of each fitted model:

1. Install and load the library, pROC:

> install.packages("pROC")

> library("pROC")

2. Set up the training control with 10-fold cross-validation repeated 3 times:

> control = trainControl(method = "repeatedcv",

+                            number = 10,

+                            repeats = 3,

+                            classProbs = TRUE,

+                            summaryFunction = twoClassSummary)

3. Then, you can train a classifier on the training dataset using glm:

> glm.model= train(churn ~ .,

+                     data = trainset,

1. Resample the three generated models:

> cv.values = resamples(list(glm = glm.model, svm = svm.model,

+                            rpart = rpart.model))

2. Then, you can obtain a summary of the resampling result:

> summary(cv.values)
Call:
summary.resamples(object = cv.values)

Models: glm, svm, rpart
Number of resamples: 30

ROC
        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
glm   0.7206  0.7847 0.8126 0.8116  0.8371 0.8877    0
svm   0.8337  0.8673 0.8946 0.8929  0.9194 0.9458    0
rpart 0.2802  0.7159 0.7413 0.6769  0.8105 0.8821    0

Sens
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
glm   0.08824  0.2000 0.2286 0.2194  0.2517 0.3529    0
svm   0.44120  0.5368 0.5714 0.5866  0.6424 0.7143    0
rpart 0.20590  0.3742 0.4706 0.4745  0.5929 0.6471    0

Spec
        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
glm   0.9442  0.9608 0.9746 0.9701  0.9797 0.9949    0
svm   0.9442  0.9646 0.9746 0.9740  0.9835 0.9949    0
rpart 0.9492  0.9709 0.9797 0.9780  0.9848 0.9949    0

3. Use dotplot to plot the resampling result for the ROC metric:

> dotplot(cv.values, metric = "ROC")

4. Also, you can use a box-whisker plot to plot the resampling result:

> bwplot(cv.values, layout = c(3, 1))

How it works...

In this recipe, we demonstrate how to measure the performance differences among three fitted models using resampling. First, we use the resamples function to collect the resampling statistics of each fitted model (svm.model, glm.model, and rpart.model). Then, we use the summary function to obtain the statistics of the three models for the ROC, sensitivity, and specificity metrics. Next, we apply dotplot to the resampling result to see how the ROC metric varies across models. Last, we use a box-whisker plot to show all three metrics for the different models in a single figure.
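The summary reports 30 resamples because 10-fold cross-validation repeated 3 times yields 10 x 3 held-out folds per model. The following base R sketch shows the idea under stated assumptions: the data are synthetic, `make_folds` is a hypothetical helper (it is not how caret builds its folds internally), the fold names are illustrative, and the metric is the accuracy of a trivial majority-class baseline rather than any of the models above.

```r
set.seed(1)
n <- 100        # hypothetical number of training rows
k <- 10         # folds per repetition
repeats <- 3    # number of repetitions

# Synthetic binary labels (assumption: roughly 30% positives)
y <- rbinom(n, size = 1, prob = 0.3)

# Hypothetical helper: one list element per resample, holding held-out row indices
make_folds <- function(n, k, repeats) {
  folds <- list()
  for (r in seq_len(repeats)) {
    # Shuffle fold labels 1..k across the n rows for this repetition
    assignment <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      folds[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(assignment == f)
    }
  }
  folds
}

folds <- make_folds(n, k, repeats)
print(length(folds))  # 30 resamples

# Per-resample metric: accuracy of always predicting the negative class
baseline_acc <- sapply(folds, function(idx) mean(y[idx] == 0))

# Same kind of five-number-plus-mean summary that the recipe prints per model
print(summary(baseline_acc))
```

With real models, each resample would instead hold the ROC, sensitivity, and specificity measured on the held-out fold, and the summary would be computed per model and per metric.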