UDA Machine Learning Basics: Cross Validation

The goal of cross validation is to have as many data points as possible in the training set, to get the best learning result, while also having as many data points as possible in the test set, to get the best validation. The trick is to split the data evenly into k bins. In k-fold cross validation you run k separate experiments: in each experiment you pick one of the k bins as the test set, use the remaining k-1 bins as the training set, train your model, and then measure its performance on the held-out bin. After running all k experiments you average the k test-set results to get the final estimate. In this way you have effectively used all of the data for training and all of the data for testing.
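To make this concrete, here is a minimal sketch of k-fold cross validation using scikit-learn's KFold. The data X, y below are made up and k=5 is just an illustrative choice; only the loop structure matters.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# made-up data: 100 samples with 3 features and binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # each experiment: k-1 bins for training, the remaining bin for testing
    clf = DecisionTreeClassifier()
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], pred))

# the final estimate is the average of the k test-set accuracies
print(sum(scores) / len(scores))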

#!/usr/bin/python


"""
    Starter code for the validation mini-project.
    The first step toward building your POI identifier!

    Start by loading/formatting the data

    After that, it's not our code anymore--it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # cross_validation was moved to model_selection in newer sklearn
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb"))

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
### hold out 30% of the data as a test set
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

from sklearn.tree import DecisionTreeClassifier

# fit a decision tree on the training split and score it on the held-out test split
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))



### it's all yours from here forward! 
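A possible next step, continuing the script above (this is just a sketch of my own, not part of the starter code), is to replace the single 70/30 split with k-fold cross validation on the same features and labels via scikit-learn's cross_val_score; cv=5 is an arbitrary choice here.

import numpy as np
from sklearn.model_selection import cross_val_score

# reuses features, labels and DecisionTreeClassifier from the script above
cv_scores = cross_val_score(DecisionTreeClassifier(), features, labels, cv=5)
print(cv_scores)            # accuracy on each of the 5 folds
print(np.mean(cv_scores))   # averaged estimate of the classifier's accuracy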


Original post: https://www.cnblogs.com/fuhang/p/8512977.html