Model selection and training/validation/test sets

On overfitting, consider a hypothesis such as

\[{h_\theta }\left( x \right) = {\theta _0} + {\theta _1}x + {\theta _2}{x^2} + {\theta _3}{x^3} + {\theta _4}{x^4}\]

Once the parameters θ₀, θ₁, θ₂, θ₃, θ₄ have been fit to some set of data (the training set), the error measured on that same data (the training error J(θ)) is likely to be lower than the actual generalization error, that is, the error on data the model has not seen, such as a test set.
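To make the gap between training error and generalization error concrete, here is a minimal sketch. It assumes the usual squared-error cost J(θ) = 1/(2m) Σ (h_θ(x) − y)², and it uses a made-up noisy quadratic as the data source; the data, the sample sizes, and the helper names `squared_error_cost` and `sample` are illustrative assumptions, not part of the original notes.

```python
import numpy as np


def squared_error_cost(theta, x, y):
    """Squared-error cost 1/(2m) * sum((h_theta(x) - y)^2) for a polynomial h_theta.

    theta follows the np.polyfit convention: highest-degree coefficient first.
    """
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(x)))


rng = np.random.default_rng(0)


def sample(m):
    # Hypothetical data source: a noisy quadratic, chosen only for illustration.
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.3, size=m)
    return x, y


x_train, y_train = sample(6)    # small training set
x_new, y_new = sample(1000)     # unseen data drawn from the same source

# Fit the five parameters of the degree-4 hypothesis to the training set only.
theta = np.polyfit(x_train, y_train, deg=4)

train_error = squared_error_cost(theta, x_train, y_train)
new_error = squared_error_cost(theta, x_new, y_new)

print(f"training error J(theta): {train_error:.4f}")
print(f"error on unseen data:    {new_error:.4f}")
```

With five parameters and only six training points, the fit can follow the noise in the training set closely, so the first number will typically come out well below the second.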

Suppose we have the following candidate models:

\[\begin{array}{l}
{h_\theta }\left( x \right) = {\theta _0} + {\theta _1}x\\
{h_\theta }\left( x \right) = {\theta _0} + {\theta _1}x + {\theta _2}{x^2}\\
\vdots\\
{h_\theta }\left( x \right) = {\theta _0} + {\theta _1}x + \cdots + {\theta _{10}}{x^{10}}
\end{array}\]

Which one should we choose?

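A straightforward way to choose the model is:

1. Train each model on the training set to obtain its parameters θ.
2. Apply the hypothesis obtained from each model to the test set.
3. Pick the model with the smallest error on the test set.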
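The steps above can be written compactly. This assumes the usual squared-error cost, which the notes do not spell out; the symbols θ^{(d)} (the parameters of the degree-d model after training), m_{test} (the number of test examples), and d^* (the chosen degree) are introduced here only for illustration.

\[{J_{test}}\left( {{\theta ^{(d)}}} \right) = \frac{1}{{2{m_{test}}}}\sum\limits_{i = 1}^{{m_{test}}} {{{\left( {{h_{{\theta ^{(d)}}}}\left( {x_{test}^{(i)}} \right) - y_{test}^{(i)}} \right)}^2}} ,\qquad {d^*} = \arg \mathop {\min }\limits_{d \in \{ 1, \ldots ,10\} } {J_{test}}\left( {{\theta ^{(d)}}} \right)\]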

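The problem with this approach is that the choice of model then depends on the test set: the model was picked because it performed well on that particular data, so its test error is an optimistic estimate and says little about how it will perform on new data.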

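The solution is:

Split the data into a training set (60%), a cross validation set (20%), and a test set (20%).

Use the cross validation set, not the test set, to select the model. The test set is then held back to estimate the generalization error of the selected model.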
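The whole procedure might look like the sketch below. It assumes a squared-error cost, a synthetic noisy sine data set, and polynomial models of degree 1 to 10 fit with `np.polyfit`; the data, the sample size, and the names `cost`, `thetas`, and `cv_errors` are illustrative assumptions rather than anything prescribed by the notes.

```python
import numpy as np


def cost(theta, x, y):
    """Squared-error cost 1/(2m) * sum((h_theta(x) - y)^2) for a polynomial h_theta."""
    residuals = np.polyval(theta, x) - y
    return float(np.mean(residuals ** 2) / 2)


rng = np.random.default_rng(0)

# Hypothetical data set, for illustration only.
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=m)

# Shuffle the indices, then split 60% / 20% / 20%.
order = rng.permutation(m)
n_train = int(0.6 * m)
n_cv = int(0.2 * m)
train_idx = order[:n_train]
cv_idx = order[n_train:n_train + n_cv]
test_idx = order[n_train + n_cv:]

# Fit every candidate on the training set and score it on the cross validation set.
thetas = {}
cv_errors = {}
for degree in range(1, 11):
    thetas[degree] = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    cv_errors[degree] = cost(thetas[degree], x[cv_idx], y[cv_idx])

# Select the model by cross validation error, never by test error.
best_degree = min(cv_errors, key=cv_errors.get)

# The test set is touched once, only to estimate the generalization error.
test_error = cost(thetas[best_degree], x[test_idx], y[test_idx])

print(f"selected degree: {best_degree}")
print(f"cross validation error: {cv_errors[best_degree]:.4f}")
print(f"test error: {test_error:.4f}")
```

Because the test set played no part in choosing `best_degree`, `test_error` is a fairer estimate of performance on new data than the cross validation error of the winning model.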