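(Repost) The coefficient of determination R²

Some explanations of this topic are really poorly done, so let me walk through R² in plain terms.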

Calculating R-squared

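Under a linear regression model we can compute two squared-error totals: SE(line), the squared error of y around the fitted line, and SE(mean of y), the squared error of y around its own mean.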

The statistic R² describes the proportion of variance in the response variable explained by the predictor variable.

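How should we read this sentence? Y has a squared error of its own around its mean, SE(mean of y). Under the model there is a second squared error, between Y and its predicted values, SE(line). If the model fits perfectly, SE(line) = 0 and R² is 1: all of the variance is explained by the model (you can picture this as a completely overfitted model).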
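A minimal sketch of that reading in R, using the ten points from the worked example later in this post and R's built-in lm() for the fit. The names se_line and se_mean are my own, chosen to mirror SE(line) and SE(mean of y).

```r
# Ten points taken from the worked example later in this post
x <- c(60, 34, 12, 34, 71, 28, 96, 34, 42, 37)
y <- c(301, 169, 47, 178, 365, 126, 491, 157, 202, 184)

fit <- lm(y ~ x)                      # least-squares line

se_line <- sum((y - fitted(fit))^2)   # squared error of y around the fitted line
se_mean <- sum((y - mean(y))^2)       # squared error of y around its mean

# Share of the error around the mean that the line removes
r2 <- 1 - se_line / se_mean
print(r2)
```

If the line passed through every point, se_line would be 0 and r2 would be exactly 1.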

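The coefficient of determination (决定系数; some textbooks translate it as 判定系数) is also called the goodness of fit.

It reflects what percentage of the fluctuation in y can be described by the fluctuation in x, that is, what percentage of the variation in the dependent variable Y can be explained by the controlled independent variable X.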

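In simple linear regression (one predictor, fitted by least squares with an intercept), the value of the coefficient of determination equals the square of the correlation coefficient between x and y.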
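This is easy to check numerically. The sketch below assumes the ten points from the example later in this post and compares cor(x, y)^2 with the R² that lm() reports.

```r
x <- c(60, 34, 12, 34, 71, 28, 96, 34, 42, 37)
y <- c(301, 169, 47, 178, 365, 126, 491, 157, 202, 184)

r     <- cor(x, y)                      # Pearson correlation coefficient
r2_lm <- summary(lm(y ~ x))$r.squared   # R^2 of the simple linear regression

print(c(cor_squared = r^2, lm_r_squared = r2_lm))
print(isTRUE(all.equal(r^2, r2_lm)))    # equal up to floating-point tolerance
```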

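Formula: R² = SSR/SST = 1 - SSE/SST

Here SST = SSR + SSE, where SST (total sum of squares) is the total sum of squares, SSR (regression sum of squares) is the regression sum of squares, and SSE (error sum of squares) is the residual sum of squares.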
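Written out for n observations y_i with fitted values ŷ_i and mean ȳ, the three sums are as follows. These are the standard definitions, and they match the sst, sse and ssr lines in the R code further down.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \\
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \\
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
R^2 &= \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.
\end{aligned}
```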

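In analysis-of-variance terms, between-group variation / total variation × 100% is the so-called R-square.

Within-group variation (SSE) + between-group variation (SSA) = total variation (SST), which gives the formula R squared = 1 - SSE/SST. I assume you already know how to compute the within-group, between-group and total variation, so I won't write that out.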
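For readers who do want the computation, here is a minimal sketch in R. The data are made up purely for illustration (three groups of four values), and the variable names are my own.

```r
# Made-up illustrative data: three groups, four observations each
values <- c(4, 5, 6, 5,  8, 9, 7, 8,  12, 11, 13, 12)
group  <- factor(rep(c("a", "b", "c"), each = 4))

grand_mean <- mean(values)
group_mean <- ave(values, group)   # each observation's own group mean

sst <- sum((values - grand_mean)^2)       # total variation
ssa <- sum((group_mean - grand_mean)^2)   # between-group variation
sse <- sum((values - group_mean)^2)       # within-group variation

r2 <- ssa / sst                           # same value as 1 - sse / sst
print(c(sst = sst, ssa = ssa, sse = sse, r2 = r2))
```

Because group_mean repeats each group's mean once per observation, summing over observations weights every group by its size automatically.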

Regression sum of squares: SSR (Sum of Squares for Regression) = ESS (explained sum of squares)

Residual sum of squares: SSE (Sum of Squares for Error) = RSS (residual sum of squares)

Total sum of squares: SST (Sum of Squares for Total) = TSS (total sum of squares)

SSE + SSR = SST, or equivalently RSS + ESS = TSS
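Why the identity holds for a least-squares line ŷ = mx + b with an intercept, as a short derivation. This is standard algebra rather than something from the original post, and e_i = y_i - ŷ_i denotes the residual.

```latex
\begin{aligned}
\sum_i (y_i - \bar{y})^2
  &= \sum_i \bigl[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\bigr]^2 \\
  &= \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\mathrm{SSE}}
   + \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\mathrm{SSR}}
   + 2 \sum_i e_i (\hat{y}_i - \bar{y}), \\
\sum_i e_i (\hat{y}_i - \bar{y})
  &= m \sum_i e_i (x_i - \bar{x})
   = m \Bigl(\sum_i e_i x_i - \bar{x} \sum_i e_i\Bigr) = 0 .
\end{aligned}
```

The last line uses ŷ_i - ȳ = m(x_i - x̄), which follows from b = ȳ - m x̄, together with the two least-squares conditions Σ e_i = 0 and Σ e_i x_i = 0. For a line not fitted by least squares the cross term need not vanish, and the decomposition can fail.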

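Meaning: the larger the goodness of fit, the better the independent variable explains the dependent variable, the larger the share of the total variation that is attributable to the independent variable, and the more tightly the observations cluster around the regression line.

Range: 0 to 1 (for a least-squares fit with an intercept, as used here).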

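Example:

Suppose we have 10 points, as shown in the figure below (the plot(x, y) call in the code reproduces it):

Let's use R to find the regression line and R²: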

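# Simple linear regression by hand: fit the line and compute the sums of squares
mylr = function(x,y){
  
  plot(x,y)                    # scatter plot of the data
  
  x_mean = mean(x)
  y_mean = mean(y)
  xy_mean = mean(x*y)
  xx_mean = mean(x*x)
  
  # Least-squares slope (m) and intercept (b)
  m = (x_mean*y_mean - xy_mean)/(x_mean^2 - xx_mean)
  b = y_mean - m*x_mean
  
  f = m*x+b                    # fitted values on the regression line
  
  lines(x,f)                   # draw the fitted line over the scatter plot
  
  sst = sum((y-y_mean)^2)      # total sum of squares
  sse = sum((y-f)^2)           # residual (error) sum of squares
  ssr = sum((f-y_mean)^2)      # regression sum of squares
  
  result = c(m,b,sst,sse,ssr)
  names(result) = c('m','b','sst','sse','ssr')
  
  return(result)
}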
 
x = c(60,34,12,34,71,28,96,34,42,37)
y = c(301,169,47,178,365,126,491,157,202,184)
 
f = mylr(x,y)
 
f['m']               # slope
f['b']               # intercept
f['sse']+f['ssr']    # should match sst on the next line
f['sst']
 
R2 = unname(f['ssr']/f['sst'])   # drop the inherited 'ssr' name
R2

The resulting equation is: f(x) = 5.3x - 15.5

R² is about 0.998 (99.8%), which means x explains y very well.
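As a cross-check, R's built-in lm() can fit the same line. This is a sketch with the same ten points, not part of the original post, and it should agree with the slope of about 5.3, the intercept of about -15.5 and the R² of about 0.998 reported above.

```r
x <- c(60, 34, 12, 34, 71, 28, 96, 34, 42, 37)
y <- c(301, 169, 47, 178, 365, 126, 491, 157, 202, 184)

fit <- lm(y ~ x)                 # least-squares fit with an intercept

print(coef(fit))                 # "(Intercept)" is b, "x" is the slope m
print(summary(fit)$r.squared)    # coefficient of determination
```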

---------------------
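Author: snowdroptulip
Source: CSDN
Original: https://blog.csdn.net/snowdroptulip/article/details/79022532
Copyright notice: This is an original article by the blogger; please include a link to the original post when reposting.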