Statistics and Probability

First, a few concepts:

1.  IQR: interquartile range, IQR = Q3 - Q1.

    Outlier: a value that lies unusually far from the rest of the data.

2.  How do we decide whether a value is an outlier? Use the 1.5 * IQR rule.

For example, take the data: 1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19.

  Step 1: Find the first quartile, the median (second quartile), and the third quartile. Here Q1 = 13, median = 14, Q3 = 18.

  Step 2: Calculate IQR = Q3 - Q1 = 18 - 13 = 5.

  Step 3: Judge. If a number x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR, then x is an outlier. For example, take the number 30: because 30 > 18 + 1.5*5 = 25.5, 30 is judged to be an outlier.

That's it.
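The three steps can be sketched in Python. This is only an illustration: the helper names `quartiles` and `outlier_fences` are made up here, and the quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the middle value when the count is odd), which is the convention that reproduces Q1 = 13 and Q3 = 18 for this data. Other quartile conventions can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # Skip the middle element when the number of values is odd.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(s), median(upper)

def outlier_fences(values):
    """Return the (low, high) cut-offs of the 1.5 * IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
low, high = outlier_fences(data)
outliers = [v for v in data if v < low or v > high]

print(quartiles(data))  # (13, 14, 18)
print(low, high)        # 5.5 25.5
print(30 > high)        # True: 30 would be an outlier
print(outliers)         # [1, 1]: the two 1s fall below the lower cut-off
```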


3. An alternative formula for the population variance:

      sigma^2 = ∑ (xi^2)/n - µ^2
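A quick way to convince yourself that this shortcut matches the usual definition is to compare it with `statistics.pvariance` from the Python standard library. The function name `pvariance_shortcut` and the sample data are assumptions made for this sketch.

```python
import math
from statistics import mean, pvariance

def pvariance_shortcut(values):
    """Population variance as mean of squares minus square of the mean."""
    mu = mean(values)
    return sum(v * v for v in values) / len(values) - mu ** 2

data = [1, 2, 2, 3]
shortcut = pvariance_shortcut(data)
reference = pvariance(data)  # definition: mean of squared deviations from mu

print(shortcut, reference)
print(math.isclose(shortcut, reference))  # True
```

The shortcut is convenient by hand, but in floating point it can lose precision when the values are large compared with their spread, so library routines usually use the definition instead.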

4. What is a z-score?

  A z-score is also called a z-value, standard score, or normal score. In a normal distribution with mean µ and standard deviation sigma, the z-score of a value x is (x - µ)/sigma. It shows how many standard deviations x lies from the mean.

  For example, for a normal distribution with µ = 81 and sigma = 6.3, x = 65 has z-score (65 - 81)/6.3 = -2.54.

  Empirical rule (the 68-95-99.7 rule): in a normal distribution, about 68%, 95%, and 99.7% of the values lie within 1, 2, and 3 standard deviations of the mean.
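Both ideas can be checked numerically. The sketch below computes the z-score from the example and then uses `statistics.NormalDist` from the Python standard library to reproduce the 68-95-99.7 percentages; the helper name `z_score` is an assumption of this example.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z = z_score(65, 81, 6.3)
print(round(z, 2))  # -2.54

# Share of a normal distribution within k standard deviations of the mean.
std_normal = NormalDist()  # mean 0, standard deviation 1
shares = [std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)]
print([round(s, 4) for s in shares])  # [0.6827, 0.9545, 0.9973]
```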

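 5. How to calculate the correlation coefficient r:

      r = ∑ (Zxi * Zyi) / (n-1),  where Zxi = (xi - x_mean)/Sx is the z-score of xi, Zyi = (yi - y_mean)/Sy is the z-score of yi, and Sx, Sy are the sample standard deviations.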
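As a minimal sketch, the formula translates directly into Python. The function name `correlation` is chosen for this example, `statistics.stdev` is the sample standard deviation (it divides by n - 1), and the four points are the ones used in the regression example of this post.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r as the sum of z-score products over n - 1."""
    n = len(xs)
    x_mean, y_mean = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_mean) / sx) * ((y - y_mean) / sy)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
r = correlation(xs, ys)
print(round(r, 3))  # 0.945
```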


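6. In linear regression, the fitted line is

           y_pred = mx + b   (1)

   In (1), the slope m is calculated by

            m = r * Sy/Sx,    (2)

   where Sx and Sy are the sample standard deviations of x and y. The intercept b then follows from the fact that the line passes through (x_mean, y_mean).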

 For example, find the regression line for the 4 points (1,1), (2,2), (2,3), (3,6).

   Step 1: By calculation, x_mean = 2, Sx = 0.816; y_mean = 3, Sy = 2.16.

   Step 2: Calculate r: r = ∑ (Zxi * Zyi) / (n-1) ≈ 0.945.

   Step 3: Calculate m: m = r * Sy/Sx = 2.5.

   Step 4: Substitute (x_mean, y_mean) = (2, 3) into (1): 3 = 2.5*2 + b, so b = -2, and the result is

            y_pred = 2.5x - 2
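The four steps can be reproduced with a short script. This is a sketch under the same conventions as the worked example (sample standard deviations, line through the point of means); `fit_line` is a name invented for illustration.

```python
import math
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (m, b) of y_pred = m*x + b using m = r * Sy / Sx."""
    n = len(xs)
    x_mean, y_mean = mean(xs), mean(ys)   # Step 1: means ...
    sx, sy = stdev(xs), stdev(ys)         # ... and sample standard deviations
    # Step 2: r as the sum of z-score products divided by n - 1.
    r = sum(((x - x_mean) / sx) * ((y - y_mean) / sy)
            for x, y in zip(xs, ys)) / (n - 1)
    m = r * sy / sx                       # Step 3: slope
    b = y_mean - m * x_mean               # Step 4: line passes through the means
    return m, b

m, b = fit_line([1, 2, 2, 3], [1, 2, 3, 6])
print(round(m, 3), round(b, 3))  # 2.5 -2.0
```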


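7. What is the coefficient of determination?

   r^2 is called the coefficient of determination. It compares the squared error of the data around the mean of y, SE(y_mean), with the squared error around the regression line, SE(line):

  (1)  SE(y_mean) = (y1-y_mean)^2 + (y2-y_mean)^2 + (y3-y_mean)^2 + ...

  (2)  SE(line) = (y1-y1_pred)^2 + (y2-y2_pred)^2 + (y3-y3_pred)^2 + ...

  (3)  r^2 = 1 - SE(line)/SE(y_mean)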

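For example:  (a) the squared error around the mean, SE(y_mean), is 41.1879;

              (b) the squared error around the regression line, SE(line), is 13.7627.

       So, from (a) and (b), r^2 = 1 - 13.7627/41.1879 ≈ 0.6659 = 66.59%.

       Thus 0.6659 is the coefficient of determination: 66.59% of the variation in y is explained by the line, which indicates how well the line fits the data.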
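The same calculation can be sketched in Python. The helper `r_squared` is a hypothetical name; the points and the line y_pred = 2.5x - 2 come from the regression example in item 6, and the last line simply redoes the arithmetic with the two squared errors quoted above.

```python
def r_squared(ys, ys_pred):
    """Coefficient of determination: 1 - SE(line) / SE(y_mean)."""
    y_mean = sum(ys) / len(ys)
    se_mean = sum((y - y_mean) ** 2 for y in ys)              # SE(y_mean)
    se_line = sum((y - p) ** 2 for y, p in zip(ys, ys_pred))  # SE(line)
    return 1 - se_line / se_mean

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
ys_pred = [2.5 * x - 2 for x in xs]  # regression line from item 6

r2 = r_squared(ys, ys_pred)
print(round(r2, 4))                      # 0.8929
print(round(1 - 13.7627 / 41.1879, 4))   # 0.6659
```

For the four points, r^2 ≈ 0.8929 agrees with squaring the correlation coefficient r ≈ 0.945.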

 


8. What is the root-mean-square deviation (RMSD)?

   It is also called the "standard deviation of the residuals".

   Ri is the residual of the i-th point, calculated as

     Ri = yi - yi_pred       (1)

   so

     RMSD = sqrt( ∑ (Ri^2) / (n-1) )    (2)
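A minimal sketch of the RMSD, taking the square root of the mean squared residual as the name implies and dividing by n - 1 as in formula (2). Some texts divide by n or n - 2 instead, so treat the divisor as a convention of this post. The name `rmsd` and the reuse of the item 6 example are assumptions of this illustration.

```python
import math

def rmsd(ys, ys_pred):
    """Standard deviation of the residuals, with n - 1 in the denominator."""
    residuals = [y - p for y, p in zip(ys, ys_pred)]  # Ri = yi - yi_pred
    return math.sqrt(sum(r * r for r in residuals) / (len(ys) - 1))

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
ys_pred = [2.5 * x - 2 for x in xs]  # regression line from item 6

value = rmsd(ys, ys_pred)
print(round(value, 4))  # 0.7071
```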


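 9. Another way to obtain the linear regression line: take the partial derivatives of the total squared error with respect to m and b, set both to zero, and solve the resulting system of two linear equations in m and b.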
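Solving that system gives the standard least-squares closed form m = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2) and b = mean(y) - m*mean(x). The sketch below applies it to the four points from item 6; the function name `least_squares` is hypothetical, and the formula is the textbook result rather than something taken from this post.

```python
def least_squares(xs, ys):
    """Slope and intercept from setting both partial derivatives to zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b = mean_y - m * mean_x
    return m, b

m, b = least_squares([1, 2, 2, 3], [1, 2, 3, 6])
print(m, b)  # 2.5 -2.0
```

This gives the same line, y_pred = 2.5x - 2, as the method based on r, Sx, and Sy.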

 

References (Khan Academy):

        https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data

        https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data

        https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/scatterplots-and-correlation/v/calculating-correlation-coefficient-r

        https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/assessing-the-fit-in-least-squares-regression/v/r-squared-or-coefficient-of-determination

Original post: https://www.cnblogs.com/yyagrt/p/11408190.html