Correlation Analysis

1.   Computing correlation coefficients

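The cor() function can compute the following three correlation coefficients:

(1)     Pearson product-moment correlation coefficient: the degree of linear association between two continuous variables.

(2)     Spearman rank correlation coefficient: the degree of association between rank-ordered variables.

(3)     Kendall rank correlation coefficient (Kendall's tau): a nonparametric measure of rank correlation.

Syntax: cor(x, use = , method = )

x: a matrix or data frame;

use: how missing data are handled.

  all.obs: assumes there are no missing data; a missing value raises an error.

  everything: the default; any correlation that involves a missing value is returned as NA.

  complete.obs: listwise deletion (rows with any missing value are dropped);

  pairwise.complete.obs: pairwise deletion (rows are dropped only for the pairs of variables they affect).

method: the type of correlation coefficient: "pearson" (the default), "spearman" or "kendall".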
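The effect of the use and method arguments is easiest to see on a tiny data set. The sketch below uses made-up numbers (the data frame df and its columns a, b, c are purely illustrative and not part of the housing data) with a single missing value in column b.

```r
# Toy data (made up for illustration); column b has one missing value
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default use = "everything": any pair involving b comes back as NA
r_all <- cor(df)

# use = "all.obs": the missing value raises an error, which is caught here
msg <- tryCatch(cor(df, use = "all.obs"),
                error = function(e) conditionMessage(e))

# use = "complete.obs": row 3 is dropped for every pair of variables
r_complete <- cor(df, use = "complete.obs")

# use = "pairwise.complete.obs": row 3 is dropped only for pairs involving b;
# method = "spearman" switches to the rank correlation
r_pairwise <- cor(df, use = "pairwise.complete.obs", method = "spearman")

print(r_all)
print(msg)
print(r_complete)
print(r_pairwise)
```

With pairwise deletion the a/c correlation still uses all five rows, while listwise deletion computes every coefficient from the same four complete rows.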

Original example (built-in state.x77 data set)

> states <- state.x77[, 1:6]
> x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[, c("Life Exp", "Murder")]
> cor(x, y)

Result:

              Life Exp     Murder
Population -0.06805195  0.3436428
Income      0.34025534 -0.2300776
Illiteracy -0.58847793  0.7029752
HS Grad     0.58221620 -0.4879710
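The call above uses the default Pearson coefficient. As a rough comparison, the same x and y can be passed to cor() with the other two methods; the variable names r_pearson, r_spearman and r_kendall are only illustrative.

```r
# Same variables as in the example above, taken from the built-in state.x77 matrix
states <- state.x77[, 1:6]
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]

r_pearson  <- cor(x, y)                       # linear association (default)
r_spearman <- cor(x, y, method = "spearman")  # rank-based
r_kendall  <- cor(x, y, method = "kendall")   # rank-based, Kendall's tau

# Each result is a 4 x 2 matrix: rows from x, columns from y
print(round(r_pearson, 3))
print(round(r_spearman, 3))
print(round(r_kendall, 3))
```

Large differences between the Pearson and the rank-based results would suggest outliers or a monotonic but non-linear relationship.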

Exploring how a house's unit price relates to its area, its floor, and the total number of floors in the building

Data preparation

> house <- read.table("house_data.txt", header = TRUE, sep = "|", fileEncoding = "UTF-8",
+                     stringsAsFactors = FALSE,
+                     colClasses = c("character", "character", "numeric",
+                                    "character", "numeric", "numeric", "character",
+                                    "numeric", "numeric", "character"))
> # sqldf() is only available after the sqldf package has been loaded
> library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
> # Exclude the listings from the community 东郊小镇
> houseXQ <- sqldf("select * from house where community_name != '东郊小镇'", row.names = TRUE)
> # Add the community name as an unordered factor column
> communityFactor <- factor(houseXQ$community_name, ordered = FALSE)
> houseXQ <- cbind(houseXQ, communityFactor)
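The same filtering step can also be done without sqldf. The sketch below is one possible base R equivalent, run on a small made-up data frame because house_data.txt is not reproduced here; the rows and the names "Community A" and "Community B" are hypothetical.

```r
# Hypothetical stand-in for the house data (values made up for illustration)
house <- data.frame(
  community_name = c("东郊小镇", "Community A", "Community B", "Community A"),
  house_area     = c(90, 75, 120, 60),
  stringsAsFactors = FALSE
)

# subset() keeps the rows where the condition is TRUE and drops rows where it
# is NA, which mirrors how the SQL WHERE clause treats NULL values
houseXQ <- subset(house, community_name != "东郊小镇")

# Add the community name as an unordered factor column
houseXQ$communityFactor <- factor(houseXQ$community_name, ordered = FALSE)

print(houseXQ)
print(levels(houseXQ$communityFactor))
```

Assigning the factor with $ avoids the separate cbind() call and keeps the result a plain data frame.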

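Correlation of total price with area, current floor, total floors, and unit price

> x <- houseXQ[, c("house_total")]
> y <- houseXQ[, c("house_area", "house_floor_curr", "house_floor_total", "house_avg")]
> cor(x, y)

Result:

     house_area house_floor_curr house_floor_total house_avg
[1,]  0.9450675      -0.02058832        0.03570221 0.4395242

Total price is strongly correlated with area (r ≈ 0.945). Its correlation with unit price is moderate (r ≈ 0.44), and its correlations with current floor and total floors are close to zero.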

Significance test of the correlation coefficient: the test below shows that the correlation between total price and area is significantly different from zero (p-value < 2.2e-16).

> cor.test(houseXQ[, c("house_total")], houseXQ[, c("house_area")])

	Pearson's product-moment correlation

data:  houseXQ[, c("house_total")] and houseXQ[, c("house_area")]
t = 39.537, df = 187, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9274393 0.9585053
sample estimates:
      cor
0.9450675
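The object returned by cor.test() can also be inspected programmatically instead of being read off the printed report. The sketch below uses the built-in state.x77 data as a stand-in (the housing file is not available here) and then checks the t statistic reported above with the standard relation t = r * sqrt(df / (1 - r^2)).

```r
# Stand-in data: two columns of the built-in state.x77 matrix
states <- state.x77[, 1:6]
ct <- cor.test(states[, "Illiteracy"], states[, "Murder"])

r  <- unname(ct$estimate)    # sample correlation
df <- unname(ct$parameter)   # degrees of freedom, n - 2

print(r)
print(ct$statistic)          # t statistic
print(ct$p.value)            # two-sided p-value
print(ct$conf.int)           # 95 percent confidence interval

# The t statistic follows directly from r and df
t_manual <- r * sqrt(df / (1 - r^2))
print(t_manual)

# Same formula applied to the values reported for total price vs. area
t_src <- 0.9450675 * sqrt(187 / (1 - 0.9450675^2))
print(round(t_src, 2))
```

Applied to r = 0.9450675 and df = 187, the formula gives a value that agrees with the t = 39.537 in the report.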

Correlation of unit price with area, current floor, total floors, and total price

> x <- houseXQ[, c("house_avg")]
> y <- houseXQ[, c("house_area", "house_floor_curr", "house_floor_total", "house_total")]
> cor(x, y)

Result:

     house_area house_floor_curr house_floor_total house_total
[1,]  0.1659645        0.2139952         0.3024903   0.4395242
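Both one-row results are rows of the same symmetric correlation matrix, which is why 0.4395242 appears in each of them. A minimal sketch of computing the whole matrix in one call follows; it runs on simulated data that only borrows the column names, so the numbers it prints are not the housing results.

```r
# Simulated stand-in data (random values; only the column names match the post)
set.seed(1)
n <- 50
sim <- data.frame(
  house_area        = runif(n, 50, 150),
  house_floor_curr  = sample(1:30, n, replace = TRUE),
  house_floor_total = sample(6:33, n, replace = TRUE),
  house_avg         = runif(n, 8000, 15000)
)
sim$house_total <- sim$house_area * sim$house_avg

# One call gives every pairwise Pearson correlation
m <- cor(sim)
print(round(m, 3))

# The rows for total price and unit price correspond to the two cor(x, y) calls
print(m["house_total", ])
print(m["house_avg", ])
```

Because the matrix is symmetric, the total price / unit price entry is the same in both rows.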

Original article: https://www.cnblogs.com/quietwalk/p/8301342.html