在R中练习k-meas聚类

算法优势：适用于绝大多数的数据类型，简洁和快速

算法劣势：需要知道准确的

数据探索

确认数据结构：整体都是数值型的，

1、性别是分类变量，这样的话该变量不能被K聚类识别，需要对性别进行哑变量编码

2、数据里边存在缺失

> str(teens)
'data.frame':    30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
 $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
 $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
 $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
 $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
 $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
 $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
 $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
 $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
 $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
 $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
 $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
 $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
 $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
 $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
 $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
 $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
 $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
 $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
 $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
 $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
 $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

查看缺失情况

> summary(teens)
    gradyear     gender           age             friends         basketball     
 Min.   :2006   F   :22054   Min.   :  3.086   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:2007   M   : 5222   1st Qu.: 16.312   1st Qu.:  3.00   1st Qu.: 0.0000  
 Median :2008   NA's: 2724   Median : 17.287   Median : 20.00   Median : 0.0000  
 Mean   :2008                Mean   : 17.994   Mean   : 30.18   Mean   : 0.2673  
 3rd Qu.:2008                3rd Qu.: 18.259   3rd Qu.: 44.00   3rd Qu.: 0.0000  
 Max.   :2009                Max.   :106.927   Max.   :830.00   Max.   :24.0000  
                             NA's   :5086                                        
    football           soccer           softball         volleyball     
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.2523   Mean   : 0.2228   Mean   : 0.1612   Mean   : 0.1431  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :15.0000   Max.   :27.0000   Max.   :17.0000   Max.   :14.0000  
                                                                        
    swimming        cheerleading       baseball           tennis        
 Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000  
 Median : 0.0000   Median :0.0000   Median : 0.0000   Median : 0.00000  
 Mean   : 0.1344   Mean   :0.1066   Mean   : 0.1049   Mean   : 0.08733  
 3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000  
 Max.   :31.0000   Max.   :9.0000   Max.   :16.0000   Max.   :15.00000  
                                                                        
     sports           cute              sex                sexy        
 Min.   : 0.00   Min.   : 0.0000   Min.   :  0.0000   Min.   : 0.0000  
 1st Qu.: 0.00   1st Qu.: 0.0000   1st Qu.:  0.0000   1st Qu.: 0.0000  
 Median : 0.00   Median : 0.0000   Median :  0.0000   Median : 0.0000  
 Mean   : 0.14   Mean   : 0.3229   Mean   :  0.2094   Mean   : 0.1412  
 3rd Qu.: 0.00   3rd Qu.: 0.0000   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
 Max.   :12.00   Max.   :18.0000   Max.   :114.0000   Max.   :18.0000  
                                                                       
      hot              kissed            dance              band        
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.1266   Mean   : 0.1032   Mean   : 0.4252   Mean   : 0.2996  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :10.0000   Max.   :26.0000   Max.   :30.0000   Max.   :66.0000  
                                                                        
    marching           music              rock              god         
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.0406   Mean   : 0.7378   Mean   : 0.2433   Mean   : 0.4653  
 3rd Qu.: 0.0000   3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
 Max.   :11.0000   Max.   :64.0000   Max.   :21.0000   Max.   :79.0000  
                                                                        
     church            jesus             bible               hair        
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.00000   Median : 0.0000  
 Mean   : 0.2482   Mean   : 0.1121   Mean   : 0.02133   Mean   : 0.4226  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.0000  
 Max.   :44.0000   Max.   :30.0000   Max.   :11.00000   Max.   :37.0000  
                                                                         
     dress           blonde              mall            shopping     
 Min.   :0.000   Min.   :  0.0000   Min.   : 0.0000   Min.   : 0.000  
 1st Qu.:0.000   1st Qu.:  0.0000   1st Qu.: 0.0000   1st Qu.: 0.000  
 Median :0.000   Median :  0.0000   Median : 0.0000   Median : 0.000  
 Mean   :0.111   Mean   :  0.0989   Mean   : 0.2574   Mean   : 0.353  
 3rd Qu.:0.000   3rd Qu.:  0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000  
 Max.   :9.000   Max.   :327.0000   Max.   :12.0000   Max.   :11.000  
                                                                      
    clothes         hollister        abercrombie           die         
 Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0000  
 Median :0.0000   Median :0.00000   Median :0.00000   Median : 0.0000  
 Mean   :0.1485   Mean   :0.06987   Mean   :0.05117   Mean   : 0.1841  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 0.0000  
 Max.   :8.0000   Max.   :9.00000   Max.   :8.00000   Max.   :22.0000  
                                                                       
     death             drunk             drugs         
 Min.   : 0.0000   Min.   :0.00000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.: 0.00000  
 Median : 0.0000   Median :0.00000   Median : 0.00000  
 Mean   : 0.1142   Mean   :0.08797   Mean   : 0.06043  
 3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.: 0.00000  
 Max.   :14.0000   Max.   :8.00000   Max.   :16.00000

缺失处理

> table(teens$gender,useNA = 'ifany')#返回缺失个数

    F     M  <NA> 
22054  5222  2724

> teens$female<-ifelse(teens$gender=='F'&!is.na(teens$gender),1,0)#将性别进行哑变量处理，首先将非空的女性赋值为1，NA或者男赋值为0

> teens$no_gender<-ifelse(is.na(teens$gender),1,0)#性别缺失为1，否则为0，这样的话相当于把男女哑变量处理，同时缺失也纳入考量

> teens$gender<-NULL

年龄处理：高中生不可能106岁，也不可能有3岁，同时存在较多缺失

> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.086  16.310  17.290  17.990  18.260 106.900    5086

> teens$age <- ifelse(teens$age >= 16 & teens$age < 20,teens$age,NA)#将16-20岁视为高中生的正常年龄，其他数据视为异常值为空
> aggregate(data=teens,age~gradyear,mean,na.rm=TRUE)#求不同毕业年份的年龄均值，越最近毕业的人越年轻
  gradyear      age
1     2006 18.68560
2     2007 17.71878
3     2008 16.78118
4     2009 16.33252

# 按年份生成与原数据集合等长的向量

> ave_age <- ave(teens$age, teens$gradyear, FUN = function(x) { return(mean(x, na.rm = TRUE)) })

> teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)#将NA的年龄赋值为均值，非空的保持原值
> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   16.33   17.25   17.38   18.22   20.00

数据划分

> interests <- teens[, 4:39]
> interests_z <- data.frame(lapply(interests, scale))#标准化数据，减少量纲差异

k均值建模

set.seed(11)#设定随机种子
teen_clusters <- kmeans(interests_z, 5)

评估模型

> head(teen_clusters$cluster)#聚类标签
[1] 2 1 2 2 5 2
> teen_clusters$size#5个类别的数量
[1]  5233 22182   731   803  1051
> teen_clusters$centers#聚类中心
  basketball    football      soccer     softball  volleyball    swimming cheerleading    baseball
1  0.4531058  0.43339400 -0.05953070  0.375996747  0.40342075  0.30482805   0.46479149  0.24682651
2 -0.1290187 -0.12977694 -0.14899365 -0.097191693 -0.10066663 -0.09079804  -0.11397570 -0.06918139
3  0.2457823  0.23577386  5.02937767  0.033589676  0.09057228  0.13165120  -0.02646162  0.03223472
4 -0.1109113  0.03602234 -0.14104152 -0.007481003 -0.07966170  0.04595234  -0.12500102 -0.11042346
5  0.3807668  0.38961431  0.05069711  0.161530303  0.11384059  0.27191272   0.20520869  0.29309699
       tennis     sports        cute          sex         sexy         hot      kissed       dance
1  0.13565618  0.2229282  0.65570947  0.005138338  0.230127643  0.50391601 -0.02252810  0.55362283
2 -0.04152949 -0.1020853 -0.17622878 -0.094412671 -0.077760392 -0.13325067 -0.13207336 -0.15193168
3  0.11156263  0.3998271  0.01361179 -0.056096990 -0.008332101  0.06411155 -0.06295344 -0.02447142
4  0.01655266 -0.1041384 -0.04851623 -0.046718513 -0.019765548 -0.06929059 -0.06814111  0.02098573
5  0.11082240  0.8460723  0.48219547  2.041764818  0.516256470  0.31165437  2.99550632  0.45107250
         band    marching       music        rock           god      church       jesus
1 -0.04813421 -0.12344684  0.26060636  0.18108418  0.3304953087  0.51972951  0.24097725
2 -0.12756333 -0.13513757 -0.13216778 -0.10905451 -0.1001591088 -0.13566360 -0.06601809
3 -0.04525084 -0.09376867  0.04330389  0.09518405 -0.0001367759  0.10922246  0.01664811
4  3.37541403  4.63879117  0.39528646  0.16181449  0.0735799460  0.03691392  0.07066919
5  0.38450719 -0.01216509  1.15977396  1.21019696  0.4122385221  0.17132449  0.12793743
        bible         hair        dress      blonde        mall    shopping     clothes
1  0.22258000  0.325589087  0.444453842  0.04528480  0.64138010  0.85834274  0.51465115
2 -0.05930602 -0.198168525 -0.132551874 -0.02830573 -0.17908734 -0.22075867 -0.18246937
3  0.03613298 -0.004854472  0.005732568  0.03146825  0.03512586  0.21142729 -0.04501739
4  0.04788738 -0.041196192  0.049581302 -0.01631323 -0.09605527 -0.06611646  0.01252734
5  0.08173007  2.596190010  0.542753982  0.36251061  0.63523179  0.28896231  1.31038466
    hollister  abercrombie          die       death       drunk       drugs
1  0.57661846  0.557958327  0.043023482  0.12784283  0.01420576 -0.05433707
2 -0.15090261 -0.146583539 -0.089778750 -0.07512888 -0.08530937 -0.11084192
3  0.06677613  0.002922202 -0.003454604 -0.01723961 -0.04902532 -0.04029186
4 -0.15478823 -0.147391253 -0.013624100  0.02928218 -0.08311224 -0.08479959
5  0.38568876  0.426202832  1.693430840  0.93872106  1.82737429  2.70274859

将聚类标签赋值回原记录，在本博文SPSS实例中SPSS聚类之后的结果也会生成新列，便于针对目标类别进行进一步特征提取

> names(teens)
 [1] "gradyear"     "age"          "friends"      "basketball"   "football"     "soccer"      
 [7] "softball"     "volleyball"   "swimming"     "cheerleading" "baseball"     "tennis"      
[13] "sports"       "cute"         "sex"          "sexy"         "hot"          "kissed"      
[19] "dance"        "band"         "marching"     "music"        "rock"         "god"         
[25] "church"       "jesus"        "bible"        "hair"         "dress"        "blonde"      
[31] "mall"         "shopping"     "clothes"      "hollister"    "abercrombie"  "die"         
[37] "death"        "drunk"        "drugs"        "no_gender"    "female"      
> teens$cluster <- teen_clusters$cluster
> teens[1:5, c("cluster", "female", "no_gender", "age", "friends")]
  cluster female no_gender    age friends
1       2      0         0 18.982       7
2       1      1         0 18.801       0
3       2      0         0 18.335      69
4       2      1         0 18.875       0
5       5      0         1 18.995      10

特征查看

  cluster female no_gender    age friends
1       2      0         0 18.982       7
2       1      1         0 18.801       0
3       2      0         0 18.335      69
4       2      1         0 18.875       0
5       5      0         1 18.995      10
> aggregate(data = teens, age ~ cluster, mean)#年龄区分度较小
  cluster      age
1       1 17.19971
2       2 17.43109
3       3 17.16270
4       4 17.47728
5       5 17.26249
> aggregate(data = teens, female ~ cluster, mean)
  cluster    female
1       1 0.8696732
2       2 0.6995312
3       3 0.7633379
4       4 0.7272727
5       5 0.8030447
> aggregate(data = teens, friends ~ cluster, mean)#
  cluster  friends
1       1 39.05828
2       2 27.78077
3       3 35.73735
4       4 32.95392
5       5 30.61180

优化模型，调整K

> set.seed(11)
> teen_clusters <- kmeans(interests_z, 10)
> teens$cluster <- teen_clusters$cluster

> table(teens$cluster)

    1     2     3     4     5     6     7     8     9    10 
  424   540  1993  2478   680   519   856  1917   996 19597 
> data.frame(aggregate(data = teens, football ~ cluster, mean), aggregate(data = teens, friends ~ cluster, mean), aggregate(data = teens, basketball ~ cluster, mean))
   cluster  football cluster.1  friends cluster.2 basketball
1        1 0.2900943         1 35.01179         1  0.3584906
2        2 0.2611111         2 35.24815         2  0.3425926
3        3 0.2483693         3 33.29704         3  0.2498746
4        4 0.4293785         4 39.57143         4  0.3950767
5        5 0.3941176         5 35.54265         5  0.4235294
6        6 1.2215800         6 32.96532         6  0.9441233
7        7 0.5537383         7 30.65070         7  0.5911215
8        8 0.2008346         8 32.38341         8  0.2034429
9        9 0.4608434         9 40.26506         9  1.0923695
10      10 0.1799255        10 27.42195        10  0.1757412
> names(teens)
 [1] "gradyear"     "age"          "friends"      "basketball"   "football"     "soccer"      
 [7] "softball"     "volleyball"   "swimming"     "cheerleading" "baseball"     "tennis"      
[13] "sports"       "cute"         "sex"          "sexy"         "hot"          "kissed"      
[19] "dance"        "band"         "marching"     "music"        "rock"         "god"         
[25] "church"       "jesus"        "bible"        "hair"         "dress"        "blonde"      
[31] "mall"         "shopping"     "clothes"      "hollister"    "abercrombie"  "die"         
[37] "death"        "drunk"        "drugs"        "female"       "no_gender"    "cluster"     
> data.frame(aggregate(data = teens, soccer ~ cluster, mean), aggregate(data = teens, softball ~ cluster, mean), aggregate(data = teens, volleyball ~ cluster, mean))
   cluster     soccer cluster.1   softball cluster.2 volleyball
1        1 0.16273585         1 0.15094340         1 0.11556604
2        2 0.16851852         2 0.14629630         2 0.12407407
3        3 0.15805319         3 0.09081786         3 0.12995484
4        4 0.16384181         4 0.12227603         4 0.13841808
5        5 4.94264706         5 0.11764706         5 0.13823529
6        6 0.31791908         6 0.21965318         6 0.10789981
7        7 0.26752336         7 0.23481308         7 0.19742991
8        8 0.09911320         8 0.08242045         8 0.07563902
9        9 0.19076305         9 2.55522088         9 1.99899598
10      10 0.08506404        10 0.05669235        10 0.05720263

可以看到每个类中不同变量的均值情况，可以手动将相似的进行合并，例如1、2中交友数量这个角度1、2、5都很接近，而soccer角度1、2、4又很接近，可以将所有的变量的均值进行查看判断进行人工合并分类。

仅对聚类算法的使用做一个练习，有不妥之处望指教，学习体验是：学无止境......永远有不确定的地方，永远有不懂的东西......刺激又恐慌