Hands-on: Using an SVM to Predict Whether a Gene Expression Sample Is Cancerous


As we mentioned earlier, gene expression analysis has a wide variety of applications, including cancer studies. In 1999, Uri Alon analyzed gene expression data for 2,000 genes from 40 colon tumor tissues and compared them with data from colon tissues belonging to 21 healthy individuals, all measured at a single time point. We can represent his data as a 2,000 × 61 gene expression matrix, where the first 40 columns describe tumor samples and the last 21 columns describe normal samples.

Now, suppose you performed a gene expression experiment with a colon sample from a new patient, corresponding to a 62nd column in an augmented gene expression matrix. Your goal is to predict whether this patient has a colon tumor. Since the partition of tissues into two clusters (tumor vs. healthy) is known in advance, it may seem that classifying the sample from a new patient is easy. Indeed, since each patient corresponds to a point in 2,000-dimensional space, we can compute the center of gravity of these points for the tumor sample and for the healthy sample. Afterwards, we can simply check which of the two centers of gravity is closer to the new tissue.
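The center-of-gravity comparison described above can be sketched in a few lines. This is only an illustration on synthetic data, not Alon's matrix: the function name `nearest_centroid_label`, the toy arrays, and the use of Euclidean distance are all assumptions made for the example. Samples are stored as rows here, whereas the text describes them as columns.

```python
import numpy as np

def nearest_centroid_label(tumor, healthy, sample):
    """Return 1 if `sample` is closer to the tumor centroid, else 0."""
    tumor_center = np.mean(tumor, axis=0)      # center of gravity of tumor samples
    healthy_center = np.mean(healthy, axis=0)  # center of gravity of healthy samples
    d_tumor = np.linalg.norm(sample - tumor_center)
    d_healthy = np.linalg.norm(sample - healthy_center)
    return 1 if d_tumor < d_healthy else 0

# Toy data: 40 "tumor" and 21 "healthy" samples with 5 hypothetical genes each.
rng = np.random.default_rng(0)
tumor = rng.normal(loc=1.0, scale=0.1, size=(40, 5))
healthy = rng.normal(loc=-1.0, scale=0.1, size=(21, 5))

new_sample = np.full(5, 0.8)
label = nearest_centroid_label(tumor, healthy, new_sample)
print(label)
```

On toy data this clean the rule works trivially. With 2,000 noisy genes, every gene contributes equally to the distance, so uninformative genes can drown out the few that matter.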

Alternatively, we could perform a blind analysis, pretending that we do not already know the classification of samples into cancerous vs. healthy, and analyze the resulting 2,000 × 62 expression matrix to divide the 62 samples into two clusters. If we obtain a cluster consisting predominantly of cancer tissues, this cluster may help us diagnose colon cancer.
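A rough sketch of this blind analysis, assuming k-means as the clustering method (the text does not name one) and using synthetic data in place of the real matrix. Samples are rows here, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the expression data: rows are samples, columns are genes.
rng = np.random.default_rng(0)
tumor = rng.normal(loc=1.0, scale=0.1, size=(40, 5))
healthy = rng.normal(loc=-1.0, scale=0.1, size=(21, 5))
new_sample = np.full((1, 5), 0.8)

# Pool all 62 samples and cluster them without using the known labels.
samples = np.vstack([tumor, healthy, new_sample])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)

# Count how many known tumor and healthy samples share the new sample's cluster.
new_cluster = clusters[-1]
tumor_in_cluster = int(np.sum(clusters[:40] == new_cluster))
healthy_in_cluster = int(np.sum(clusters[40:61] == new_cluster))
print(tumor_in_cluster, healthy_in_cluster)
```

The known labels are used only afterwards, to interpret which cluster the new sample landed in. On real expression data there is no guarantee that the two clusters line up with tumor vs. healthy at all.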

Final Challenge: These approaches may seem straightforward, but it is unlikely that either of them will reliably diagnose the new patient. Why do you think this is? Given Alon’s 2,000 × 61 gene expression matrix and gene data from a new patient, derive a superior approach to evaluate whether this patient is likely to have a colon tumor.

1. Principle

See:

https://www.cnblogs.com/dfcao/p/3462721.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

2. Implementation

Data:

40 Cancer Samples

21 Healthy Samples

Unknown Sample

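Problem analysis:

This is a classification problem with 61 labelled samples and 2,000 features. Because the features far outnumber the samples, an SVM with a Gaussian (RBF) kernel is likely to overfit, so a linear kernel is used instead.

Code: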
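The overfitting concern can be illustrated on synthetic data with many more features than samples. This sketch does not use the colon data: the helper `make_samples`, the feature counts, the class shift, and `gamma=10` are all assumptions chosen to make the effect visible. It compares training accuracy with accuracy on separately generated validation samples for both kernels.

```python
import numpy as np
from sklearn import svm

def make_samples(rng, n_per_class, n_features=200, n_informative=20, shift=3.0):
    """Synthetic two-class data: only the first `n_informative` features differ."""
    X = rng.normal(size=(2 * n_per_class, n_features))
    y = np.array([1] * n_per_class + [0] * n_per_class)
    X[:n_per_class, :n_informative] += shift  # class 1 is shifted in a few features
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_samples(rng, 30)
X_val, y_val = make_samples(rng, 20)

results = {}
for kernel in ("linear", "rbf"):
    # gamma only affects the RBF kernel; the linear kernel ignores it.
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    results[kernel] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    print(kernel, "train=%.2f" % results[kernel][0],
          "validation=%.2f" % results[kernel][1])
```

With a large `gamma` in high dimensions, the RBF kernel treats every pair of distinct samples as essentially unrelated, so the model memorizes the training set and carries nothing over to new samples. A smaller, tuned `gamma` could behave quite differently, so this shows the failure mode rather than ruling out the RBF kernel altogether.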

from os.path import dirname, join

from sklearn import svm

DATA_DIR = dirname(__file__)


def read_samples(filename):
    """Read one sample per line; each line holds whitespace-separated expression values."""
    with open(join(DATA_DIR, filename)) as f:
        lines = f.read().strip().split('\n')
    return [list(map(float, line.split())) for line in lines]


def Input():
    # Cancer samples (label 1): the first 10 are held out for validation.
    cancer = read_samples('colon_cancer.txt')
    # Healthy samples (label 0): the first 5 are held out for validation.
    healthy = read_samples('colon_healthy.txt')

    X = cancer[10:] + healthy[5:]
    Y = [1] * (len(cancer) - 10) + [0] * (len(healthy) - 5)
    check_x = cancer[:10] + healthy[:5]
    check_y = [1] * 10 + [0] * 5

    # The unknown sample to classify.
    test_X = read_samples('colon_test.txt')

    return [X, Y, test_X, check_x, check_y]


if __name__ == '__main__':
    [X_train, y_train, test_X, check_x, check_y] = Input()

    kernel = 'linear'  # linear kernel

    # gamma is only used by the rbf, poly and sigmoid kernels, so it is omitted.
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)

    # Accuracy on the 15 held-out validation samples.
    predict_for_check = clf.predict(check_x)
    cnt = 0
    for i in range(len(check_y)):
        if check_y[i] == predict_for_check[i]:
            cnt += 1
    print('Accuracy %.0f%%' % (100 * cnt / len(check_y)))

    # Predicted label for the unknown sample (1 = cancer, 0 = healthy).
    print(clf.predict(test_X))

Accuracy 87%
[0]

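Oddly, when only the first 20 genes are used for the analysis, the accuracy on the held-out validation samples actually rises above 90%: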

Accuracy 93%

[0]
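With only 15 held-out samples, 87% and 93% correspond to 13 and 14 correct predictions, so the two runs differ by a single sample. One way to get a steadier comparison is stratified k-fold cross-validation over all 61 labelled samples. The sketch below uses synthetic data of the same shape, since the real files are not reproduced here. The planted signal, the number of folds, and the variable names are assumptions.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in shaped like the labelled data: 61 samples x 2,000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(61, 2000))
y = np.array([1] * 40 + [0] * 21)
X[y == 1, :50] += 1.0  # hypothetical signal planted in the first 50 genes

# The same stratified folds are reused for both feature sets.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = svm.SVC(kernel="linear")

scores_all = cross_val_score(clf, X, y, cv=cv)
scores_first20 = cross_val_score(clf, X[:, :20], y, cv=cv)

print("all genes:      mean=%.2f std=%.2f" % (scores_all.mean(), scores_all.std()))
print("first 20 genes: mean=%.2f std=%.2f" % (scores_first20.mean(), scores_first20.std()))
```

Swapping in the real `X` and `y` would show whether the 20-gene model is genuinely better or whether the gap falls within fold-to-fold variation.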