MAGMA | gene analysis and generalized gene-set analysis of GWAS data

基于GWAS的分析结果做基因集富集分析。

基本代码:

# download sumstats file from https://ctg.cncr.nl/software/summary_statistics
# download NCBI37 and g1000_eur from https://ctg.cncr.nl/software/magma
# download msigdb.v7.2.entrez.gmt from http://www.gsea-msigdb.org/gsea/downloads.jsp

# Annotation ../magma --annotate --snp-loc g1000_eur/g1000_eur.bim --gene-loc NCBI37/NCBI37.3.gene.loc --out test # Gene analysis - SNP p-values ../magma --bfile g1000_eur/g1000_eur --pval sumstats/SESA_neuro_clus_sumstats.txt N=449484 --gene-annot test.genes.annot --out test.gene # cat msigdb.v7.2.entrez.gmt | cut -f1,3- > formated.msigdb.v7.2.entrez.gmt
# Gene-set analysis ../magma --gene-results test.gene.genes.raw --set-annot geneSet/formated.msigdb.v7.2.entrez.gmt --out GS_test

  

Annotation的结果

就是把SNP的ID归到了gene里,NCBI (Entrez) Gene IDs

# window_up = 0
# window_down = 0
79501   1:69091:70008   rs140739101     rs200505207     rs527512746     rs200676709     rs570453814
100996442       1:142447:174392 rs371474651     rs547574544     rs564576411
148398  1:859993:879961 rs568015169     rs542412615     rs61464428      rs57465118      rs78033073      rs57924093      rs60837925      rs57816
26155   1:879583:894679 rs544040354     rs6605067       rs2839  rs553373622     rs143853699     rs528505070     rs557736773     rs370120035
339451  1:895967:901099 rs113034360     rs112905931     rs532448472     rs144174542     rs562855695     rs531742319     rs28393498      rs56856
84069   1:901872:910488 rs116147894     rs201730138     rs62639981      rs199941754     rs28499371      rs530698349     rs561236606     rs77359
84808   1:910579:917473 rs556054629     rs532303174     rs4970429       rs182213575     rs2340592       rs533147788     rs3748588       rs14146
57801   1:934342:936608 rs186167017     rs569070395     rs201391049     rs200512735     rs113602214     rs531408504     rs2298214       rs14625
9636    1:948847:949920 rs4615788       rs15842 rs28491407      rs140198533     rs2465124       rs568580969     rs371949419     rs148041041

 

Gene analysis的结果

类似GWAS的sumstats的结果,核心就是P-value

GENE       CHR      START       STOP  NSNPS  NPARAM       N        ZSTAT            P
148398       1     859993     879961     92      21  449484      0.33571      0.36854
26155        1     879583     894679     72      13  449484      0.22321      0.41168
339451       1     895967     901099     27       8  449484      0.68675      0.24612
84069        1     901872     910488     36      17  449484       1.9254     0.027088
84808        1     910579     917473     24       6  449484      0.94576      0.17214
57801        1     934342     936608     15       5  449484      0.98676      0.16188
9636         1     948847     949920      9       4  449484      0.55709      0.28873
375790       1     955503     991499    193      32  449484       1.0999      0.13569
401934       1    1007126    1009687      9       4  449484       2.4592    0.0069632
54991        1    1017198    1051736    123      17  449484      0.38241      0.35108
254173       1    1109286    1133315    154      20  449484       1.4226     0.077429
8784         1    1138888    1142163     15       4  449484       1.9105     0.028037
7293         1    1146706    1149703     10       4  449484       1.7823     0.037349
51150        1    1152288    1167447     96       6  449484       0.1506      0.44014
126792       1    1167629    1170421      9       3  449484       2.1349     0.016383
388581       1    1177826    1182102     13       3  449484      0.49731      0.30948
118424       1    1189292    1209234     66      12  449484       1.5713     0.058059
6339         1    1215816    1227409     50      18  449484     0.069082      0.47246
116983       1    1227764    1243304     76      19  449484      0.33408      0.36916

  

Gene-set analysis的结果

以pathway为对象的sumstats结果

# TOTAL_GENES = 18152
# TEST_DIRECTION = one-sided, positive (set), two-sided (covar)
# CONDITIONED_INTERNAL = gene size, gene density, inverse mac, log(gene size), log(gene density), log(inverse mac)
VARIABLE                               TYPE  NGENES         BETA     BETA_STD           SE            P FULL_NAME
BIOCARTA_FEEDER_PATHWAY                 SET       9     -0.24616     -0.00548      0.21239      0.87677 BIOCARTA_FEEDER_PATHWAY
BIOCARTA_PROTEASOME_PATHWAY             SET      17    -0.076691   -0.0023459      0.22729       0.6321 BIOCARTA_PROTEASOME_PATHWAY
BIOCARTA_KREB_PATHWAY                   SET       8     -0.51618    -0.010834      0.40188      0.90049 BIOCARTA_KREB_PATHWAY
ST_INTERFERON_GAMMA_PATHWAY             SET      10      0.53983     0.012667      0.32686      0.04932 ST_INTERFERON_GAMMA_PATHWAY
ST_WNT_CA2_CYCLIC_GMP_PATHWAY           SET      20      0.14579    0.0048366      0.19018      0.22167 ST_WNT_CA2_CYCLIC_GMP_PATHWAY
ST_DIFFERENTIATION_PATHWAY_I...         SET      44    -0.037477    -0.001843      0.14289      0.60345 ST_DIFFERENTIATION_PATHWAY_IN_PC12_CELL
ST_TUMOR_NECROSIS_FACTOR_PAT...         SET      28      0.11224    0.0044051      0.17192      0.25692 ST_TUMOR_NECROSIS_FACTOR_PATHWAY
ST_ERK1_ERK2_MAPK_PATHWAY               SET      29     -0.12764   -0.0050977      0.17231      0.77057 ST_ERK1_ERK2_MAPK_PATHWAY
ST_GA12_PATHWAY                         SET      22      0.11411    0.0039702      0.21219      0.29537 ST_GA12_PATHWAY
ST_G_ALPHA_S_PATHWAY                    SET      16       0.1042    0.0030923      0.22091      0.31858 ST_G_ALPHA_S_PATHWAY
ST_G_ALPHA_I_PATHWAY                    SET      36      0.15883    0.0070663      0.14975      0.14444 ST_G_ALPHA_I_PATHWAY
ST_IL_13_PATHWAY                        SET       5     0.048314   0.00080176       0.4724      0.45927 ST_IL_13_PATHWAY
ST_P38_MAPK_PATHWAY                     SET      36     -0.16636   -0.0074015      0.15474      0.85882 ST_P38_MAPK_PATHWAY
ST_JAK_STAT_PATHWAY                     SET       9      0.28402    0.0063229      0.34345      0.20413 ST_JAK_STAT_PATHWAY

pathway里每一个基因的信息

# ALPHA = 0.05
# NUMBER_OF_TESTS = 31071

# _SET1_  VARIABLE = GO_DISTAL_AXON (set)
# _SET1_  NGENES = 290
# _SET1_  P-VALUE = 1.46129e-06
_SET1_   GENE     CHR      START       STOP  NSNPS  NPARAM       N        ZSTAT            P  ZFITTED_BASE  ZRESID_BASE
_SET1_   116983     1    1227764    1243304     76      19  449484   -0.0069549      0.36916    2.1164e-16   -0.0069549
_SET1_   1855       1    1270658    1284509     53      14  449484       1.3402     0.050325    2.2204e-16       1.3402
_SET1_   127262     1    3541556    3546695     20       6  449484      0.78373      0.12976    2.2204e-16      0.78373
_SET1_   8514       1    6052358    6161253    345      59  449484       2.4031     0.001132             0       2.4031
_SET1_   65018      1   20959948   20978004     89      15  449484     0.064676      0.28667    2.0817e-16     0.064676
_SET1_   10236      1   23635953   23671143     84      16  449484       1.5148      0.02933    2.2204e-16       1.5148
_SET1_   4985       1   29138654   29190208    231      37  449484      0.46618      0.14046    2.2204e-16      0.46618
_SET1_   2899       1   37261128   37499844    562      68  449484       2.5742   0.00053178             0       2.5742
_SET1_   1996       1   50513686   50669442    258      42  449484      0.97766     0.059361    2.2204e-16      0.97766
_SET1_   85440      1   62920397   63154039    399      19  449484     0.064887      0.26601    2.0817e-16     0.064887
_SET1_   4919       1   64239690   64647181   1179      91  449484      -1.5572       0.7101    2.2204e-16      -1.5572
_SET1_   58155      1   97187161   97280605    303      28  449484        1.435     0.022342    2.2204e-16        1.435
_SET1_   22911      1  109512835  109584860    241      32  449484       1.1109      0.04975    2.2204e-16       1.1109
_SET1_   3749       1  110753336  110776674     40      14  449484     -0.78973      0.60279    2.2204e-16     -0.78973
_SET1_   3737       1  111136202  111174096    110      15  449484      0.59466      0.15466    2.2204e-16      0.59466
_SET1_   3738       1  111196182  111217655     49      10  449484      -1.3769      0.81738    2.2204e-16      -1.3769
_SET1_   9860       1  113615792  113667824    127      22  449484     0.032914      0.31657    2.0817e-16     0.032914
_SET1_   57657      1  155247218  155259639     34       8  449484      0.15263      0.28912    2.2204e-16      0.15263
_SET1_   23208      1  155829260  155854990     45      12  449484     0.093267      0.30148    2.0817e-16     0.093267
_SET1_   1314       1  160258377  160313354    113      23  449484       1.2371      0.03771    2.2204e-16       1.2371
_SET1_   55811      1  167778357  167883608    487      74  449484      0.39334       0.1405    2.2204e-16      0.39334
_SET1_   63923      1  175036994  175117202    333      37  449484      0.96139     0.050889    2.2204e-16      0.96139
_SET1_   2752       1  182350839  182361341     40      10  449484      0.27951      0.22106    2.2204e-16      0.27951
_SET1_   10092      1  183595328  183605076     47       3  449484      0.29785      0.33574    2.2204e-16      0.29785
_SET1_   23046      1  200938514  200992828    204      29  449484      0.70264      0.10325    2.2204e-16      0.70264
_SET1_   134        1  203096833  203136533    140      29  449484      0.40025      0.16777    2.2204e-16      0.40025

  

核心原理:

Gene analysis

The gene analysis in MAGMA is based on a multiple linear principal components regression [18] model, using an F-test to compute the gene p-value.

This model first projects the SNP matrix for a gene onto its principal components (PC), pruning away PCs with very small eigenvalues, and then uses those PCs as predictors for the phenotype in the linear regression model.

This improves power by removing redundant parameters, and guarantees that the model is identifiable in the presence of highly collinear SNPs. By default only 0.1% of the variance in the SNP data matrix is pruned away.

  

Gene-set analysis

To perform the gene-set analysis, for each gene g the gene p-value pg computed with the gene analysis is converted to a Z-value zg = Φ−1(1 – pg), where Φ−1 is the probit function. This yields a roughly normally distributed variable Z with elements zg that reflects the strength of the association each gene has with the phenotype, with higher values corresponding to stronger associations.

 

参考链接:

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004219

https://ctg.cncr.nl/software/magma

https://ctg.cncr.nl/software/MAGMA/doc/manual_v1.08.pdf

~/softwares/magma_v1.08b/test_data/

原文地址:https://www.cnblogs.com/leezx/p/14309121.html