基于基因表达的打分模型

Cell cycle analysis was conducted using the cyclone function in the scran R package (39). For every cell cluster, each cell was assigned into G1, G2M and S phases.

可以学习一下，来自：PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data

根据基因表达来打分的需求有很多很多：

细胞周期打分
增殖指数 - 干细胞、癌细胞
迁移指数 - 某些特定细胞，如ENCC
代谢指数

参考文章：

Proliferation Index: A Continuous Model to Predict Prognosis in Patients with Tumours of the Ewing's Sarcoma Family

Personalized Prediction of Proliferation Rates and Metabolic Liabilities in Cancer Biopsies

midbrain那篇Cell

LGE那篇Nature

Morphology-based prediction of cancer cell migration using an artificial neural network and a random decision forest - 不是基于基因表达的

Quantifying rates of cell migration and cell proliferation in co-culture barrier assays reveals how skin and melanoma cells interact during melanoma spreading and invasion.

Research Techniques Made Simple: Analysis of Collective Cell Migration Using the Wound Healing Assay

Cell Migration in Development and Disease

Increased cell proliferation and neurogenesis in the adult human Huntington's disease brain

做的人很多，现在想做必须加入一些新特征，cell type区分等。

cell cycle score

Removal of cell cycle effect.

The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primarily, depth) from downstream analyses. However, heterogeneity in cell cycle stage, particularly among mitotic cells transitioning between S and G2/M phases, also can drive substantial transcriptomic variation that can mask biological signal. To mitigate this effect, we use a two-step approach:

1) quantify cell cycle stage for each cell using supervised analyses with known stage-specific markers,

2) regress the effect of cell cycle stage using the same negative binomial regression as outlined above.

For the first step we use a previously published list of cell cycle dependent genes (43 S phase genes, 54 G2/M phase genes) for an enrichment analysis similar to that proposed in ref. 11.

For each cell, we compare the sum of phase-specific gene expression (log10 transformed UMIs) to the distribution of 100 random background genes sets, where the number of background genes is identical to the phase gene set, and the background genes are drawn from the same expression bins. Expression bins are defined by 50 non-overlapping windows of the same range based on log10(mean UMI). The phase-specific enrichment score is the expression z-score relative to the mean and standard deviation of the background gene sets. Our final ‘cell cycle score’ (Extended Data Fig. 1) is the difference between S-phase score and G2/M-phase score.

除了要normalize测序深度，还要考虑细胞周期的阶段（总共有四个细胞周期stage），尤其是S期和G2/M期。这里首先定量哪些已知的细胞周期的marker (43 S phase genes, 54 G2/M phase genes)，其次，比较总和/100个随机的背景基因；背景基因来自相同的bin（50 non-overlapping windows of the same range），最后计算了一个z-score。

note: A z-score for a sample indicates the number of standard deviations away from the mean of expression in the reference. The formula is : z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample.

For a final normalized dataset with cell cycle effect removed, we perform negative binomial regression with technical factors and cell cycle score as predictors. Although the cell cycle activity was regressed out of the data for downstream analysis, we stored the computed cell cycle score before regression, enabling us to remember the mitotic phase of each individual cell. Notably, our regression strategy is tailored to mitigate the effect of transcriptional heterogeneity within mitotic cells in different phases, and should not affect global differences between mitotic and non-mitotic cells that may be biologically relevant.

声称自己的normalization只去除了mitotic cells内部的差异。

# input S markers, and N=100, and gene bin (all genes are divided to 33 bins)
get.bg.lists <- function(goi, N, expr.bin) {
  res <- list()
  # check the marker local in which bin, and table the count
  goi.bin.tab <- table(expr.bin[goi])
  for (i in 1:N) {
  	# each time, random select 38 background genes and merge
    res[[i]] <- unlist(lapply(names(goi.bin.tab), function(b) {
      # find the genes in this bin & genes not in the markers
      sel <- which(expr.bin == as.numeric(b) & !(names(expr.bin) %in% goi))
      # random sample, count corespond to the number of marker
      sample(names(expr.bin)[sel], goi.bin.tab[b])
    }))
  }
  return(res)
}

# input log2 expr, S markers, 100 background genes
enr.score <- function(expr, goi, bg.lst) {
  # S markers mean of each cell
  goi.mean <- apply(expr[goi, ], 2, mean)
  # get background genes mean by 100 samples
  bg.mean <- sapply(1:length(bg.lst), function(i) apply(expr[bg.lst[[i]], ], 2, mean))
  # return z-score
  return((goi.mean - apply(bg.mean, 1, mean)) / apply(bg.mean, 1, sd))
}


get.cc.score <- function(cm, N=100, seed=42) {
  set.seed(seed)
  cat('get.cc.score, ')
  cat('number of random background gene sets set to', N, '
')
  min.cells <- 5 # min detected gene number  
  cells.mols <- apply(cm, 2, sum) # total umi
  gene.cells <- apply(cm>0, 1, sum) # total detected gene number
  cm <- cm[gene.cells >= min.cells, ] # filter
  gene.mean <- apply(cm, 1, mean) # gene mean
  # divide all genes to 50 bins
  breaks <- unique(quantile(log10(gene.mean), probs = seq(0,1, length.out = 50)))
  gene.bin <- cut(log10(gene.mean), breaks = breaks, labels = FALSE) # table(gene.bin) to see count
  names(gene.bin) <- rownames(cm)
  gene.bin[is.na(gene.bin)] <- 0
  regev.s.genes <- read.table(file='./annotation/s_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
  regev.g2m.genes <- read.table(file='./annotation/g2m_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
  # list for go, mouse to human
  goi.lst <- list('S'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.s.genes))],
                  'G2M'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.g2m.genes))])
  
  n <- min(40, min(sapply(goi.lst, length))) # 38
  # sort the marker by their expression
  goi.lst <- lapply(goi.lst, function(x) x[order(gene.mean[x], decreasing = TRUE)[1:n]])
  # double list, background gene list
  bg.lst <- list('S'=get.bg.lists(goi.lst[['S']], N, gene.bin),
                 'G2M'=get.bg.lists(goi.lst[['G2M']], N, gene.bin))
  # merge all genes, markers  + background, sort by name
  all.genes <- sort(unique(c(unlist(goi.lst, use.names=FALSE), unlist(bg.lst, use.names=FALSE))))
  expr <- log10(cm[all.genes, ]+1)
  # 
  s.score <- enr.score(expr, goi.lst[['S']], bg.lst[['S']])
  g2m.score <- enr.score(expr, goi.lst[['G2M']], bg.lst[['G2M']])
  
  phase <- as.numeric(g2m.score > 2 & s.score <= 2)
  phase[g2m.score <= 2 & s.score > 2] <- -1
  
  return(data.frame(score=s.score-g2m.score, s.score, g2m.score, phase))
}

问题：

1. 这确实是一种打分模型，但是如何评估模型是否准确呢？

想构建一个打分模型，用于scRNA-seq的细胞类型打分，以及bulk RNA-seq的细胞组成成分预测。

其实有篇文章已经做过这种工作了，做了第一部分，第二部分还没有评估。