Brief summary of the "textanalytics" course (2): topic mining

These notes follow the open course series on Coursera (https://www.coursera.org/course/textanalytics), which is very well taught.

1. "Term as topic" has many problems:

2. Improved Idea: Topic = Word Distribution:

3. Defining the problem (Probabilistic Topic Mining and Analysis):

4. How to solve this problem (Generative Model for Probabilistic Topic Mining and Analysis):

– Model data generation with a prob. model: P(Data | Model, λ)
– Infer the most likely parameter values λ* given a particular data set: λ* = argmax_λ p(Data | Model, λ)
– Take λ* as the “knowledge” to be mined for the text mining problem
– Adjust the design of the model to discover different knowledge

where: λ = ({θ1, …, θk}, {π11, …, π1k}, …, {πN1, …, πNk})

5. The Simplest Language Model (generative model): Unigram LM

A document is produced by generating each word independently, so:
• p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)
• Text = sample drawn according to this word distribution, for example:

p(“today is Wed”) = p(“today”)p(“is”)p(“Wed”) = 0.0002 * 0.001 * 0.000015
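
The product above can be checked with a minimal Python sketch. The dictionary `unigram` and the helper names are hypothetical; only the three probabilities come from the example. The log-space variant is an addition not discussed in the notes, included because products of many small probabilities can underflow.

```python
import math

# Hypothetical unigram model: only these three probabilities appear in the notes.
unigram = {"today": 0.0002, "is": 0.001, "Wed": 0.000015}

def sequence_probability(words, model):
    """Multiply per-word probabilities, assuming each word is generated independently."""
    prob = 1.0
    for w in words:
        prob *= model[w]
    return prob

def sequence_log_probability(words, model):
    """Same quantity in log space: a sum of logs instead of a product."""
    return sum(math.log(model[w]) for w in words)

words = ["today", "is", "Wed"]
p = sequence_probability(words, unigram)
log_p = sequence_log_probability(words, unigram)
print(p, log_p)
```

Here `p` is approximately 3e-12, and `math.exp(log_p)` agrees with it up to floating-point error.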

6. Two ways to estimate the probability of generating text:

• Maximum likelihood estimation

"Best" means that the likelihood of the sample data reaches its maximum.
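
For the unigram model, the standard maximum likelihood result is that each word's probability equals its relative frequency in the observed text. The sketch below illustrates this under that assumption; the function name `mle_unigram` and the toy sentence are hypothetical and not taken from the course.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood estimate of a unigram model: p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Hypothetical toy document, split on whitespace.
tokens = "text mining is fun and text mining is useful".split()
model = mle_unigram(tokens)

for word, prob in sorted(model.items()):
    print(word, prob)
```

Words that never occur in the sample get no entry at all, that is, probability zero under this estimate.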