Open questions and notes on the paper “Tracking Evolving Communities in Large Linked Networks”

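Notes on the first paragraph:

　　1. Goal of the paper: We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time.

　　2. Data: the NEC CiteSeer database, a linked network of >250,000 papers.

　　3. Prerequisite: Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. (I had not read other papers in this area, so at first I did not see why. The third paragraph of the paper gives the reason: if the clustering is unstable, "real" changes cannot be told apart from "accidental" ones.)

　　3.1 Next, a problem with the data used in this paper: small perturbations of the CiteSeer data lead to significant changes to most of the clusters. This clearly conflicts with the prerequisite above, and the paper later introduces its remedy (natural communities). First, the cause of this "small perturbation, large change" behavior in this data set: One reason for this is that the order in which papers within communities are combined is somewhat arbitrary.

　　3.2 The paper uses natural communities to solve the problem described in 3 above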

　　However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time.

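Notes on the second paragraph:

　　The second paragraph covers the research background.

　　1. The growing popularity of complex networks and dynamic networks

　　2. The main existing approaches: current work focuses on static properties and does not take dynamics into account. The authors then stress the importance of temporal information and announce that this paper is about evolution over time.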

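Notes on the third paragraph:

　　1. The paragraph opens by stating which clustering algorithm the experiments use: In our approach, we use agglomerative clusterings of the linked network.

　　2. How is agglomerative clustering applied to temporal evolution? The authors write: By clustering the network at different points in time, we study its temporal evolution.

　　3. Next come the weaknesses of this clustering approach:

　　3.1 Even a tiny change to the input data can change the clustering result a lot.

　　3.2 If all one wants is a static hierarchy, the result is still acceptable despite this weakness.

　　3.3 Here the goal is to observe changes over time, so this weakness becomes a serious problem.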

　　The paper states these three points as follows:

　　3.1 This approach places a new burden on the underlying clustering method. Clustering methods can be surprisingly sensitive to minor changes of the input data. (Why? I am not sure yet. One reason the paper gives, quoted in 3.1 of my first-paragraph notes, is that the order in which papers are merged is somewhat arbitrary.)

　　3.2 For obtaining a static view of the higher-level structure of the data, such instabilities may be acceptable because the resulting hierarchy often already reveals interesting structure.

　　3.3 However, in tracking changes over time, we need to be able to find corresponding communities in clusterings taken from the data at different points in time. If the clusterings are very sensitive to small perturbations of the input data, distinguishing between "real" changes versus "accidental" changes in the higher-level structure becomes difficult, if not impossible.

　　4. Next, how this weakness shows up in our data set and how it is dealt with (which is where natural communities come in).

　　4.1 How it shows up in this data set: In the clusterings of our linked network data, we found there are a large number of relatively random clusters that do not correspond to real community structures. These random clusters obscure the real temporal changes.

　　4.2 The solution: Fortunately, we found that, when performing a series of agglomerative clustering runs, each run on slightly perturbed input data, one can identify a stable set of clusters that occur in a significant proportion of the clusterings. Moreover, these stable clusters appear to correspond to the true underlying community structure of the network. We refer to such stable clusters as natural communities.

　　5. Summary of natural communities

　　We use the notion of natural communities to show that we can track these natural communities effectively over time, and can therefore characterize the temporal evolution of the network.

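Notes on the Data Set section:

　　This section mainly describes the choice of data. Briefly: the authors use the NEC CiteSeer database for 1990 through 2001, to which roughly 25,000 papers are added per year, about 250,000 papers in total. These 250,000 papers are stored fairly completely, so the information we need is available, and they are mostly computer science papers. The authors also include 1.6 million papers from before 1990, but the citation information for these is incomplete. In outline, the data we analyze therefore consists of about 250,000 papers with complete information plus 1.6 million papers with incomplete citation information.

　　I will not go into how the directed network is built: each edge is simply 0 or 1 depending on whether one paper references the other, and the out-degrees and in-degrees follow from that.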
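To make the construction concrete for myself, here is a minimal sketch of building such a directed citation graph from reference lists. The function name `build_citation_graph`, the input format, and the choice to ignore self-references are my own assumptions for illustration, not something taken from the paper.

```python
def build_citation_graph(references):
    """references: dict mapping a paper id to an iterable of cited paper ids.

    Returns (out_edges, in_degree). An edge either exists or not (0 or 1),
    so repeated references to the same paper collapse into one edge.
    """
    out_edges = {}
    in_degree = {}
    for paper, cited in references.items():
        targets = set(cited) - {paper}      # assumption: ignore self-references
        out_edges[paper] = targets
        in_degree.setdefault(paper, 0)      # papers nobody cites still get an entry
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1
    return out_edges, in_degree


# Toy data, invented for illustration only.
refs = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C", "C"]}
out_edges, in_degree = build_citation_graph(refs)
print(in_degree)
```

The out-degree of a paper is then `len(out_edges[paper])`, and the in-degree is its citation count.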

　　The part of this section I found most interesting is the following passage:

　　The basic statistics of this graph already reveal that its structure is very different from a standard random graph. About 1 in every 100 papers receives 20 citations, 1 in every 1,000 papers has 300 citations or more, and 18 papers of the 1.85 million have 1,000 citations. This pattern is indicative of the heavy-tailed nature of the data, characterized by a power law in the in-degree (8). An interesting research question concerns the role of the highly cited papers. For example, are such nodes essential in the definition of the hidden community structure or does such structure remain even after removing high degree nodes from the graph? Also, are such nodes essential in the formation of new communities?
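Statements like "1 in every 100 papers receives 20 citations" are tail fractions of the in-degree distribution. A small sketch of how one might compute them is below; the helper name `fraction_with_at_least` and the toy numbers are hypothetical and are not CiteSeer data.

```python
def fraction_with_at_least(in_degrees, k):
    """Fraction of papers whose citation count (in-degree) is at least k."""
    counts = list(in_degrees)
    if not counts:
        return 0.0
    return sum(1 for c in counts if c >= k) / len(counts)


# Invented in-degrees for ten papers, only to show the calculation.
toy_in_degrees = [0, 0, 1, 1, 2, 3, 5, 8, 40, 300]
for k in (1, 20, 300):
    print(k, fraction_with_at_least(toy_in_degrees, k))
```

In a heavy-tailed distribution these fractions shrink slowly as k grows, which is the pattern the quoted passage describes.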

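Notes on the Instabilities and Natural Communities section:

　　This section covers two topics: 1. Instabilities 2. Natural Communities. Before going into detail, it first describes the construction process (both the instability and the natural communities are discussed on top of the result).

　　Item 2 of my third-paragraph notes above already explained how the paper studies temporal evolution. Here is the clustering algorithm applied at each point in time:

　　The paper uses hierarchical agglomerative clustering, which most readers will know. The procedure: 1. starts with each paper in a cluster by itself; 2. at each stage, the two "closest" clusters are merged. This repeats until all papers are in a single cluster, which yields a clustering tree. The leaves of the tree are individual papers, and each internal node represents a cluster resulting from merging its two children. Similarity between two papers is measured with cosine similarity. Once papers have been merged into a cluster, the cluster's vector is the normalized sum of all of the individual papers' reference vectors. I omit the formula for the distance between two clusters C and C' (it is important, though, so look it up in the paper). (PS: The square root scaling factor is used to force smaller communities to merge together before larger ones. This particular scaling factor leads to well balanced merge trees.) (This PS refers to a term in that formula.)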
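The merge loop described above can be sketched as follows. This is a naive toy version under my own assumptions: it uses plain cosine distance between normalized sum vectors and deliberately leaves out the paper's square root scaling factor, since I have not reproduced that formula here. The names `normalize`, `cosine_distance` and `agglomerate` are illustrative only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def cosine_distance(u, v):
    # u and v are unit length here, so the dot product is the cosine similarity.
    return 1.0 - sum(a * b for a, b in zip(u, v))


def agglomerate(reference_vectors):
    """Return the merge tree as nested tuples of paper indices."""
    # Each cluster is (tree, summed reference vector, unit-length vector).
    clusters = [(i, list(vec), normalize(vec))
                for i, vec in enumerate(reference_vectors)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cosine_distance(clusters[a][2], clusters[b][2])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        tree_a, sum_a, _ = clusters[a]
        tree_b, sum_b, _ = clusters[b]
        # Assumption: sum the raw reference vectors, then normalize the sum.
        merged_sum = [x + y for x, y in zip(sum_a, sum_b)]
        merged = ((tree_a, tree_b), merged_sum, normalize(merged_sum))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


# Four toy papers; each row marks which of four references the paper cites.
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
tree = agglomerate(vectors)
print(tree)
```

Every internal node of the returned tuple structure corresponds to one cluster of the tree. The quadratic search for the closest pair is fine for a toy example but would not scale to a network of this size.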

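　　The paper then notes that this distance is a form of bibliographic coupling, mentions other, better bibliographic-coupling distance measures, and explains why it nevertheless uses the cosine-based measure rather than those.

　　The last paragraph says that, to check that the chosen clustering method and distance function are adequate, the authors compared them against k-means. (The comparison counts how many journals or periodicals are needed to cover 90% of the papers in a cluster.) (PS: most journals and periodicals publish only one type of paper, so the fewer of them needed to cover 90% of a cluster's papers, the better the cluster.)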
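As I understand this quality measure, it can be computed per cluster roughly as below. The function name `venues_needed` and the venue labels are made up for illustration; the paper may count venues differently.

```python
from collections import Counter


def venues_needed(venues, coverage=0.9):
    """Smallest number of venues that together hold `coverage` of the papers.

    venues: one venue label per paper in the cluster.
    """
    total = len(venues)
    covered = 0
    # Take venues from most to least frequent until enough papers are covered.
    for used, (_, count) in enumerate(Counter(venues).most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return used
    return 0  # empty cluster


focused = ["VenueA"] * 19 + ["VenueB"]
mixed = ["VenueA", "VenueB", "VenueC", "VenueD", "VenueE"] * 2
print(venues_needed(focused), venues_needed(mixed))
```

A lower number means the cluster is concentrated in few venues, which the notes above treat as a sign of a better cluster.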

==============

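　　Next, based on the instability of the clustering results, the authors construct natural communities and then make their final observations on those natural communities.

　　On the instability:

　　As noted above, small perturbations of the data can change the clustering result considerably, and the data used here does behave this way. To be able to observe changes in the network over time, the authors perturb the data by 5% each time and produce 45 clusterings. Clusters that are relatively stable across these 45 results are taken as natural communities, on the view that they reflect real changes in the network. The concrete procedure is as follows:

　　1. To determine the set of natural communities, we examine changes in the agglomerative clustering trees under minor perturbations of the input data. Applying the perturbation 45 times gives 45 data sets; hierarchical clustering on each of them yields 45 clustering trees, T1, T2, T3, ..., T45.

　　Take any cluster C in T1. For each of the other 44 clustering trees Ti, we look at the match score match(C,C') between C and any cluster C' in Ti; the formula for match(C,C') is in the paper. The paper then defines best_match(C,T) between a cluster C in T1 and each of the other 44 trees (that is, the best match(C,C') over the clusters C' of T). My description here is not very clear; the paper explains it better, and this sentence in particular conveys the authors' idea: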

　　Interestingly, we can take advantage of these instabilities, because these clusters are not uniformly unstable and therefore can be exploited to uncover the true hidden structure of the data.
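A sketch of how the 45 perturbed data sets might be generated is below. I am assuming that a "5% perturbation" means dropping a random 5% of the papers in each run; the notes above only say 5%, so treat that as my guess. The function name `perturbed_samples` is hypothetical.

```python
import random


def perturbed_samples(paper_ids, n_runs=45, drop_fraction=0.05, seed=0):
    """Return n_runs paper sets, each missing a random drop_fraction of papers."""
    rng = random.Random(seed)           # fixed seed so the runs are repeatable
    papers = sorted(paper_ids)
    n_keep = len(papers) - round(drop_fraction * len(papers))
    return [set(rng.sample(papers, n_keep)) for _ in range(n_runs)]


samples = perturbed_samples(range(1000))
print(len(samples), len(samples[0]))
```

Each set in `samples` would then be clustered separately to obtain the trees T1 to T45.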

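　　2. Definition of natural communities:

　　I already sketched the definition in item 1: take the first of the 45 clustering trees, T1, as the base tree. T1 contains many clusters C1, C2, C3, ... of different sizes (size being the number of papers in a cluster C):

　　　　Then take any cluster Ci in T1 and compute best_match(Ci, Tj) for j = 2, ..., 45. Whenever best_match(Ci, Tj) exceeds a threshold p, increment count by 1. After processing all 44 trees Tj, count is the number of best_match values above p. Then check whether count/45 exceeds a threshold f. If both conditions (p, f) hold, Ci is a natural community. Doing this for every cluster Ci in T1 yields all the natural communities.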

PS: For a long time I could not understand this definition of natural communities; in the end two senior labmates helped me out. Thanks again to Chen and Yang!

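　　3. Now the most important part, Tracking Natural Communities:

　　When the natural communities built above are used to track changes over time, how well do they actually work? The authors put the question this way, more clearly than I could: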

　　The key question remaining is how well the natural communities allow us to track the temporal evolution of the community structure in our network data.

　　For this question, the concrete requirement the authors state is: In particular, we need to validate that when the network evolves over time and a few years of papers are added

　　the following two things hold:

　　(i) there is not a dramatic shift in terms of natural communities,

　　(ii) the occurring changes have a plausible interpretation in terms of the evolution of the field.

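　　The authors then state that their method meets both criteria, which is why natural communities work well. In the paper's words: The results discussed below will show that our notion of natural communities satisfies both criteria, thereby making the concept a good candidate for use in temporal tracking in large networked data sets.

Only after that does the paper describe the method itself:

　　PS: Annoyingly, I had just typed a lot of text when the machine rebooted on its own...

　　So, once more, the authors' method. They split the 1990 to 2001 data into two groups, 1990 to 1998 and 1999 to 2001. On the 1999 to 2001 data there are roughly 100 natural communities containing between 100 and 350 papers each. They pick 18 of these (about 20%) for further analysis. These 18 natural communities contain 3,200 papers, which are taken together to build a citation subgraph, the P2001 subgraph (covering 1999 to 2001). The paper then describes how the subgraph for the earlier period, up to 1998, is built: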

　　We also removed some low-quality information: all core papers that reference fewer than five other papers and all noncore papers only referenced once. This reduced the size of the subgraph by 20%. We repeat this procedure to create our 1998 graph by starting with papers up to and including 1998 from the set P2001 (a subset of 2,791 papers).
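Here is how I picture the filtering described in that passage, as a rough sketch. The function name `build_subgraph`, the input format, the order of the two filters, and the decision to treat references to dropped core papers as noncore references are all my own assumptions.

```python
def build_subgraph(core_refs, years, cutoff_year, min_refs=5, min_citations=2):
    """core_refs: core paper id -> set of cited ids; years: core paper id -> year."""
    # Keep only core papers published up to and including the cutoff year.
    core = {p: set(r) for p, r in core_refs.items() if years[p] <= cutoff_year}
    # Drop core papers that reference fewer than min_refs other papers.
    core = {p: r for p, r in core.items() if len(r - {p}) >= min_refs}
    # Count how often each noncore paper is referenced by the remaining core papers.
    citations = {}
    for refs in core.values():
        for target in refs:
            if target not in core:
                citations[target] = citations.get(target, 0) + 1
    # Noncore papers referenced fewer than min_citations times are removed.
    weak = {t for t, n in citations.items() if n < min_citations}
    return {p: r - weak for p, r in core.items()}


# Toy data, invented for illustration only.
core_refs = {
    "p1": {"a", "b", "c", "d", "e"},
    "p2": {"a", "b", "c", "d", "f"},
    "p3": {"a", "b"},                   # too few references
    "p4": {"a", "b", "c", "d", "e"},    # published after the cutoff
}
years = {"p1": 1997, "p2": 1998, "p3": 1996, "p4": 2000}
graph_1998 = build_subgraph(core_refs, years, cutoff_year=1998)
print(graph_1998)
```

Calling the same function with a later `cutoff_year` would give the later graph, so the two snapshots are built by the same procedure.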

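　　This gives the dynamic, time-ordered network. The authors then analyze it, and the paper closes with related work.

　　Summary: this is the first paper on dynamic networks that I have read. I read it slowly; there is a lot in it that I do not fully understand, and it needs a fair amount of background knowledge. I no longer have the energy or time to write up what I have learned so far, so I will come back to it once I have read more papers. Remaining regrets: I rushed through the final part on tracking natural communities and did not understand it thoroughly, and the related work at the end is actually quite important, but I currently have no time to read it carefully or look up the related papers.

　　If you have also read this paper, let's discuss it. Anyone working on or interested in data mining is welcome too. Here are a few of the papers I plan to read next: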

　　[1] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In Proc. of the 12th ACM SIGKDD Conference, 2006.

　　[2] Y.-R. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng. FacetNet: A framework for analyzing communities and their evolutions in dynamic networks. In Proc. WWW2008, pages 685-694.

　　[3] M.-S. Kim and J. Han. A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks. In 2009 Int. Conf. on Very Large Data Bases, Lyon, France, August 2009.

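PPS: (This is the first paper I have studied in detail, and I am not sure how to read papers from here on; this way is too slow. For now I will read one this thoroughly just to get a feel for it.) If you work on dynamic networks, clustering, or bioinformatics, let's talk! My QQ: 56 4 56 2 69 7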