Reading notes on "Indexing the World Wide Web: the Journey So Far"

The paper itself can be found by searching Google for its title.

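Term preprocessing: split on whitespace, strip punctuation, strip apostrophes, normalize to lowercase, remove diacritics, stem (?), remove stop words, mine phrases.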
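A minimal sketch of the simpler steps in that pipeline, using only the Python standard library. The function names and the tiny `STOP_WORDS` set are hypothetical and for illustration only; stemming and phrase mining are deliberately left out because they need a real stemmer and corpus statistics.

```python
import string
import unicodedata

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text: str) -> list[str]:
    """Whitespace split, then apostrophes, punctuation, case, diacritics, stop words."""
    terms = []
    for token in text.split():
        # Drop both the ASCII and the typographic apostrophe.
        token = token.replace("'", "").replace("\u2019", "")
        token = token.translate(PUNCT_TABLE)
        token = strip_diacritics(token.lower())
        if token and token not in STOP_WORDS:
            terms.append(token)
    return terms
```

For example, `preprocess("The Café's menu, in Zürich!")` yields `["cafes", "menu", "zurich"]`.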

Engineering best practices for index design: term-level granularity, partitioning by doc, fully in-memory index.
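A toy sketch of what partitioning by doc with an in-memory, term-level index might look like. The `Shard` and `PartitionedIndex` classes and the modulo routing rule are assumptions made for illustration, not something taken from the paper: each document lives in exactly one shard, and a query is sent to every shard and the results are merged.

```python
from collections import defaultdict


class Shard:
    """One in-memory inverted index covering a subset of the documents."""

    def __init__(self):
        # term -> list of doc ids, in the order documents were added
        self.postings = defaultdict(list)

    def add(self, doc_id, terms):
        for term in set(terms):
            self.postings[term].append(doc_id)

    def search(self, term):
        # .get avoids creating an empty entry for unknown terms
        return self.postings.get(term, [])


class PartitionedIndex:
    """Routes each document to one shard; queries fan out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, terms):
        # Hypothetical routing rule: doc id modulo the number of shards.
        self.shards[doc_id % len(self.shards)].add(doc_id, terms)

    def search(self, term):
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(term))
        return sorted(hits)
```

In a real deployment each shard would sit on its own machine and the fan-out would run in parallel; here everything is in one process to keep the idea visible.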

Variable Byte encoding for index compression
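A short sketch of one common Variable Byte variant, applied to docid gaps. Conventions differ between implementations; this one assumes 7 payload bits per byte with the high bit set on the last byte of each number, and assumes doc ids arrive sorted in ascending order. The function names are hypothetical.

```python
def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative integer, 7 bits per byte, most significant first."""
    if n < 0:
        raise ValueError("variable byte encoding needs a non-negative integer")
    chunks = [n & 0x7F]          # lowest 7 bits
    n >>= 7
    while n:
        chunks.append(n & 0x7F)
        n >>= 7
    chunks.reverse()             # most significant chunk first
    chunks[-1] |= 0x80           # high bit marks the final byte of the number
    return bytes(chunks)


def vb_decode(data: bytes) -> list[int]:
    """Decode a byte stream produced by repeated vb_encode_number calls."""
    numbers = []
    n = 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:          # final byte of this number
            numbers.append(n)
            n = 0
    return numbers


def encode_postings(doc_ids: list[int]) -> bytes:
    """Store gaps between consecutive doc ids instead of the ids themselves."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> list[int]:
    """Rebuild doc ids by summing the decoded gaps."""
    doc_ids = []
    prev = 0
    for gap in vb_decode(data):
        prev += gap
        doc_ids.append(prev)
    return doc_ids
```

With the doc ids `[824, 829, 215406]` the gaps are 824, 5 and 214577, and `encode_postings` produces the six bytes `06 b8 85 0d 0c b1`, compared with twelve bytes for three fixed 32-bit integers.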

posting list: high impact -> high term freq -> sort by docid

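Index tiering: frequently updated, a small index of important docs; updated at medium frequency, a medium-sized index of fairly important docs; rarely updated, a large index of unimportant docs. The indexes are built with MapReduce and then written to GFS.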

Doc features for ranking:

term freq, key terms, title, heading, url depth, term proximity, term positions, term in first part of page, offensive terms, outgoing links, bad sentence/structure, avg length of good sentence, ratio of visible keywords to those invisible, topic, entity (time, location), PageRank, anchor text, click-queries

Original post: https://www.cnblogs.com/yaoyaohust/p/10007286.html