Nutch crawls fewer URLs than the site actually has

Symptom: nearly 500 URLs should be extractable from this site in total, but only 100 were actually extracted.

Cause: by default, Nutch keeps only the first 100 links parsed from a single page. The limit is set by this property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

Fix: raise this value, for example to 1000.
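
One way to apply the change is to override the property in conf/nutch-site.xml rather than editing the defaults file. The fragment below is a minimal sketch and assumes a standard Nutch layout in which nutch-site.xml overrides nutch-default.xml; the value 1000 is only an example and should be sized to the pages being crawled.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location): site-specific overrides -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Raised from the default of 100; example value only -->
    <value>1000</value>
  </property>
</configuration>
```

According to the property description quoted above, a negative value (for example -1) removes the limit entirely, so all outlinks on a page are processed.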
Original article: https://www.cnblogs.com/i80386/p/3957763.html