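First Steps with Nutch (1)

Nutch is a web crawler. Most of the examples online are based on version 0.7, and 0.8 differs from it in several places. It took me quite a while to get everything working.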

Environment: Tomcat 5.0.12 / JDK 1.5 / Nutch 0.8.1 / Cygwin

Setup:

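1. Nutch needs a Unix environment to run, so Windows users should first install Cygwin, free software that emulates a Unix environment on Windows. The online installer is available at http://www.cygwin.com/.

2. Download Nutch 0.8.1 from http://apache.justdn.org/lucene/nutch/. I unpacked it to D:\nutch-0.8.1.

3. In nutch-0.8.1, create a folder named urls. In urls, create a text file with any name (for example nutch) containing a single line: http://localhost:8080/. This is the site to be crawled.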
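
Step 3 can also be done from the Cygwin prompt. This is a minimal sketch that assumes the current directory is already the Nutch install directory; the file name nutch is simply the example name used above.

```shell
# Create the seed folder and a seed file with the single start URL.
# Assumes the current directory is the Nutch install directory.
mkdir -p urls
echo 'http://localhost:8080/' > urls/nutch

# Show what the crawler will be seeded with.
cat urls/nutch
```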

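4. Open conf under nutch-0.8.1, find crawl-urlfilter.txt, and locate these two lines:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
The second line holds a regular expression (marked in red in the original post). The URLs you want to crawl must match it. Here I changed it to match http://localhost:8080/.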
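
To sanity-check a filter pattern before crawling, you can try it against a few URLs in the shell. This is only a rough sketch: the pattern below is an assumed form of the edited line (without the leading +), and grep -E merely approximates the Java regular expressions Nutch uses, which is adequate for a pattern this simple. The helper name matches and the sample URLs are made up for illustration.

```shell
# Assumed pattern: the edited filter line without its leading "+".
pattern='^http://localhost:8080/'

# Hypothetical helper: succeeds if the URL matches the pattern.
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in 'http://localhost:8080/' \
           'http://localhost:8080/docs/index.html' \
           'http://example.com/'; do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
# prints:
# accept http://localhost:8080/
# accept http://localhost:8080/docs/index.html
# reject http://example.com/
```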

5. Now build the index for the site. Start Cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.8.1" to change to the nutch-0.8.1 directory.

6. Run the crawl command to fetch the site. Nutch's crawler can work in two ways. Intranet crawling targets an intranet or a small number of sites and uses the crawl command. Whole-web crawling targets the entire web and uses lower-level commands such as inject, generate, fetch, and updatedb.
Run "bin/nutch crawl urls -dir crawl -depth 5".
The options mean the following (see the Apache tutorial at http://lucene.apache.org/nutch/tutorial8.html):
-dir dir names the directory to put the crawl in.
-threads threads determines the number of threads that will fetch in parallel.
-depth depth indicates the link depth from the root page that should be crawled.
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
crawl.log: the log file (when the command's output is redirected to it)
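
To keep the options straight, the command line can be assembled from named values. The sketch below uses a hypothetical helper, crawl_cmd, which is not part of Nutch and only prints the command. The thread count 4 and topN 50 are illustrative values, not ones used in this article.

```shell
# Hypothetical helper (not part of Nutch): assemble the crawl command line.
crawl_cmd() {
  urls_dir=$1   # folder holding the seed URL file(s)
  out_dir=$2    # -dir: directory the crawl is written to
  depth=$3      # -depth: link depth from the root page
  threads=$4    # -threads: number of parallel fetch threads
  top_n=$5      # -topN: maximum pages retrieved per level
  echo "bin/nutch crawl $urls_dir -dir $out_dir -depth $depth -threads $threads -topN $top_n"
}

# Print the command for review; thread and topN values are illustrative.
crawl_cmd urls crawl 5 4 50
```

Printing the command first makes it easy to check the spacing between -dir crawl and -depth before starting a long crawl.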

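After the command finishes, nutch-0.8.1 contains a new crawl folder with five subfolders:
① / ② crawldb and linkdb: the web link directories. They store the URLs and the links between them and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as described later).
③ segments: stores the fetched pages and depends on the link depth above. With depth set to 5, one subfolder named by timestamp is created under segments for each crawl round, five in total. Each of these contains six more subfolders (from the Apache tutorial at http://lucene.apache.org/nutch/tutorial8.html):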

crawl_generate: names a set of urls to be fetched
crawl_fetch: contains the status of fetching each url
content: contains the content of each url
parse_text: contains the parsed text of each url
parse_data: contains outlinks and metadata parsed from each url
crawl_parse: contains the outlink urls, used to update the crawldb

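④ indexes: the index directory. My run produced a folder named "part-00000" here.
⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-0.8.1\lib). It is the complete index produced by merging all the indexes under indexes. Note that the index only indexes the page content and does not store it, so a query has to read the segments directory to get the page content.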
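
A quick way to confirm that a crawl produced all five folders is a small check like the following. It is a sketch built around a hypothetical helper, check_crawl_dir, and assumes the crawl was written to a folder named crawl in the current directory.

```shell
# Hypothetical helper: report which of the five expected folders exist.
# Returns 0 if all are present, 1 if any is missing.
check_crawl_dir() {
  crawl_dir=$1
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ -d "$crawl_dir/$sub" ]; then
      echo "found   $sub"
    else
      echo "missing $sub"
      missing=1
    fi
  done
  return $missing
}

# Assumes the crawl output folder is ./crawl.
check_crawl_dir crawl || echo "crawl output looks incomplete"
```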
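
7. Run a simple test. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean tomcat", which calls the main method of NutchBean to search for the keyword "tomcat". Cygwin shows the result: Total hits: 67
Note: if the search always returns 0 hits, you need to configure nutch-site.xml under nutch-0.8.1\conf (I struggled with this for a long time!). Use the same content as in step 9, then rerun step 6. Setting depth to 1 in step 6 can also lead to 0 hits.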

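8. Now test under Tomcat. nutch-0.8.1 contains nutch-0.8.1.war. Copy it to Tomcat\webapps; Tomcat unpacks it automatically on startup into a folder named nutch-0.8.1.

9. Open nutch-site.xml under nutch\WEB-INF\classes. The two <property> elements below are the additions (marked in red in the original post); the rest is the original content of nutch-site.xml.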
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>http.agent.name</name>
<value>*</value>
<description></description>
</property>

<property>
<name>searcher.dir</name>
<value>D:\nutch-0.8.1\crawl</value>
<description></description>
</property>
</configuration>
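http.agent.name: required. Without this property the search always returns 0 results (many examples online fail to mention this!)
searcher.dir: the path of the crawl directory generated earlier in Cygwin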
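
Because a missing property shows up only as zero hits, it can help to check the file before searching. The following is a rough sketch: has_property is a hypothetical helper that does a plain text match rather than real XML parsing, and the path conf/nutch-site.xml is an assumption that should be adjusted to the copy you are checking (for example the one under the web app's WEB-INF\classes).

```shell
# Assumed location; point this at the nutch-site.xml you want to check.
site_xml=conf/nutch-site.xml

# Hypothetical helper: succeeds if the file declares the named property.
# Plain text match on the <name> element, not a real XML parse.
has_property() {
  grep -qF "<name>$2</name>" "$1" 2>/dev/null
}

for name in http.agent.name searcher.dir; do
  if has_property "$site_xml" "$name"; then
    echo "ok: $name"
  else
    echo "MISSING: $name"
  fi
done
```
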
In the same file you can also set the re-fetch interval in days (step 6 mentioned that pages expire after 30 days by default):
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description></description>
</property>

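Many more parameters are listed in nutch-default.xml under nutch-0.8.1\conf. Every property there carries a description, so if you are interested you can copy individual properties into Tomcat\webapps\nutch\WEB-INF\classes\nutch-site.xml and experiment with them.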

10. Open http://localhost:8081/nutch and search for "tomcat". The page reports 67 results ("共有67项查询结果"), which matches the simple test in step 7.

The result looks like this: