Configuring Nutch 1.2 in Eclipse

Original article: http://blog.sina.com.cn/s/blog_7645c67301017ban.html

1. Download Nutch 1.2 into a directory of your choice, then open Eclipse and create a new Java project. Select "Create project from existing source" and point it at the Nutch directory.

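2. On the next page, switch to the "Libraries" tab, click the "Add Class Folder..." button, and select "conf" from the list. Then switch to the "Order and Export" tab, find "conf", and move it to the top.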

3. On the "Source" tab, set the output folder to Nutch/bin/tmp_build, then click Finish to complete the import.

4. Edit the configuration files: nutch-default.xml, nutch-site.xml, and crawl-urlfilter.txt.
1) nutch-default.xml
Change the plugin.folders property so that its value is ./src/plugin:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value> 
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
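
A relative plugin.folders value only helps if it can actually be resolved at run time. As a quick sanity check, here is a minimal sketch (not part of Nutch; the class name PluginFolderCheck and the method listPlugins are made up for illustration) that lists the sub-directories of ./src/plugin containing a plugin.xml. It assumes the program runs with the Nutch project root as its working directory, which is the Eclipse default for a run configuration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not a Nutch class.
public class PluginFolderCheck {

    /** Returns the sorted names of sub-directories of pluginRoot that contain a plugin.xml. */
    static List<String> listPlugins(Path pluginRoot) {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(pluginRoot)) {
            return names; // wrong working directory or wrong plugin.folders value
        }
        try (Stream<Path> children = Files.list(pluginRoot)) {
            children.filter(p -> Files.isRegularFile(p.resolve("plugin.xml")))
                    .forEach(p -> names.add(p.getFileName().toString()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Assumption: run from the Nutch project root, so ./src/plugin resolves.
        Path root = Paths.get(args.length > 0 ? args[0] : "./src/plugin");
        System.out.println("Resolved to: " + root.toAbsolutePath().normalize());
        System.out.println("Plugins found: " + listPlugins(root));
    }
}
```

If the list comes back empty, the working directory or the plugin.folders value is the first thing to check.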

2) nutch-site.xml

Add the following inside <configuration></configuration>:

<property>
  <name>http.agent.name</name>
  <value>my nutch agent</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.2</value>
</property>
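
A misplaced tag in nutch-site.xml is easy to miss, so it can be worth checking that the file is well-formed and that http.agent.name really is set before starting a crawl. The sketch below uses only the JDK's XML parser; the class name SiteConfigCheck and the method readProperties are invented for this example, and it reads the XML from a string rather than from the real conf/nutch-site.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch or Hadoop class.
public class SiteConfigCheck {

    /** Parses Hadoop-style configuration XML and returns name -> value pairs in file order. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // The two properties from this step, wrapped in the <configuration> root element.
        String xml =
              "<?xml version=\"1.0\"?>\n"
            + "<configuration>\n"
            + "  <property>\n"
            + "    <name>http.agent.name</name>\n"
            + "    <value>my nutch agent</value>\n"
            + "  </property>\n"
            + "  <property>\n"
            + "    <name>http.agent.version</name>\n"
            + "    <value>1.2</value>\n"
            + "  </property>\n"
            + "</configuration>\n";
        System.out.println(readProperties(xml));
    }
}
```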

3) crawl-urlfilter.txt

Delete: MY.DOMAIN.NAME

Add: +^http://([a-z0-9]*\.)*qq.com/

My own approach: simply set the "accept anything else" rule to "+.".
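
To see what these two rules do, here is a rough sketch of a first-match-wins URL filter in plain Java. It is not Nutch's own filter code: the class UrlFilterSketch and the method accepts are hypothetical, and the sketch assumes that a rule matches when its regex is found anywhere in the URL. The domain rule is written with an escaped dot (\.) after the sub-domain part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not a Nutch class.
public class UrlFilterSketch {

    /** One rule line: a sign ('+' accept, '-' reject) followed by a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    /** Returns the verdict of the first rule whose regex is found in the URL. */
    static boolean accepts(List<String> ruleLines, String url) {
        for (String line : ruleLines) {
            Rule rule = new Rule(line);
            if (rule.pattern.matcher(url).find()) { // assumption: "find" semantics
                return rule.accept;
            }
        }
        return false; // no rule matched: the URL is dropped
    }

    public static void main(String[] args) {
        List<String> domainOnly = new ArrayList<>();
        domainOnly.add("+^http://([a-z0-9]*\\.)*qq.com/");
        List<String> acceptAll = new ArrayList<>();
        acceptAll.add("+.");

        String[] urls = {
            "http://www.qq.com/",
            "http://news.qq.com/a.html",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(url
                    + " domainOnly=" + accepts(domainOnly, url)
                    + " acceptAll=" + accepts(acceptAll, url));
        }
    }
}
```

With the domain rule only qq.com hosts pass, while "+." lets any non-empty URL through, so the crawl is no longer confined to one site.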

In the Nutch directory, create weburls.txt and add the entry URL: http://www.qq.com

5. Run the crawl

Run the Crawl class (org.apache.nutch.crawl.Crawl) as a Java application.
Set up the run configuration:
Program arguments:

weburls.txt -dir localweb -depth 50 -topN 100 -threads 2

VM arguments:

 -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
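
The same run configuration can be approximated from code, which is handy if you would rather start the crawl from a small launcher class than from the Eclipse dialog. The following is only a sketch: CrawlLauncherSketch and buildArgs are hypothetical names, only the URL list and the -dir/-depth/-topN/-threads options are passed on, and org.apache.nutch.crawl.Crawl is looked up by reflection so that the file compiles even without Nutch on the classpath. It also assumes that setting the two hadoop.log.* system properties before the crawl starts has the same effect as the VM arguments.

```java
import java.lang.reflect.Method;

// Hypothetical launcher, not part of Nutch.
public class CrawlLauncherSketch {

    /** Builds the argument list that would otherwise go under "Program arguments". */
    static String[] buildArgs(String urlFile, String outDir, int depth, int topN, int threads) {
        return new String[] {
            urlFile,
            "-dir", outDir,
            "-depth", String.valueOf(depth),
            "-topN", String.valueOf(topN),
            "-threads", String.valueOf(threads)
        };
    }

    public static void main(String[] ignored) throws Exception {
        // Counterpart of: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        String[] crawlArgs = buildArgs("weburls.txt", "localweb", 50, 100, 2);
        try {
            // Looked up by name so this sketch compiles without Nutch present.
            Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
            Method main = crawl.getMethod("main", String[].class);
            main.invoke(null, (Object) crawlArgs);
        } catch (ClassNotFoundException e) {
            System.err.println("org.apache.nutch.crawl.Crawl not found; run this inside the Nutch project.");
        }
    }
}
```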

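6. You may run into errors, such as a class not being found. In that case you may need to rebuild Nutch with ant: change to the Nutch installation root, run the ant command, and try again once the build succeeds.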

PS: The parts of this article shown in blue are my own approach; the rest is drawn from other articles.