Installing Nutch 1.2 in MyEclipse on Windows (finally, a setup that runs without errors)

1. Download and install Cygwin. Installation and environment setup are not covered here. Add %CYGWIN_HOME%\bin to the PATH environment variable.

2. Import Nutch into Eclipse

① In Eclipse, choose File > New > Project > Java Project.
Enter any project name, select "Create project from existing source", and browse to the directory where Nutch was extracted, e.g. D:\nutch-1.2.

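② Under "Add Class Folder" (on the Libraries tab), select the conf folder.

③ Then set a "Default output folder". Any name works except bin: if bin is the default output folder, Eclipse clears it on every build, which deletes the other files under bin and causes further problems.

④ Click Finish.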

3. Edit the Nutch configuration files. This walkthrough crawls www.163.com as the example.

① Edit nutch-site.xml under D:\nutch-1.2\conf:

<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-1.2</value>
<description>HTTP 'User-Agent'</description>
</property>
<property>
<name>searcher.dir</name>
<value>D:\nutch-1.2\crawl</value>
<description>Path to root of crawl.</description>
</property>
</configuration>
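
Before launching a crawl it can help to confirm that the file really contains the properties you expect, since a crawl with an empty http.agent.name typically stops at the fetch step. The following is a minimal sketch, not part of Nutch: the class name NutchSiteCheck and its readProperties helper are hypothetical, and it only uses the JDK's own XML parser to read the name/value pairs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch class.
class NutchSiteCheck {
    // Collects every <property><name>/<value> pair of a Hadoop-style config into a map.
    static Map<String, String> readProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Inline copy of the two properties from this step; a real check would read the file.
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>http.agent.name</name><value>nutch-1.2</value></property>"
                + "<property><name>searcher.dir</name><value>D:\\nutch-1.2\\crawl</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(xml);
        String agent = props.get("http.agent.name");
        if (agent == null || agent.isEmpty()) {
            System.out.println("http.agent.name is missing or empty");
        } else {
            System.out.println("http.agent.name = " + agent);
        }
        System.out.println("searcher.dir = " + props.get("searcher.dir"));
    }
}
```

To check the real file, read conf\nutch-site.xml into a string and pass it to readProperties instead of the inline sample.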

② Edit crawl-urlfilter.txt under D:\nutch-1.2\conf:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*163\.com/
# skip everything else
-.
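
To see which URLs an accept rule of this shape lets through for the www.163.com crawl, the pattern can be tried with plain java.util.regex. This is only a sketch of the matching behaviour, assuming the first matching rule wins and that anything not accepted is rejected. The class UrlFilterSketch is hypothetical and is not the filter implementation Nutch itself uses.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a two-rule filter file: one accept rule, then reject everything else.
class UrlFilterSketch {
    // The dots in "163\.com" are escaped so they match only a literal '.'.
    static final Pattern ACCEPT = Pattern.compile("^http://([a-z0-9]*\\.)*163\\.com/");

    // Returns true when the URL would be kept by the accept rule.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.163.com/",
            "http://news.163.com/sports/",
            "http://www.163.info/",
            "http://www.example.com/"
        };
        for (String url : samples) {
            // '+' marks an accepted URL, '-' a rejected one, mirroring the filter file syntax.
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

The first two sample URLs are accepted and the last two are rejected, so a rule written for a different domain (for example one ending in .info) would filter out every page of the 163.com seed.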

③ Edit nutch-default.xml under D:\nutch-1.2\conf and set plugin.folders to ./src/plugin:

<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>

④ Under D:\nutch-1.2\, create a folder named urls, and inside it create a text file named url.txt containing:

http://www.163.com/
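
If you prefer to create the seed list from code rather than by hand, the same folder and file can be written with java.nio. This is a minimal sketch: the class SeedListWriter and its writeSeed method are hypothetical names, and the default path D:\nutch-1.2 is simply the example location used in this walkthrough.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical helper that creates urls/url.txt with a single seed URL.
class SeedListWriter {
    static Path writeSeed(Path nutchHome, String seedUrl) throws IOException {
        Path urlsDir = nutchHome.resolve("urls");
        Files.createDirectories(urlsDir);          // no error if the folder already exists
        Path seedFile = urlsDir.resolve("url.txt");
        // One URL per line; an existing file is overwritten.
        Files.write(seedFile, Collections.singletonList(seedUrl), StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Assumed install location; pass another directory as the first argument if needed.
        Path home = Paths.get(args.length > 0 ? args[0] : "D:\\nutch-1.2");
        System.out.println("Wrote " + writeSeed(home, "http://www.163.com/"));
    }
}
```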

4. Run Nutch in Eclipse

① Choose Run > Open Run Dialog.

② Enter any name for the configuration.

③ For Main class, enter

org.apache.nutch.crawl.Crawl

④ For Program arguments, enter

urls -dir crawl -depth 3 -topN 50

⑤ For VM arguments, enter

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
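
For reference, the run configuration above corresponds to an ordinary java command line: the VM arguments come before the main class and the program arguments after it. The sketch below only assembles and prints that command, it does not start Nutch. The class CrawlCommand is hypothetical, and the classpath is a placeholder you would have to fill in yourself (Eclipse builds it from the project settings). In the arguments, urls is the seed folder, -dir names the output directory, -depth is the number of link levels to follow, and -topN caps the pages fetched per level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that spells out the command line behind the Eclipse run configuration.
class CrawlCommand {
    static List<String> build(String classpath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        // VM arguments: where the Hadoop logging setup writes its log file.
        cmd.add("-Dhadoop.log.dir=logs");
        cmd.add("-Dhadoop.log.file=hadoop.log");
        cmd.add("-cp");
        cmd.add(classpath);
        // Main class, followed by the program arguments.
        cmd.add("org.apache.nutch.crawl.Crawl");
        cmd.addAll(Arrays.asList("urls", "-dir", "crawl", "-depth", "3", "-topN", "50"));
        return cmd;
    }

    public static void main(String[] args) {
        // "<classpath>" is a placeholder, not a real value.
        System.out.println(String.join(" ", build("<classpath>")));
    }
}
```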

OK, run it and watch Nutch crawl away.