Importing nutch-2.1 into Eclipse and running it with MySQL

This is my first time working with Nutch, so I am writing the steps down.

First, set up the database:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) default NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) default NULL,
  `score` float default NULL,
  `typ` varchar(32) default NULL,
  `baseUrl` varchar(767) default NULL,
  `content` longblob,
  `title` varchar(2048) default NULL,
  `reprUrl` varchar(767) default NULL,
  `fetchInterval` int(11) default NULL,
  `prevFetchTime` bigint(20) default NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) default NULL,
  `retriesSinceFetch` int(11) default NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED;

Install the SVN, Ivy, and Ant plugins in Eclipse

These are the plugins the Nutch project uses; install them yourself.

The remote SVN repository URL for Nutch 2.1:

https://svn.apache.org/repos/asf/nutch/tags/release-2.1

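Check out the project.

Accept the defaults, click Finish, and create it as a Java project.

Wait for the download to finish.

After the download finishes (note: the project shown as nutch2 here is renamed to nutch-2.1 below):

In Project Explorer, right-click the project, choose Properties, and open Java Build Path.
Use Add Folder to select the folders listed below, and also add src/java and src/test from each plugin project under src/plugin: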

src/bin
src/java
src/test
src/testresources

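This step can also be done by editing the project's .classpath file directly and then refreshing the project so that the entries are picked up automatically. This is more convenient, but check that no incorrect entries were added.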
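If you edit .classpath by hand, the plugin source entries can be generated rather than typed. The following is a minimal sketch, assuming the layout described above (plugins under src/plugin, each with an optional src/java and src/test folder). The function name `plugin_source_entries` is hypothetical and not part of Nutch or Eclipse.

```python
from pathlib import Path


def plugin_source_entries(project_root):
    """Return .classpath <classpathentry> lines for every plugin source folder that exists."""
    root = Path(project_root)
    plugin_dir = root / "src" / "plugin"
    entries = []
    if not plugin_dir.is_dir():
        return entries
    for plugin in sorted(p for p in plugin_dir.iterdir() if p.is_dir()):
        for sub in ("src/java", "src/test"):
            # Only emit folders that are really there, to avoid broken build-path entries.
            if (plugin / sub).is_dir():
                rel = f"src/plugin/{plugin.name}/{sub}"
                entries.append(f'    <classpathentry kind="src" path="{rel}"/>')
    return entries
```

Calling `print("\n".join(plugin_source_entries(".")))` from the project root prints lines that can be pasted between the `<classpath>` tags.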

Contents of .classpath:

<?xml version="1.0" encoding="UTF-8"?>
<classpath>
    <classpathentry kind="src" path="conf"/>
    <classpathentry kind="src" path="src/java"/>
    <classpathentry kind="src" path="src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
    <classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
    <classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
    <classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
    <classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
    <classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
    <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
    <classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
    <classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
    <classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
    <classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
    <classpathentry kind="src" path="src/bin"/>
    <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
    <classpathentry kind="src" path="src/plugin/tld/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
    <classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-more/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
    <classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
    <classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-more/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
    <classpathentry kind="src" path="src/testresources"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=ivy%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fcreativecommons%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ffeed%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-anchor%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-basic%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-more%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flanguage-identifier%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-http%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-nekohtml%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-regex-filter%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-xml%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fmicroformats-reltag%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fnutch-extensionpoints%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-ext%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-html%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-js%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-swf%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-tika%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-zip%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-file%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-ftp%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-http%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-httpclient%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-sftp%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-link%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-opic%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fsubcollection%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ftld%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-automaton%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-domain%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-prefix%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-regex%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-suffix%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-validator%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-basic%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-pass%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-regex%2Fivy.xml&amp;confs=*"/>
    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
    <classpathentry kind="con" path="org.eclipse.jdt.junit.JUNIT_CONTAINER/4"/>
    <classpathentry kind="lib" path="lib/org.restlet-2.0.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.example.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.atom_1.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.atom.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.crypto.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.fileupload_1.2.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker_2.3.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.grizzly.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.gwt.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.httpclient.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jaas.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jackson.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jaxb_2.1.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs_1.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs-2.0-RC3.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.jibx_1.1.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.json_2.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.json.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.net.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.odata.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.rdf.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0-RC3.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.servlet.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.spring_2.5.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.spring-2.0.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.velocity_1.5.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.wadl_1.0.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.xml.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.ext.xstream.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.gae-2.0-RC3.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.gwt.jar"/>
    <classpathentry kind="lib" path="lib/org.restlet.lib.org.json-2.0.jar"/>
    <classpathentry kind="lib" path="src/plugin/urlfilter-automaton/lib/automaton.jar"/>
    <classpathentry kind="lib" path="lib/mysql-connector-java-5.0.7.jar"/>
    <classpathentry kind="output" path="bin"/>
</classpath>
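Because a hand-edited .classpath can easily contain a wrong path, a quick check may help. This is a sketch only, assuming that the `src` and `lib` entries are relative to the project root, as they are in the file above. The helper name `missing_classpath_paths` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def missing_classpath_paths(classpath_file):
    """List src/lib paths named in an Eclipse .classpath file that do not exist on disk."""
    classpath_file = Path(classpath_file)
    project_root = classpath_file.parent
    missing = []
    for entry in ET.parse(str(classpath_file)).getroot().iter("classpathentry"):
        kind = entry.get("kind")
        path = entry.get("path", "")
        # Only "src" and "lib" entries point at project files; "con" and "output" do not.
        if kind in ("src", "lib") and not (project_root / path).exists():
            missing.append(path)
    return missing
```

An empty list from `missing_classpath_paths(".classpath")` suggests every source folder and jar named in the file is present.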

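After refreshing the project, the result is the same as with the manual steps above.

Next, on the Order and Export tab, move conf to the top so that it is loaded first.

Once that is done, the next step is importing the dependency jars.

With the Ivy plugin installed, you can right-click an ivy.xml file directly and choose Add Ivy Library.

Click Finish and the jars are downloaded automatically. Note that the project contains many ivy.xml files, and Add Ivy Library has to be run once on every one that declares jars.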
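To shorten the search for the ivy.xml files that actually declare jars, something like the following could be used. It is a sketch under the assumption that the Ivy files use plain, un-namespaced `<dependency>` elements. The function name `ivy_files_with_dependencies` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def ivy_files_with_dependencies(project_root):
    """Return relative paths of ivy.xml files that declare at least one <dependency>."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(project_root):
        if "ivy.xml" not in filenames:
            continue
        path = os.path.join(dirpath, "ivy.xml")
        root = ET.parse(path).getroot()
        # Commented-out dependencies are skipped because the XML parser ignores comments.
        if next(root.iter("dependency"), None) is not None:
            found.append(os.path.relpath(path, project_root).replace(os.sep, "/"))
    return sorted(found)
```

Each path it returns is a candidate for Add Ivy Library; files with an empty or fully commented-out dependency list are left out.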

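Finding them all takes some time.

Even after all the Ivy libraries have been imported, a few jars will still be missing.

(Download these from the web yourself; I added several jars of my own at this point.)

In addition, the hadoop-core jar needs a change to FileUtil.java.

For details, see http://yangshangchuan.iteye.com/blog/1839784

Excerpted here (without the change, an error is reported at run time):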

Error message:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ysc\mapred\staging\ysc-2036315919\.staging to 0700
 
Official bug reference:
https://issues.apache.org/jira/browse/HADOOP-7682
 
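Solution:
1. Download and extract http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
2. Edit hadoop-1.1.2\src\core\org\apache\hadoop\fs\FileUtil.java: search for "Failed to set permissions of path", find line 689, and change throw new IOException to LOG.warn
3. Edit hadoop-1.1.2\build.xml: search for autoreconf and remove the 6 matching exec entries that have executable="autoreconf"
4. Download and extract Ant, and add the bin directory under the Ant directory to the PATH environment variable
5. In a Cygwin shell, change to the hadoop-1.1.2 directory and run ant
6. Replace Nutch's hadoop-core-1.0.3.jar with the newly built hadoop-1.1.2\build\hadoop-core-1.1.3-SNAPSHOT.jar
7. For development in Eclipse, replace C:\Users\ysc\.ivy2\cache\org.apache.hadoop\hadoop-core\jars\hadoop-core-1.1.2.jar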
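Step 3 above can also be scripted. The sketch below works on the XML text and removes every `exec` element whose `executable` attribute is `autoreconf`. Note the assumptions: it uses Python's ElementTree, which drops XML comments and the original formatting, so run it on a copy and compare the result. The function name `strip_autoreconf` is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_autoreconf(build_xml_text):
    """Return (new_xml_text, removed_count) with every <exec executable="autoreconf"> removed."""
    root = ET.fromstring(build_xml_text)
    removed = 0
    # ElementTree has no parent pointers, so visit each element and filter its children.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "exec" and child.get("executable") == "autoreconf":
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

For the hadoop-1.1.2 build.xml the steps above expect the removed count to be 6; a different number would be a reason to go back and edit the file by hand.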
 
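The attached jar is a patched build of Hadoop 1.2.1. It works with Nutch 1.7 and has not been tested with other Nutch versions.

When I made this change, I simply downloaded that jar and used it to replace the hadoop-core jar in the Ivy cache, keeping the same file name.

Download: http://pan.baidu.com/s/1i3FBLEP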
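Since the replacement overwrites a jar in the Ivy cache, keeping a backup of the original is a sensible precaution. A minimal sketch follows; the function name `replace_jar` and the `.bak` suffix are arbitrary choices for illustration, not anything the original post prescribes.

```python
import shutil
from pathlib import Path


def replace_jar(patched_jar, target_jar):
    """Back up target_jar as <name>.bak (only once) and overwrite it with patched_jar."""
    patched_jar, target_jar = Path(patched_jar), Path(target_jar)
    backup = target_jar.with_name(target_jar.name + ".bak")
    # Keep the first backup so repeated runs never overwrite the untouched original.
    if target_jar.exists() and not backup.exists():
        shutil.copy2(target_jar, backup)
    shutil.copy2(patched_jar, target_jar)
    return backup
```

The target keeps its file name, which matches the note above that the replacement jar must be named the same as the one it replaces.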

Next comes the configuration.

In nutch2.1/conf, open gora.properties and add:

    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver  
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true  
    gora.sqlstore.jdbc.user=root  
    gora.sqlstore.jdbc.password=root  

Also comment out the other database connection settings.

In ivy/ivy.xml, uncomment the mysql-connector dependency.
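A small check can confirm that the four gora.sqlstore.* settings listed earlier are present and not commented out. This is a sketch with a deliberately simplified parser (key=value lines only; it does not handle every feature of Java properties files such as line continuations). The function names are hypothetical.

```python
REQUIRED_KEYS = (
    "gora.sqlstore.jdbc.driver",
    "gora.sqlstore.jdbc.url",
    "gora.sqlstore.jdbc.user",
    "gora.sqlstore.jdbc.password",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first "=" only, so "=" inside a JDBC URL stays in the value.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_sqlstore_keys(text):
    """Return the required gora.sqlstore.* keys that are absent or commented out."""
    props = parse_properties(text)
    return [key for key in REQUIRED_KEYS if key not in props]
```

Passing the contents of conf/gora.properties to `missing_sqlstore_keys` should return an empty list once the settings are in place.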

Add the following inside the configuration element of conf/nutch-site.xml.template:

    <property>  
    <name>http.agent.name</name>  
    <value>Your Nutch Spider</value>  
    </property>  
      
    <property>  
    <name>http.accept.language</name>  
    <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>  
    <description>Value of the “Accept-Language” request header field.  
    This allows selecting non-English language as default one to retrieve.  
    It is a useful setting for search engines build for certain national group.  
    </description>  
    </property>  
      
    <property>  
    <name>parser.character.encoding.default</name>  
    <value>utf-8</value>  
    <description>The character encoding to fall back to when no other information  
    is available</description>  
    </property>  
      
    <property>  
      <name>plugin.includes</name>  
     <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>  
     <description>Regular expression naming plugin directory names to  
      include.  Any plugin not matching this expression is excluded.  
      In any case you need at least include the nutch-extensionpoints plugin. By  
      default Nutch includes crawling just HTML and plain text via HTTP,  
      and basic indexing and search plugins. In order to use HTTPS please enable   
      protocol-httpclient, but be aware of possible intermittent problems with the   
      underlying commons-httpclient library.  
      </description>  
    </property>  
      
    <property>  
    <name>storage.data.store.class</name>  
    <value>org.apache.gora.sql.store.SqlStore</value>  
    <description>The Gora DataStore class for storing and retrieving data.  
    Currently the following stores are available: ….  
    </description>  
    </property>  
      
    <property>  
      <name>plugin.folders</name>  
      <value>./src/plugin</value>  
      <description>Directories where nutch plugins are located.  Each  
      element may be a relative or absolute path.  If absolute, it is used  
      as is.  If relative, it is searched for on the classpath.</description>  
    </property>   
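The plugin.includes value is a regular expression over plugin directory names, so it can be tried out before running a crawl. The sketch below assumes the expression has to match the whole directory name; that matching rule is an assumption here, not something taken from the Nutch source. The function name `included_plugins` is hypothetical.

```python
import re

# Value of plugin.includes copied from the configuration above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def included_plugins(plugin_dirs, pattern=PLUGIN_INCLUDES):
    """Return the directory names that the expression matches in full."""
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]
```

For example, `included_plugins(["parse-html", "parse-js", "index-more"])` keeps only `parse-html`, because the other two names are not covered by the expression.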

In build.xml in the project root, find the following code:

    <target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">  
      <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />  
      <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />  
      <antcall target="copy-libs" />  
     </target>  

Change the original

    pattern="${build.lib.dir}/[artifact]-[revision].[ext]"  

to

    pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"

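This avoids a build failure when Ivy retrieves the dependencies again. The reason is that Ivy downloads both the class jar and the source jar of a dependency. With the original pattern the two files get the same name and cannot be told apart, which causes a duplicate-file error.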
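The name clash can be reproduced with a few lines of string substitution. This sketch only imitates how the `[token]` placeholders expand; it is not Ivy's implementation, and the artifact `example-lib` is a made-up name.

```python
def expand(pattern, artifact):
    """Fill Ivy-style [token] placeholders from a dict of artifact attributes."""
    result = pattern
    for token, value in artifact.items():
        result = result.replace(f"[{token}]", value)
    return result


# Hypothetical artifacts: the same module published as a class jar and a source jar.
ARTIFACTS = [
    {"artifact": "example-lib", "revision": "1.0", "type": "jar", "ext": "jar"},
    {"artifact": "example-lib", "revision": "1.0", "type": "source", "ext": "jar"},
]

OLD_PATTERN = "[artifact]-[revision].[ext]"
NEW_PATTERN = "[artifact]-[type]-[revision].[ext]"

# With the old pattern both artifacts expand to one file name; with [type] they differ.
old_names = [expand(OLD_PATTERN, a) for a in ARTIFACTS]
new_names = [expand(NEW_PATTERN, a) for a in ARTIFACTS]
```

Here `old_names` holds the same name twice, while `new_names` holds two distinct names, which is why adding `[type]` to the pattern removes the conflict.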

After completing the steps above, run the Ant build from build.xml; it generates the runtime directory.

Add a urls folder in the project root and put a seed.txt file in it containing one site URL, for example http://nutch.apache.org/
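Creating the seed file can be scripted as well. A minimal sketch, assuming one URL per line in urls/seed.txt as described above; the function name `write_seed_file` is hypothetical.

```python
from pathlib import Path


def write_seed_file(project_root, urls):
    """Create <project_root>/urls/seed.txt with one URL per line and return its path."""
    seed_dir = Path(project_root) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("".join(url.strip() + "\n" for url in urls), encoding="utf-8")
    return seed_file
```

Running `write_seed_file(".", ["http://nutch.apache.org/"])` from the project root produces the file used in the rest of this walkthrough.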
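Open the Crawler class in the crawl package under src/java and go to its Run Configuration.

The first tab is already filled in by default.


Select the second tab, Arguments, and enter:

Program arguments:
urls -depth 3 -topN 5

VM arguments:
-Xms64m -Xmx512m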


Finally, click Run to crawl the link information of that site.

When it runs, it prints output like this:

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 84 84 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Parsing http://nutch.apache.org/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 6 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://code.google.com/p/crawler-commons/
-finishing thread FetcherThread1, activeThreads=9
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread3, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread0, activeThreads=5
-finishing thread FetcherThread8, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread9, activeThreads=2
0/2 spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 136 136 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.4 0.0 pages/s, 68 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.3 0.0 pages/s, 45 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.2 0.0 pages/s, 34 0 kb/s, 0 URLs in 2 queues
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread4, activeThreads=1
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread5, activeThreads=0
0/0 spinwaiting/active, 6 pages, 2 errors, 0.2 0.4 pages/s, 27 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Parsing http://accumulo.apache.org/
Parsing http://avro.apache.org/
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Skipping http://www.apache.org/; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Skipping http://www.apache.org/foundation/sponsorship.html; different batch id (null)
Skipping http://www.apache.org/foundation/thanks.html; different batch id (null)
Skipping http://www.apache.org/licenses/; different batch id (null)
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Skipping http://www.apache.org/security/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
QueueFeeder finished: total 11 records. Hit by time limit :0
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://www.apache.org/foundation/sponsorship.html
fetching http://code.google.com/p/crawler-commons/
fetching http://www.apache.org/security/
7/10 spinwaiting/active, 5 pages, 0 errors, 1.0 1.0 pages/s, 169 169 kb/s, 3 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831574525
  now           = 1445831574814
  0. http://www.apache.org/foundation/thanks.html
  1. http://www.apache.org/licenses/
  2. http://www.apache.org/
fetching http://www.apache.org/foundation/thanks.html
8/10 spinwaiting/active, 7 pages, 0 errors, 0.7 0.4 pages/s, 113 57 kb/s, 2 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831583211
  now           = 1445831579817
  0. http://www.apache.org/licenses/
  1. http://www.apache.org/
fetching http://www.apache.org/licenses/
8/10 spinwaiting/active, 8 pages, 0 errors, 0.5 0.2 pages/s, 86 31 kb/s, 1 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831587582
  now           = 1445831584820
  0. http://www.apache.org/
fetching http://www.apache.org/
-finishing thread FetcherThread9, activeThreads=8
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread0, activeThreads=7
-finishing thread FetcherThread1, activeThreads=6
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
0/2 spinwaiting/active, 9 pages, 0 errors, 0.5 0.2 pages/s, 84 81 kb/s, 0 URLs in 2 queues
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread8, activeThreads=1
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread6, activeThreads=0
0/0 spinwaiting/active, 11 pages, 2 errors, 0.4 0.4 pages/s, 67 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://accumulosummit.com/; different batch id (null)
Skipping http://www.amazon.com/Cassandra-High-Availability-Robbie-Strickland/dp/1783989122; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping http://www.datastax.com/dev/blog/2012-in-review-performance; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html; different batch id (null)
Skipping http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping http://getbootstrap.com/; different batch id (null)
Skipping https://github.com/apache/accumulo; different batch id (null)
Skipping http://glyphicons.com/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping http://research.google.com/archive/bigtable.html; different batch id (null)
Skipping https://www.linkedin.com/groups/Apache-Accumulo-Professionals-4554913; different batch id (null)
Skipping http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/; different batch id (null)
Skipping http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/; different batch id (null)
Skipping http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html; different batch id (null)
Skipping https://twitter.com/apacheaccumulo; different batch id (null)
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Skipping http://fontawesome.io/; different batch id (null)
Skipping http://freenode.net/; different batch id (null)
Skipping http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra; different batch id (null)
Skipping http://www.slideshare.net/daveconnors/cassandra-puppet-scaling-data-at-15-per-month; different batch id (null)
Skipping http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376; different batch id (null)
Skipping http://www.slideshare.net/jbellis; different batch id (null)
Skipping http://www.slideshare.net/jbellis/cassandra-at-nosql-matters-2012; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/3-mohit-anchlia; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/nyc-tech-day-using-cassandra-for-dvr-scheduling-at-comcast; different batch id (null)
Skipping http://www.slideshare.net/slideshow/embed_code/15832310; different batch id (null)
Parsing http://accumulo.apache.org/
Skipping http://accumulo.apache.org/1.5/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.5/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.5/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.6/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.6/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.6/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.7/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.7/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.7/examples; different batch id (null)
Skipping http://accumulo.apache.org/bylaws.html; different batch id (null)
Skipping http://accumulo.apache.org/contrib.html; different batch id (null)
Skipping http://accumulo.apache.org/downloads; different batch id (null)
Skipping http://accumulo.apache.org/downloads/; different batch id (null)
Skipping http://accumulo.apache.org/get_involved.html; different batch id (null)
Skipping http://accumulo.apache.org/git.html; different batch id (null)
Skipping http://accumulo.apache.org/glossary.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/consensusBuilding.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/lazyConsensus.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/voting.html; different batch id (null)
Skipping http://accumulo.apache.org/index.html; different batch id (null)
Skipping http://accumulo.apache.org/mailing_list.html; different batch id (null)
Skipping http://accumulo.apache.org/notable_features.html; different batch id (null)
Skipping http://accumulo.apache.org/old_documentation.html; different batch id (null)
Skipping http://accumulo.apache.org/papers.html; different batch id (null)
Skipping http://accumulo.apache.org/people.html; different batch id (null)
Skipping http://accumulo.apache.org/projects.html; different batch id (null)
Skipping http://accumulo.apache.org/rb.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.5.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.6.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.7.0.html; different batch id (null)
Skipping http://accumulo.apache.org/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/screenshots.html; different batch id (null)
Skipping http://accumulo.apache.org/source.html; different batch id (null)
Skipping http://accumulo.apache.org/verifying_releases.html; different batch id (null)
Skipping http://accumulo.apache.org/versioning.html; different batch id (null)
Parsing http://avro.apache.org/
Skipping http://avro.apache.org/credits.html; different batch id (null)
Skipping http://avro.apache.org/docs/1.6.3; different batch id (null)
Skipping http://avro.apache.org/docs/1.7.7; different batch id (null)
Skipping http://avro.apache.org/docs/current; different batch id (null)
Skipping http://avro.apache.org/docs/current/; different batch id (null)
Skipping http://avro.apache.org/index.html; different batch id (null)
Skipping http://avro.apache.org/irc.html; different batch id (null)
Skipping http://avro.apache.org/issue_tracking.html; different batch id (null)
Skipping http://avro.apache.org/mailing_lists.html; different batch id (null)
Skipping http://avro.apache.org/releases.html; different batch id (null)
Skipping http://avro.apache.org/version_control.html; different batch id (null)
Skipping http://blogs.apache.org/accumulo; different batch id (null)
Skipping https://blogs.apache.org/accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/A-D/view/Accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping http://cassandra.apache.org/download/; different batch id (null)
Skipping http://cassandra.apache.org/privacy.html; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/Index; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://forrest.apache.org/; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/privacy_policy.html; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/accumulo; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://thrift.apache.org/; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/cassandra; different batch id (null)
Skipping http://wiki.apache.org/cassandra/Durability; different batch id (null)
Skipping http://wiki.apache.org/cassandra/FAQ; different batch id (null)
Skipping http://wiki.apache.org/cassandra/GettingStarted; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HintedHandoff; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HowToContribute; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ReadRepair; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ThirdPartySupport; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Parsing http://www.apache.org/
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/policies/conduct.html; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Parsing http://www.apache.org/foundation/sponsorship.html
Parsing http://www.apache.org/foundation/thanks.html
Parsing http://www.apache.org/licenses/
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Parsing http://www.apache.org/security/
Skipping http://zookeeper.apache.org/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://hypertable.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Skipping http://www.planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/analytics-at-github-with-apache-cassandra/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-at-cern-large-hadron-collider/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-used-to-build-scalable-and-highly-available-systems-at-hulu-streaming-content-to-over-5-million-subscribers/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/godaddy-worlds-largest-domain-name-registrar-and-web-host-provider-utilizes-cassandra-for-replication-and-scalability/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/make-it-rain-apache-cassandra-at-the-weather-channel-for-severe-weather-alerts/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/reddit-upvotes-apache-cassandras-horizontal-scaling-managing-17000000-votes-daily/; different batch id (null)
Skipping http://planetcassandra.org/companies/; different batch id (null)
Skipping http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf; different batch id (null)
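
With this much console output it is easy to lose track of how many URLs were actually fetched, parsed, or skipped. The following is a minimal sketch, not part of Nutch itself, that tallies lines by the prefixes seen in the log above. The class name `CrawlLogSummary` and the method `summarize` are hypothetical names chosen for illustration, and the sketch assumes the console output has been copied into a list of strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch class): counts log lines by their leading keyword.
public class CrawlLogSummary {

    // Line prefixes exactly as they appear in the console output above.
    // "fetch of " marks the "fetch of <url> failed with: ..." error lines.
    private static final String[] PREFIXES = {"fetching ", "fetch of ", "Parsing ", "Skipping "};

    /** Returns how many lines start with each known prefix; all other lines are ignored. */
    static Map<String, Integer> summarize(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String prefix : PREFIXES) {
            counts.put(prefix.trim(), 0);
        }
        for (String line : lines) {
            for (String prefix : PREFIXES) {
                if (line.startsWith(prefix)) {
                    counts.merge(prefix.trim(), 1, Integer::sum);
                    break; // a line matches at most one prefix
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the log above, used as sample input.
        List<String> sample = Arrays.asList(
            "fetching http://nutch.apache.org/",
            "fetching http://code.google.com/p/crawler-commons/",
            "fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms",
            "Parsing http://nutch.apache.org/",
            "Skipping http://nutch.apache.org/bot.html; different batch id (null)",
            "FetcherJob: threads: 10");
        System.out.println(summarize(sample));
    }
}
```

In the portion of the log shown here, the number of `fetching` lines should agree with the page count in the final fetcher status line, which gives a quick sanity check before looking at the database.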

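Data inserted into the table

At this point, the import into Eclipse is essentially complete.

From here on, it is a matter of learning the rest step by step.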

 ---------------------------------------------------------------------------------

Another, simpler approach

File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location >

URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.7/

Select the URL > Finish. The New Project wizard opens; choose Java Project > Next,

Enter Project name: nutch1.7 > Finish

Set up the environment

In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path...
> select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources
Switch to the Libraries tab >
Add Class Folder... > select nutch1.7/conf
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab > select conf > Top > OK

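Finally: in the Package Explorer on the left, right-click the build.xml file under the nutch1.7 folder > Run As > Ant Build (then wait for it to finish)
In the Package Explorer on the left, right-click the nutch1.7 folder > Refresh
In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build >
Wait for it to finish

OK, the whole project is now imported, with no red error markers.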

Original article: https://www.cnblogs.com/hwaggLee/p/4910931.html