Building Nutch 2.2.1 with Ant on Ubuntu & Configuring Nutch 2.2.1

/×××××××××××××××××××××××××××××××××××××××××/

Author: xxx0624

HomePage: http://www.cnblogs.com/xxx0624/

/×××××××××××××××××××××××××××××××××××××××××/

Building Nutch 2.x with Ant

See: 1. http://blog.javachen.com/2014/05/20/nutch-intro/

     2. wiki.apache.org/nutch/Nutch2Tutorial

     3. http://duguyiren3476.iteye.com/blog/2085973 (the build procedure follows this post)

Prerequisite: Ant is already set up (http://www.cnblogs.com/xxx0624/p/4172277.html)

1. Download Nutch (for example, I used apache-nutch-2.2.1-src.tar.gz)

Extract the archive, rename the extracted folder to nutch, then move it under /home
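
The extract, rename and move step can be sketched as one shell function. This is only a sketch: the name unpack_nutch is made up for illustration, and it assumes the archive contains exactly one top-level folder.

```shell
# Hypothetical helper (not part of Nutch): extract a Nutch source tarball
# and place it at <dest_parent>/nutch.
# Assumption: the archive contains exactly one top-level folder.
unpack_nutch() {
  tarball=$1
  dest_parent=$2
  # Refuse to overwrite or nest into an existing nutch folder.
  [ ! -e "$dest_parent/nutch" ] || return 1
  workdir=$(mktemp -d) || return 1
  # Extract into a scratch directory first so the folder can be renamed safely.
  tar -xzf "$tarball" -C "$workdir" || return 1
  # Pick up the single top-level folder created by the archive.
  extracted=$(find "$workdir" -mindepth 1 -maxdepth 1 -type d | head -n 1)
  [ -n "$extracted" ] || return 1
  # Rename to "nutch" while moving it under the target parent (e.g. /home).
  mv "$extracted" "$dest_parent/nutch" || return 1
  rmdir "$workdir"
}

# Usage (writing to /home normally requires root privileges):
#   unpack_nutch apache-nutch-2.2.1-src.tar.gz /home
```

Extracting into a scratch directory first means a failed extraction never leaves a half-written /home/nutch behind.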

2. Build Nutch

  2.1 Preparation

  (1) The build may fail later with this error

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

  Cause: Nutch is missing the corresponding jar.

  Fix:

    (1) Download sonar-ant-task-2.1.jar and put it directly in the nutch folder

    (2) Edit build.xml so that it picks up the new jar

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
    <classpath path="${ant.library.dir}" />
    <classpath path="${mysql.library.dir}" />
    <classpath><fileset dir="." includes="sonar*.jar" /></classpath> 
</taskdef>
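
Before starting the build, both parts of this fix can be checked with a short script. The following is a sketch; the function name check_sonar_setup is hypothetical, and it assumes the jar sits in the same folder as build.xml, matching the dir="." fileset above.

```shell
# Hypothetical pre-build check (not part of Nutch or Ant).
# Argument: the nutch folder that contains build.xml.
check_sonar_setup() {
  nutch_dir=$1
  # The fileset above uses dir="." with includes="sonar*.jar",
  # so a matching jar is expected directly inside the nutch folder.
  found=$(find "$nutch_dir" -maxdepth 1 -name 'sonar*.jar' | head -n 1)
  if [ -z "$found" ]; then
    echo "no sonar*.jar in $nutch_dir" >&2
    return 1
  fi
  # build.xml must contain the fileset that puts the jar on the taskdef classpath.
  if ! grep -q 'includes="sonar\*.jar"' "$nutch_dir/build.xml"; then
    echo "build.xml does not reference sonar*.jar" >&2
    return 1
  fi
  echo "sonar task setup looks complete: $found"
}

# Usage:
#   check_sonar_setup /home/nutch
```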

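  (2) The build takes too long:

  Nutch is built with Ivy, which downloads its dependencies during the build, so the build can take a long time. If it takes too long, the following change can help.

  Edit ivy/ivysettings.xml and use

http://mirrors.ibiblio.org/maven2/

in place of:

http://repo1.maven.org/maven2/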
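
The URL swap described above can also be done with sed rather than by hand. A minimal sketch, assuming the repo1 URL appears literally in the file; the function name use_ibiblio_mirror is made up for illustration.

```shell
# Hypothetical helper: point Ivy at the ibiblio mirror.
# Keeps the untouched file as <settings>.orig so the change can be reverted.
use_ibiblio_mirror() {
  settings=$1
  cp "$settings" "$settings.orig" || return 1
  # '#' is used as the sed delimiter so the slashes in the URLs need no escaping.
  sed 's#http://repo1\.maven\.org/maven2/#http://mirrors.ibiblio.org/maven2/#g' \
    "$settings.orig" > "$settings"
}

# Usage, from inside the nutch folder:
#   use_ibiblio_mirror ivy/ivysettings.xml
```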

  2.2 Modify the Nutch configuration

  (1) Edit Nutch's conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

	<property>
		<name>storage.data.store.class</name>
		<value>org.apache.gora.hbase.store.HBaseStore</value>
		<description>Default class for storing data</description>
	</property>
	<property>
		<name>http.agent.name</name>
		<value>xxx0624-ThinkPad-Edge</value>
	</property>
</configuration>

  (2) Edit ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />  <!-- uncomment this line so that it takes effect -->
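
Whether the gora-hbase dependency is really active, and not still inside an XML comment, is easy to get wrong. The sketch below strips comments with awk and then looks for the dependency; the function name gora_hbase_enabled is hypothetical, and the comment handling is a simple text scan, not a full XML parser.

```shell
# Hypothetical check: succeeds only if a gora-hbase dependency appears
# outside of <!-- ... --> comments in the given ivy.xml.
gora_hbase_enabled() {
  awk '
    {
      line = $0
      while (line != "") {
        if (in_comment) {
          close_at = index(line, "-->")
          if (close_at == 0) {
            line = ""                       # comment continues on the next line
          } else {
            line = substr(line, close_at + 3)
            in_comment = 0
          }
        } else {
          open_at = index(line, "<!--")
          if (open_at == 0) {
            visible = visible line          # no comment on the rest of this line
            line = ""
          } else {
            visible = visible substr(line, 1, open_at - 1)
            line = substr(line, open_at + 4)
            in_comment = 1
          }
        }
      }
      visible = visible "\n"
    }
    END { if (visible ~ /name="gora-hbase"/) exit 0; exit 1 }
  ' "$1"
}

# Usage, from inside the nutch folder:
#   gora_hbase_enabled ivy/ivy.xml && echo "gora-hbase is active"
```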

  (3) Edit conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest  
#gora.sqlstore.jdbc.user=sa  
#gora.sqlstore.jdbc.password=
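
To confirm which datastore is active after this edit, the uncommented value can be read back from the file. A small sketch; the function name active_datastore is made up for illustration.

```shell
# Hypothetical helper: print the active (uncommented) value of
# gora.datastore.default from a gora.properties file.
active_datastore() {
  # Lines starting with '#' do not match, so commented-out settings are ignored.
  # Leading spaces before the key and spaces around '=' are tolerated.
  sed -n 's/^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*//p' "$1" \
    | sed 's/[[:space:]]*$//' \
    | tail -n 1
}

# Usage, from inside the nutch folder:
#   active_datastore conf/gora.properties
```

If the key is defined more than once, the sketch reports the last definition.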

  2.3 Build Nutch (the build must be run from inside the nutch folder)

cd nutch
ant
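
After the build finishes, a quick check for the launcher scripts shows whether it produced what the later steps rely on. This is a sketch; the function name build_ok is hypothetical, and the log file name in the usage comment is only an example.

```shell
# Hypothetical post-build check: confirm the launcher scripts exist.
# Argument: the nutch folder in which ant was run.
build_ok() {
  nutch_dir=$1
  for script in nutch crawl; do
    if [ ! -f "$nutch_dir/runtime/local/bin/$script" ]; then
      echo "missing runtime/local/bin/$script" >&2
      return 1
    fi
  done
  echo "build output found in $nutch_dir/runtime/local/bin"
}

# Usage (build.log is just an example file name):
#   cd /home/nutch && ant 2>&1 | tee build.log
#   build_ok /home/nutch
```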

  2.4 Folder layout after the build:

.
├── build
├── build.xml
├── build.xml~
├── CHANGES.txt
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 8 files

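3. Modify the Nutch configuration files (all of this was already done in step 2)

Nutch 2.x stores its data through Gora, which can access Cassandra, HBase, Accumulo, Avro and others, so the Gora properties have to be specified in the configuration files.

  3.1 Edit conf/nutch-site.xml (done in step 2)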

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

  3.2 Edit ivy/ivy.xml (done in step 2)

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

  3.3 Edit conf/gora.properties (done in step 2)

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

/*****************************************************************************************************************************/

Configuring Nutch

(The nutch folder is already under /home)

1. Edit the system environment variables

sudo gedit /etc/profile

// add:

#set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
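
The same change can be scripted so that running it twice does not add the line twice. A minimal sketch; the function name add_nutch_to_path is made up for illustration, and writing to /etc/profile requires a root shell.

```shell
# Hypothetical helper: append the Nutch PATH entry to a profile file,
# but only if the identical line is not already there.
add_nutch_to_path() {
  profile=$1
  line='export PATH=/home/nutch/runtime/local/bin:$PATH'
  # -F: fixed string, -x: whole-line match, so only an identical line counts.
  if grep -qxF "$line" "$profile" 2>/dev/null; then
    return 0
  fi
  printf '%s\n%s\n' '#set nutch' "$line" >> "$profile"
}

# Usage (in a root shell, since /etc/profile is not user-writable):
#   add_nutch_to_path /etc/profile
# Then reload it in the current shell:
#   . /etc/profile
```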

2. Test (the nutch and crawl scripts in nutch/runtime/local/bin)

nutch

// output:
Usage: nutch COMMAND
where COMMAND is one of:
 inject		inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate 	generate new batches to fetch from crawl db
 fetch 		fetch URLs marked during generate
 parse 		parse URLs marked during fetch
 updatedb 	update web table after parsing
 updatehostdb   update host table after parsing
 readdb 	read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex 	run the solr indexer on parsed batches
 solrdedup 	remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin 	load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit         	runs the given JUnit test
 or
 CLASSNAME 	run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

crawl

// output:
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
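
The usage line shows that crawl expects a seed folder as its first argument. The sketch below prepares one, with one URL per line in a text file. The function name make_seed_dir, the file name seed.txt, and every value in the usage comment (seed URL, crawl ID, Solr URL, number of rounds) are placeholders chosen for illustration, not values from this post.

```shell
# Hypothetical helper: create a seed folder holding one URL per line.
# Arguments: <seed_dir> <url> [<url> ...]
make_seed_dir() {
  seed_dir=$1
  shift
  # At least one seed URL is required.
  [ "$#" -gt 0 ] || return 1
  mkdir -p "$seed_dir" || return 1
  # printf repeats the format for every remaining argument: one URL per line.
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Usage (all values are placeholders):
#   make_seed_dir urls http://nutch.apache.org/
#   crawl urls testCrawl http://localhost:8983/solr 2
```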

  Using Nutch together with HBase brings up further errors, so those are recorded in a separate post: Integrating Nutch with HBase (http://www.cnblogs.com/xxx0624/p/4176199.html)