Nutch Categorized Search

Environment

Ubuntu 11.10

Tomcat 6.0.35

Nutch 1.2

The approach I came up with for categorized search is to build a separate crawl database for each set of URLs. For example, a vertical search engine for the power industry can be split into news, products, and talent (jobs): build three crawl databases, each with its own list of seed URLs, and then configure URL filter rules to get the desired result.

Below I walk through the implementation step by step.

First, collect the seed URL lists for each category. You can search Baidu by category and compile three lists from the results.

Here are the three lists I put together.

News (file name: newsURL):

http://www.cpnn.com.cn/

http://news.bjx.com.cn/

http://www.chinapower.com.cn/news/

Products (file name: productURL):

http://www.powerproduct.com/

http://www.epapi.com/

http://cnc.powerproduct.com/

Talent (file name: talentURL):

http://www.cphr.com.cn/

http://www.ephr.com.cn/

http://www.myepjob.com/

http://www.epjob88.com/

http://hr.bjx.com.cn/

http://www.epjob.com.cn/

http://ep.baidajob.com/

http://www.01hr.com/

Since this is only for testing, I did not add very many addresses.

For vertical search you can no longer crawl with the all-in-one bin/nutch crawl command (bin/nutch crawl urls -dir crawl -depth ... -topN ... -threads ...): that command is intended for one-shot intranet-style crawls and does not support incremental re-crawling. Instead I use an incremental crawl script that others have already written, available at http://wiki.apache.org/nutch/Crawl

Because three crawl databases are needed, the script has to be modified. My crawl databases live in /crawldb/news, /crawldb/product and /crawldb/talent, and the three seed-URL files are placed under the corresponding category directories: /crawldb/news/newsURL, /crawldb/product/productURL and /crawldb/talent/talentURL. The script requires the NUTCH_HOME and CATALINA_HOME environment variables to be set. A minimal setup sketch follows, and my modified crawl script comes after it.
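A minimal one-time setup sketch. The install paths /opt/nutch-1.2 and /opt/apache-tomcat-6.0.35 are my assumptions, and the urls/ subdirectories are also my addition: the crawl script's inject step reads seed URLs from a directory named urls under each crawl root, so the seed files are dropped in there.

# set in your shell profile so the crawl script can find Nutch and Tomcat
export NUTCH_HOME=/opt/nutch-1.2                 # adjust to your Nutch 1.2 install
export CATALINA_HOME=/opt/apache-tomcat-6.0.35   # adjust to your Tomcat install

# create the three crawl roots; inject reads seeds from a "urls" directory
for cat in news product talent; do
  mkdir -p /crawldb/$cat/urls
done

# copy the seed lists compiled above into the matching urls directories
cp newsURL    /crawldb/news/urls/
cp productURL /crawldb/product/urls/
cp talentURL  /crawldb/talent/urls/

With this layout in place, my modified crawl script is as follows.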



#!/bin/bash
############################# Power news: incremental crawl section #############################
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

echo "----- Starting incremental crawl: power news -----"
cd /crawldb/news

depth=5
threads=100
adddays=5
topN=5000 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: ----- Power news incremental crawl done! -----"
echo ""

############################# Power products: incremental crawl section #############################
echo "----- Starting incremental crawl: power products -----"
cd /crawldb/product

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: ----- Power products incremental crawl done! -----"
echo ""

############################# Power talent: incremental crawl section #############################
echo "----- Starting incremental crawl: power talent -----"
cd /crawldb/talent

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
${CATALINA_HOME}/bin/shutdown.sh
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index
${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: ----- Power talent incremental crawl done! -----"
echo ""

Copy the script above onto your Linux machine and make it executable with chmod 755.
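For example (the file name /crawldb/runbot.sh and the nightly cron schedule below are my own illustrations, not from the original post):

chmod 755 /crawldb/runbot.sh
/crawldb/runbot.sh            # normal run: temporary crawl directories are removed
/crawldb/runbot.sh safe       # 'safe' run: keeps them for debugging a failed crawl

# optional: re-crawl all three categories every night at 02:00 via cron
# 0 2 * * * /crawldb/runbot.sh >> /var/log/runbot.log 2>&1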

At this point the script still cannot fetch any pages: the URL filter rules must first be configured in $NUTCH_HOME/conf/regex-urlfilter.txt.

My configuration is as follows:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# accept URLs containing query-type characters (the stock filter uses
# '-[?*!@=]' to skip probable queries; here it is '+' so dynamic pages are kept)
+[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip JavaScript files
-.*\.js

# accept only the whitelisted power-industry sites below

+^http://([a-z0-9]*\.)*cpnn.com.cn/
+^http://([a-z0-9]*\.)*cphr.com.cn/
+^http://([a-z0-9]*\.)*powerproduct.com/
+^http://([a-z0-9]*\.)*bjx.com.cn/
+^http://([a-z0-9]*\.)*renhe.cn/
+^http://([a-z0-9]*\.)*chinapower.com.cn/
+^http://([a-z0-9]*\.)*ephr.com.cn/
+^http://([a-z0-9]*\.)*epapi.com/
+^http://([a-z0-9]*\.)*myepjob.com/
+^http://([a-z0-9]*\.)*epjob88.com/
+^http://([a-z0-9]*\.)*xindianli.com/
+^http://([a-z0-9]*\.)*epjob.com.cn/
+^http://([a-z0-9]*\.)*baidajob.com/
+^http://([a-z0-9]*\.)*01hr.com/
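Before launching a long crawl, the rules can be sanity-checked. Nutch 1.x ships a URLFilterChecker helper class that reads URLs from stdin and prints '+' for accepted and '-' for rejected ones; the snippet below is a sketch assuming that class is available in your build (the test URLs are arbitrary examples of mine):

# feed a couple of URLs through all configured URL filters
printf '%s\n' \
  "http://news.bjx.com.cn/some-article.html" \
  "http://www.example.com/should-be-rejected.html" |
  $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined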

Next, configure $NUTCH_HOME/conf/nutch-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>just a test</value>
    <description>Test</description>
  </property>
</configuration>

If all of the steps above succeeded, you can now crawl with the script. Pay attention to where your crawl data is stored, and change the corresponding paths in the crawl script to match your own directory structure.

Once crawling has finished, the next step is to set up the search environment.

Copy the war package from the Nutch directory into Tomcat's webapps directory and let Tomcat unpack it. Delete the existing contents of the ROOT directory, copy the contents of the freshly unpacked directory into it, and modify WEB-INF/classes/nutch-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/crawldb/news/crawl</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>tangmiSpider</value>
    <description>My Search Engine</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

The value of searcher.dir is the directory where your crawl data is stored; change it to match your setup. Then create two more directories under webapps, talent and product, copy the contents of the unpacked directory into each of them, and edit each one's WEB-INF/classes/nutch-site.xml so that searcher.dir points to /crawldb/talent/crawl and /crawldb/product/crawl respectively (a deployment sketch is given below). With that, categorized search is ready: to search a given category, simply open the corresponding webapp's URL.
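A sketch of those deployment steps. The war file name nutch-1.2.war, the Tomcat paths, and the use of ROOT for the news category follow the description above; unpacking the war by hand with unzip is my own shortcut so Tomcat does not have to expand it first.

$CATALINA_HOME/bin/shutdown.sh        # stop Tomcat while swapping webapps

cd $CATALINA_HOME/webapps
cp $NUTCH_HOME/nutch-1.2.war .
mkdir -p nutch-1.2 && unzip -q nutch-1.2.war -d nutch-1.2

# news search lives in ROOT; the other two categories get their own webapps
rm -rf ROOT/* && cp -r nutch-1.2/* ROOT/
mkdir -p product talent
cp -r nutch-1.2/* product/
cp -r nutch-1.2/* talent/

# then edit WEB-INF/classes/nutch-site.xml in each webapp so that
# searcher.dir points at the right crawl data:
#   ROOT    -> /crawldb/news/crawl
#   product -> /crawldb/product/crawl
#   talent  -> /crawldb/talent/crawl

$CATALINA_HOME/bin/startup.sh

With Tomcat's default port, news would then be searched at http://localhost:8080/, products at http://localhost:8080/product/ and talent at http://localhost:8080/talent/.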

My result pages (screenshots omitted).


Original article: https://www.cnblogs.com/fengfengqingqingyangyang/p/3111185.html