Importing Nutch 1.7 into Eclipse


Recommended development environment: Ubuntu + Eclipse (Windows + Cygwin + Eclipse is not recommended)

Step 1: Download
http://archive.apache.org/dist/nutch/
Download both the src and the bin archives from the site above:
wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz'
wget 'http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz'

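Step 2: Extract
tar zxvf apache-nutch-1.7-bin.tar.gz
This extracts a folder named apache-nutch-1.7.
Rename it before extracting the second archive, since both unpack to the same folder name: mv apache-nutch-1.7 apache-nutch-1.7-bin

tar zxvf apache-nutch-1.7-src.tar.gz
This also extracts a folder named apache-nutch-1.7.
Rename it: mv apache-nutch-1.7 apache-nutch-1.7-src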
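
As an alternative to extracting and then renaming, each archive can be unpacked straight into its own folder. This is only a sketch: extract_as is a hypothetical helper name, and it assumes a tar that supports --strip-components (GNU tar and bsdtar both do).

```shell
# extract_as ARCHIVE TARGET (hypothetical helper, not part of Nutch)
# Unpacks ARCHIVE into TARGET and drops the archive's top-level folder
# (apache-nutch-1.7), so no rename is needed afterwards.
extract_as() {
    archive="$1"
    target="$2"
    mkdir -p "$target" || return 1
    tar -xzf "$archive" -C "$target" --strip-components=1
}
```

Usage, assuming both archives are in the current directory:
extract_as apache-nutch-1.7-bin.tar.gz apache-nutch-1.7-bin
extract_as apache-nutch-1.7-src.tar.gz apache-nutch-1.7-src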

Step 3: Combine
Copy all the jars from apache-nutch-1.7-bin/lib into apache-nutch-1.7-src/lib:
cp apache-nutch-1.7-bin/lib/* apache-nutch-1.7-src/lib/
Then overwrite the configuration files in apache-nutch-1.7-src/conf with the ones from apache-nutch-1.7-bin/conf.
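
The conf step is described without a command. One way to do it might look like the sketch below; overlay_conf is a hypothetical helper name, and its two arguments are the bin and src folders created in step 2.

```shell
# overlay_conf BIN_DIR SRC_DIR (hypothetical helper, not part of Nutch)
# Copies everything in BIN_DIR/conf over SRC_DIR/conf, overwriting
# files that have the same name.
overlay_conf() {
    bin_dir="$1"
    src_dir="$2"
    # Stop early if either conf folder is missing.
    if [ ! -d "$bin_dir/conf" ] || [ ! -d "$src_dir/conf" ]; then
        echo "conf folder not found" >&2
        return 1
    fi
    # The trailing "/." copies the folder's contents rather than the folder itself.
    cp -r "$bin_dir/conf/." "$src_dir/conf/"
}
```

Usage: overlay_conf apache-nutch-1.7-bin apache-nutch-1.7-src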


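Step 4: Import into Eclipse
Eclipse: File -- New -- Java Project

This step imports the source code (not an existing project) into Eclipse.
Note: the Eclipse version I used previously offered "import project from source", but this version does not. It only offers "import project from existing project", and all we have is the src files.

Click Next.
Locate the conf folder, then click Add Folder 'conf' to build path.
Set the default output folder to apache-nutch-1.7/bin.

Click Finish.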

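Step 4 (continued): A few small errors
At this point the project shows errors (small red crosses) because some referenced libraries are missing.
Take parse-html as an example:
import org.cyberneko.html.parsers.*;
This import fails because nekohtml-0.9.5.jar is missing.

How to get nekohtml-0.9.5.jar:
Search for nekohtml under apache-nutch-1.7-bin/plugins to find the jar,
then copy it into the project's lib folder and add it to the build path.

The other errors can be fixed the same way (all the required jars can be found under apache-nutch-1.7-bin/plugins):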

feed
cp apache-nutch-1.7-bin/plugins/feed/rome-0.9.jar apache-nutch-1.7-src/lib/
parse-html
cp apache-nutch-1.7-bin/plugins/parse-html/tagsoup-1.2.1.jar apache-nutch-1.7-src/lib/
cp apache-nutch-1.7-bin/plugins/lib-nekohtml/nekohtml-0.9.5.jar apache-nutch-1.7-src/lib/
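
The "search under plugins" step can also be done from the shell. A minimal sketch, where find_plugin_jar is a hypothetical helper name and the second argument is any part of the jar's file name:

```shell
# find_plugin_jar PLUGINS_DIR NAME_PART (hypothetical helper, not part of Nutch)
# Lists every jar under PLUGINS_DIR whose file name contains NAME_PART.
find_plugin_jar() {
    plugins_dir="$1"
    name_part="$2"
    find "$plugins_dir" -type f -name "*${name_part}*.jar"
}
```

Usage: find_plugin_jar apache-nutch-1.7-bin/plugins nekohtml
Each path it prints can then be copied into apache-nutch-1.7-src/lib/ with cp, as in the commands above.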



After this, the project should no longer show any errors.

Step 5: Test a crawl
1. vim conf/nutch-default.xml    (open the file in vim)
/plugin.folders    (vim search command)
Change the property to:
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

Reason: in the source tree the plugins live under src/plugin, whereas in the binary distribution the plugins folder sits in the root directory.
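
To double-check the edit without reopening the file, the property can be printed from the shell. This is a rough sketch: show_property is a hypothetical helper name, and it assumes the <value> line directly follows the <name> line, as in the snippet above.

```shell
# show_property CONF_FILE PROPERTY_NAME (hypothetical helper, not part of Nutch)
# Prints the <name> line of the property and the line right after it.
# -F treats the pattern as a fixed string, so the dot in the name is literal.
show_property() {
    conf_file="$1"
    prop_name="$2"
    grep -F -A 1 "<name>${prop_name}</name>" "$conf_file"
}
```

Usage: show_property conf/nutch-default.xml plugin.folders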

2. vim conf/nutch-site.xml and add:
<property>
  <name>http.agent.name</name>
  <value>your spider name</value>
</property>

3. Under apache-nutch-1.7-src, create a urls folder and a text file inside it:
mkdir urls
cd urls
vim seed.txt
Write into it: http://www.163.com/
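
The same seed file can be created without an editor. A minimal sketch, where write_seed is a hypothetical helper name:

```shell
# write_seed NUTCH_SRC_DIR SEED_URL (hypothetical helper, not part of Nutch)
# Creates NUTCH_SRC_DIR/urls/seed.txt containing a single seed URL.
# Any existing seed.txt is overwritten.
write_seed() {
    nutch_src="$1"
    seed_url="$2"
    mkdir -p "$nutch_src/urls" || return 1
    printf '%s\n' "$seed_url" > "$nutch_src/urls/seed.txt"
}
```

Usage: write_seed apache-nutch-1.7-src http://www.163.com/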

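4. vim conf/regex-urlfilter.txt
5. Run configuration:

Run output:

The run has now succeeded.
Checking the crawl results:

Statistics: (the large number of unfetched URLs is because Nutch scores URLs and filters out those with a score below 0; this can be changed in nutch-default.xml)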

2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(351)) - Statistics for CrawlDb: crawl/crawldb
2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(354)) - TOTAL urls:    794
2013-09-22 13:17:46,710 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(369)) - retry 0:    794
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(359)) - min score:    0.0
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(363)) - avg score:    0.003186398
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(361)) - max score:    1.007
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 1 (db_unfetched):    750
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    7
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    auto.163.com :    12
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baby.163.com :    1
2013-09-22 13:17:46,711 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    baoxian.163.com :    2
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.163.com :    1
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.culture.163.com :    1
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.ent.163.com :    1
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    bbs.lady.163.com :    1
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    biz.163.com :    1
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    blog.163.com :    2
2013-09-22 13:17:46,712 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    book.163.com :    10
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    50
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    cbachina.163.com :    1
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    club.auto.163.com :    1
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    3
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    data.ent.163.com :    1
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    40
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    discovery.163.com :    1
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    dl.163.com :    1
2013-09-22 13:17:46,713 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ecard.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    edu.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    email.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    31
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    ent.163.com :    10
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    expo.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fashion.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    focus.news.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    fushi.163.com :    1
2013-09-22 13:17:46,714 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    game.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    8
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hea.163.com :    3
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    2
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    history.news.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    home.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    house.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    hr.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    jiu.163.com :    1
2013-09-22 13:17:46,715 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    kf.yxp.163.com :    1
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lady.163.com :    8
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    live.caipiao.163.com :    1
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    2
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    lovegongyi.163.com :    1
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    81
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    media.163.com :    1
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mibao.gm.163.com :    1
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    mobile.163.com :    2
2013-09-22 13:17:46,716 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    money.163.com :    21
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.163.com :    9
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    news.tag.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    19
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pay.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    pic.auto.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    post.news.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    product.auto.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    qiye.163.com :    1
2013-09-22 13:17:46,717 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    quotes.money.163.com :    1
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    reg.163.com :    3
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sports.163.com :    14
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    survey2.163.com :    1
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    45
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    42
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    travel.163.com :    1
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tveasy.blog.163.com :    2
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.163.com :    1
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.money.163.com :    1
2013-09-22 13:17:46,718 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.news.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    v.sports.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vipmail.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    vs.caipiao.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    wangyiyuedu.blog.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    war.news.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    265
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zx.caipiao.163.com :    7
2013-09-22 13:17:46,719 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    3
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 2 (db_fetched):    40
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    2
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    digi.163.com :    1
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    emarketing.biz.163.com :    1
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    gb.corp.163.com :    3
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    help.163.com :    1
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    love.163.com :    1
2013-09-22 13:17:46,720 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    11
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    music.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    newsapp.blog.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    open.yuedu.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    sitemap.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    t.163.com :    1
2013-09-22 13:17:46,721 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    tech.163.com :    2
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    www.163.com :    1
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    yuedu.163.com :    9
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    zz.yc.163.com :    1
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 4 (db_redir_temp):    2
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    3g.163.com :    1
2013-09-22 13:17:46,722 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    m.163.com :    1
2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(368)) - status 5 (db_redir_perm):    2
2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    caipiao.163.com :    1
2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(367)) -    corp.163.com :    1
2013-09-22 13:17:46,723 INFO  crawl.CrawlDbReader (CrawlDbReader.java:processStatJob(374)) - CrawlDb statistics: done
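
As a quick sanity check on the statistics above, the per-status counts should add up to TOTAL urls (750 + 40 + 2 + 2 = 794). The sketch below sums them from a saved copy of the log; sum_status_counts is a hypothetical helper name, and it assumes the log lines have the same format as the output shown here.

```shell
# sum_status_counts (hypothetical helper, not part of Nutch)
# Reads CrawlDbReader statistics from standard input and prints the sum of
# the "status N (db_...)" counts. Per-host lines are ignored because they
# do not contain " - status N (".
sum_status_counts() {
    awk '/ - status [0-9]+ \(/ { sum += $NF } END { print sum + 0 }'
}
```

Usage, assuming the output above was saved to a file named stats.log (a made-up name):
sum_status_counts < stats.log
For the log shown here this prints 794, which matches the TOTAL urls line.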
Original article: https://www.cnblogs.com/i80386/p/3324068.html