Nutch: writing an IndexingFilter plugin

Reference: http://blog.csdn.net/amuseme_lu/article/details/6780244


1 Create a package structure similar to that of urlfilter-regex

How to generate the code path: http://www.cnblogs.com/i80386/archive/2012/09/04/2670670.html


2 Write the IndexingFilter implementation

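// Package name taken from the class attribute in plugin.xml (step 4).
package org.apache.nutch.indexer.myfield;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {

    public static final Log LOG = LogFactory.getLog(MyIndexingFilter.class);
    private Configuration conf;

    // Declare the "mt" field as stored and tokenized in the index.
    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions("mt", LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }

    private NutchDocument addMyField(NutchDocument doc) {
        System.out.println("银河系");  // debug output
        String value = "银河系";
        // A fixed value is used here; in practice the target field
        // should be extracted from the HTML.
        doc.add("mt", value);
        return doc;
    }

    // Called once per document at indexing time.
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        addMyField(doc);
        return doc;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}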
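
The filter above indexes a fixed value, and its comment notes that the real value should come from the HTML. As a minimal sketch of that extraction step, the helper below pulls the content of a <meta> tag out of raw HTML with a regular expression. MetaFieldExtractor and extractMeta are hypothetical names, not part of Nutch, and the sketch assumes the name attribute comes before content and that both are double-quoted.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): reads <meta name="..." content="..."> from raw HTML.
class MetaFieldExtractor {

    // Returns the content attribute of the first matching <meta> tag, if any.
    // Assumption: name precedes content and both attributes use double quotes.
    static Optional<String> extractMeta(String html, String name) {
        Pattern p = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"mt\" content=\"galaxy\"></head>"
                + "<body></body></html>";
        // Prints the extracted value, or "(none)" when the tag is missing.
        System.out.println(extractMeta(html, "mt").orElse("(none)"));
    }
}
```

Inside filter(), the extracted value would replace the hard-coded string passed to doc.add("mt", value). How the raw HTML or parsed metadata is obtained from the Parse argument depends on the Nutch version and is not shown here.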

3 Build the jar (build fat jar)

4 Create plugin.xml

<plugin
   id="index-myfield"
   name="my Indexing Filter"
   version="1.0.0"
   provider-name="nutch.org">


   <runtime>
      <library name="myfield.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.indexer.myfield"
              name="Nutch My Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="MyIndexingFilter"
                      class="org.apache.nutch.indexer.myfield.MyIndexingFilter"/>
   </extension>

</plugin>
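
A typo in plugin.xml (wrong extension point, wrong class name) is easy to miss, so it can help to read the descriptor back before deploying. The sketch below is only a sanity check using the JDK's XML parser; PluginXmlCheck is a hypothetical helper and does not reproduce how Nutch itself loads plugins. The XML string is a trimmed copy of the descriptor above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sanity check (not part of Nutch): reads key attributes from a plugin.xml string.
class PluginXmlCheck {

    // Returns { plugin id, extension point, implementation class } of the first extension.
    static String[] describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element plugin = doc.getDocumentElement();
            Element ext = (Element) plugin.getElementsByTagName("extension").item(0);
            Element impl = (Element) ext.getElementsByTagName("implementation").item(0);
            return new String[] {
                plugin.getAttribute("id"),
                ext.getAttribute("point"),
                impl.getAttribute("class")
            };
        } catch (Exception e) {
            throw new IllegalStateException("could not read plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<plugin id=\"index-myfield\" name=\"my Indexing Filter\""
                + " version=\"1.0.0\" provider-name=\"nutch.org\">"
                + "<extension id=\"org.apache.nutch.indexer.myfield\""
                + " name=\"Nutch My Indexing Filter\""
                + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
                + "<implementation id=\"MyIndexingFilter\""
                + " class=\"org.apache.nutch.indexer.myfield.MyIndexingFilter\"/>"
                + "</extension></plugin>";
        for (String s : describe(xml)) {
            System.out.println(s);
        }
    }
}
```

The three printed values should agree with the plugin folder name, the IndexingFilter interface, and the fully qualified name of the class written in step 2.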

5 Finally, put the built jar and plugin.xml into the E:\nutch\src\plugin\index-myfield folder

6 Edit conf\nutch-site.xml

<configuration>
    <property>
        <name>searcher.dir</name>
        <value>E:/crawl_2</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor|myfield)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description>Regular expression naming plugin directory names to
        include.  Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin. By
        default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins. In order to use HTTPS please enable
        protocol-httpclient, but be aware of possible intermittent problems with the
        underlying commons-httpclient library.
        </description>
    </property>
</configuration>
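
Since plugin.includes is a regular expression, a quick check that the new plugin id really matches it can save a failed crawl. The sketch below assumes a plugin is included only when its id matches the whole expression, which is my understanding of how Nutch applies this property and is not confirmed by the text above. PluginIncludesCheck is a hypothetical class.

```java
import java.util.regex.Pattern;

// Hypothetical check (not part of Nutch): tests plugin ids against the plugin.includes value.
class PluginIncludesCheck {

    // Same value as in nutch-site.xml above.
    static final String INCLUDES =
            "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
            + "|index-(basic|anchor|myfield)|scoring-opic"
            + "|urlnormalizer-(pass|regex|basic)";

    // Assumption: the id must match the entire expression, not just a substring.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println("index-myfield: " + isIncluded("index-myfield"));
        System.out.println("index-more: " + isIncluded("index-more"));
    }
}
```

With the value above, index-myfield is accepted, while an id that is not listed inside index-(...), such as index-more, is rejected.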

7 Start Nutch

8 Search in Solr

9 The field we need can now be retrieved
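
Steps 8 and 9 amount to querying Solr for the new "mt" field. As a rough sketch, the code below only builds such a query URL. The base URL http://localhost:8983/solr/select is an assumption about the local Solr setup, not something taken from this article, and SolrQueryUrl is a hypothetical helper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr select URL that searches one field for one value.
class SolrQueryUrl {

    // Assumption: Solr's select handler is reachable at baseUrl.
    static String build(String baseUrl, String field, String value) {
        try {
            // URL-encode the "field:value" query so non-ASCII text survives the request.
            String q = URLEncoder.encode(field + ":" + value, "UTF-8");
            return baseUrl + "?q=" + q;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // The value is the fixed string indexed by MyIndexingFilter.
        System.out.println(build("http://localhost:8983/solr/select", "mt", "银河系"));
    }
}
```

Opening the printed URL in a browser (or the Solr admin page with the same query) should list documents whose "mt" field holds the value, provided the field is also defined in the Solr schema.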


Note: if, instead of manually building the jar and putting it in the index-myfield folder, I only edit nutch-site.xml to add index-(basic|anchor|myfield)

Original article: https://www.cnblogs.com/i80386/p/2678466.html