In-Depth Analysis of the Injector Job  Category: H3_NUTCH  2015-03-10 15:44



The Injector Job creates a table in HBase named after the crawlId and injects the seed URLs from a text file into that table.
(I) Running the Command
1. Run the command
[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06

2. Inspect the table contents
hbase(main):004:0> scan 'sourcetest_webpage'
ROW                                       COLUMN+CELL
 com.163.money:http/                      column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
 com.163.money:http/                      column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
 com.163.money:http/                      column=mk:_injmrk_, timestamp=1425970761871, value=y
 com.163.money:http/                      column=mk:dist, timestamp=1425970761871, value=0
 com.163.money:http/                      column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
 com.163.money:http/                      column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds
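The row key com.163.money:http/ is the seed URL http://money.163.com/ with the host labels reversed and the scheme moved behind the host. The sketch below illustrates that transformation. The class and method names are hypothetical, and it is not Nutch's own TableUtil implementation, only a minimal approximation that reproduces the key shown above.

```java
import java.net.URI;

public class ReverseUrlSketch {
    // Hypothetical helper (not Nutch's TableUtil): builds
    // "<reversed host>:<scheme>[:<port>]<path>[?<query>]".
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        // Append host labels from last to first: money.163.com -> com.163.money
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://money.163.com/")); // com.163.money:http/
    }
}
```

Reversing the host keeps pages from the same domain next to each other in the sorted HBase key space, which is presumably why the key is stored in this form.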

3. Read the contents of the database
Because the HBase table stores values as raw bytes, use the following command to see the decoded contents.
[jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1425970759775
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
metadata _csh_ :        ?锟
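The raw cells from the scan can also be decoded by hand and compared with this dump. The sketch below assumes the store writes primitives as big-endian bytes, which is consistent with the values shown: the bytes of f:fi decode to the fetchInterval, f:ts to the fetchTime, and s:s to the score. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

public class DecodeCellsSketch {
    // ByteBuffer reads big-endian by default, matching the assumed encoding.
    static int decodeInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt();
    }

    static long decodeLong(byte[] raw) {
        return ByteBuffer.wrap(raw).getLong();
    }

    static float decodeFloat(byte[] raw) {
        return ByteBuffer.wrap(raw).getFloat();
    }

    public static void main(String[] args) {
        // f:fi -> 00 27 8D 00 (the printable ' is byte 0x27)
        byte[] fi = {0x00, 0x27, (byte) 0x8D, 0x00};
        // f:ts -> 00 00 01 4C 02 7B 08 5F (L = 0x4C, { = 0x7B, _ = 0x5F)
        byte[] ts = {0x00, 0x00, 0x01, 0x4C, 0x02, 0x7B, 0x08, 0x5F};
        // s:s -> 3F 80 00 00 (? = 0x3F)
        byte[] s = {0x3F, (byte) 0x80, 0x00, 0x00};

        System.out.println(decodeInt(fi));   // 2592000 (fetchInterval, 30 days in seconds)
        System.out.println(decodeLong(ts));  // 1425970759775 (fetchTime in milliseconds)
        System.out.println(decodeFloat(s));  // 1.0 (score)
    }
}
```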


(II) Source Code Walkthrough
Class: org.apache.nutch.crawl.InjectorJob
1. Program entry point
 
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
        args);
    System.exit(res);
  }

2. run(String[] args), invoked through ToolRunner.run
This step mainly calls the inject method; the rest of the method is argument validation.
 
public int run(String[] args) throws Exception {
  …………
    inject(new Path(args[0]));
   …………
  }


3. The inject() method
Nutch runs every concrete job through Map<String, Object> run(Map<String, Object> args): the job takes its arguments as a Map and returns its results as a Map.
public void inject(Path urlDir) throws Exception {
    run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
  }
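To illustrate the Map-in, Map-out convention, here is a minimal self-contained sketch. The toArgMap helper below is a hypothetical stand-in written for this example (it is not the Nutch ToolUtil source), and the key names "seedDir" and "seedDirSeen" are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ArgMapSketch {
    // Hypothetical helper: turns key, value, key, value, ... into a Map.
    static Map<String, Object> toArgMap(Object... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected key/value pairs");
        }
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            map.put(String.valueOf(args[i]), args[i + 1]);
        }
        return map;
    }

    // Hypothetical job: reads its arguments from a Map and reports results in a Map.
    static Map<String, Object> run(Map<String, Object> args) {
        Map<String, Object> results = new HashMap<>();
        results.put("seedDirSeen", args.get("seedDir"));
        return results;
    }

    public static void main(String[] args) {
        Map<String, Object> results = run(toArgMap("seedDir", "seeds/"));
        System.out.println(results.get("seedDirSeen")); // seeds/
    }
}
```

Passing arguments as a Map lets every job share one entry-point signature, so callers do not need a job-specific method for each tool.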





4. Configure the job and create the table in HBase
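public Map<String, Object> run(Map<String, Object> args) throws Exception {
    // Excerpt: the lines that resolve `input` (the seed directory Path)
    // from args are omitted here.

    numJobs = 1;
    currentJobNum = 0;
    currentJob = new NutchJob(getConf(), "inject " + input);
    // Read the seed files under the input directory
    FileInputFormat.addInputPath(currentJob, input);
    currentJob.setMapperClass(UrlMapper.class);
    // The mapper emits <reversed URL, WebPage> pairs
    currentJob.setMapOutputKeyClass(String.class);
    currentJob.setMapOutputValueClass(WebPage.class);
    currentJob.setOutputFormatClass(GoraOutputFormat.class);

    // Create the Gora data store (the HBase table in this setup)
    // and direct the job output to it
    DataStore<String, WebPage> store = StorageUtils.createWebStore(
        currentJob.getConfiguration(), String.class, WebPage.class);
    GoraOutputFormat.setOutput(currentJob, store, true);

    // Map-only job: with zero reduce tasks the reducer class is never run
    currentJob.setReducerClass(Reducer.class);
    currentJob.setNumReduceTasks(0);

    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);
    // Excerpt: the rest of the method, including its return statement,
    // is omitted here.
}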


  


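5. The mapper
Because the Injector Job has no reducer, only the mapper matters here.
The mapper does the following:
(1) Parses each line of the seed file and extracts its parameters.
(2) Normalizes the URL and runs it through the URL filters.
(3) Reverses the URL to use as the key, creates a WebPage object as the value, and writes the pair to the table.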
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim(); // value is line of text

      if (url != null && (url.length() == 0 || url.startsWith("#"))) {
        /* Ignore line that start with # */
        return;
      }

      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      Map<String, String> metadata = new TreeMap<String, String>();
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try {
              customScore = Float.parseFloat(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try {
              customInterval = Integer.parseInt(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else
            metadata.put(metaname, metavalue);
        }
      }
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
      if (url == null) {
        context.getCounter("injector", "urls_filtered").increment(1);
        return;
      } else { // if it passes
        String reversedUrl = TableUtil.reverseUrl(url); // collect it
        WebPage row = WebPage.newBuilder().build();
        row.setFetchTime(curTime);
        row.setFetchInterval(customInterval);

        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          row.getMetadata().put(new Utf8(keymd),
              ByteBuffer.wrap(valuemd.getBytes()));
        }

        if (customScore != -1)
          row.setScore(customScore);
        else
          row.setScore(scoreInjected);

        try {
          scfilters.injectedScore(url, row);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url
                + ", using default (" + e.getMessage() + ")");
          }
        }
        context.getCounter("injector", "urls_injected").increment(1);
        row.getMarkers()
            .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
        Mark.INJECT_MARK.putMark(row, YES_STRING);
        context.write(reversedUrl, row);
      }
    }
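The parsing half of the mapper can be tried in isolation. The sketch below is a simplified, hypothetical re-implementation, not the Nutch code itself: it assumes the metadata names for score and fetch interval are "nutch.score" and "nutch.fetchInterval", and it passes the defaults in directly instead of using the mapper's -1 sentinel.

```java
import java.util.Map;
import java.util.TreeMap;

public class SeedLineSketch {
    // Assumed metadata names; check them against your Nutch version.
    static final String SCORE_KEY = "nutch.score";
    static final String INTERVAL_KEY = "nutch.fetchInterval";

    // Hypothetical holder for one parsed seed line.
    static final class Seed {
        String url;
        float score;
        int interval;
        Map<String, String> metadata = new TreeMap<>();
    }

    // Returns null for blank lines and comment lines starting with '#'.
    static Seed parse(String line, float defaultScore, int defaultInterval) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        String[] parts = trimmed.split("\t");
        Seed seed = new Seed();
        seed.url = parts[0];
        seed.score = defaultScore;
        seed.interval = defaultInterval;
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq == -1) {
                continue; // skip anything that is not name=value
            }
            String name = parts[i].substring(0, eq);
            String value = parts[i].substring(eq + 1);
            try {
                if (name.equals(SCORE_KEY)) {
                    seed.score = Float.parseFloat(value);
                } else if (name.equals(INTERVAL_KEY)) {
                    seed.interval = Integer.parseInt(value);
                } else {
                    seed.metadata.put(name, value);
                }
            } catch (NumberFormatException e) {
                // Malformed number: keep the default, as the mapper does
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        Seed seed = parse(
            "http://money.163.com/\tnutch.score=2.5\tcategory=finance",
            1.0f, 2592000);
        System.out.println(seed.url + " score=" + seed.score
            + " interval=" + seed.interval + " metadata=" + seed.metadata);
    }
}
```

A seed file line for this sketch is therefore a URL optionally followed by tab-separated name=value pairs; any pair that is not a score or interval override is kept as page metadata.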



(III) Key Source Code Study



Original post: https://www.cnblogs.com/lujinhong2/p/4637207.html