Heritrix 3.1.0 Source Code Analysis (Part 18)

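Starting with this article, we analyze the source code for the processors in Heritrix 3.1.0. In Heritrix, every pending CrawlURI object (curi) passes through a series of processors, one after another, before its handling is complete.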

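There are many processors, so in addition to the inheritance hierarchy of the processors themselves, the system groups processors with similar functions into the same processor chain.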

Logically, Heritrix 3.1.0 can be viewed as two major processor chains: FetchChain and DispositionChain (CandidateChain logically belongs to DispositionChain).

Let's first look at the UML diagram for processor chains and processors.

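The diagram above is the static class diagram. A ProcessorChain maintains a collection of Processor objects, and ProcessorChain implements the Iterable<E> interface (as Iterable<Processor>).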

I said above that, at runtime, the system is logically two major processor chains (FetchChain and DispositionChain, with CandidateChain logically belonging to DispositionChain). Let me explain.

The FetchChain (org.archive.modules.FetchChain) contains the following processors (seed URLs are handled slightly differently, which a later article will cover):

org.archive.crawler.prefetch.Preselector
org.archive.crawler.prefetch.PreconditionEnforcer
org.archive.modules.fetcher.FetchDNS
org.archive.modules.fetcher.FetchHTTP
org.archive.modules.extractor.ExtractorHTTP
org.archive.modules.extractor.ExtractorHTML
org.archive.modules.extractor.ExtractorCSS
org.archive.modules.extractor.ExtractorJS
org.archive.modules.extractor.ExtractorSWF

The DispositionChain (org.archive.modules.DispositionChain) contains the following processors:

org.archive.modules.writer.MyWriterProcessor
org.archive.crawler.postprocessor.CandidatesProcessor
org.archive.crawler.postprocessor.DispositionProcessor

At runtime, the CandidatesProcessor in turn corresponds to the CandidateChain (org.archive.modules.CandidateChain), which contains the following processors:

org.archive.crawler.prefetch.CandidateScoper
org.archive.crawler.prefetch.FrontierPreparer

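If we switch to a dynamic diagram of the objects actually running in the system, we can see an imperfect combination of the Composite pattern and the Iterator pattern. Why do I call it imperfect?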

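Because ProcessorChain and Processor do not implement a common interface. Both expose a process method, but with different signatures, whereas in a true Composite the branch nodes and leaf nodes share the same operations.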
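For contrast, here is a minimal sketch of what a textbook Composite would look like, with the chain and the processor sharing one interface. This is not Heritrix code: `Step`, `LeafStep`, and `StepChain` are hypothetical names, and a plain `StringBuilder` stands in for CrawlURI.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: none of these types exist in Heritrix.
class CompositeSketch {

    // The single operation shared by leaves and composites.
    interface Step {
        void process(StringBuilder uri);
    }

    // Leaf: does one unit of work (here it just appends its name).
    static class LeafStep implements Step {
        private final String name;

        LeafStep(String name) {
            this.name = name;
        }

        @Override
        public void process(StringBuilder uri) {
            uri.append('[').append(name).append(']');
        }
    }

    // Composite: holds child steps, is itself a Step, and is Iterable.
    static class StepChain implements Step, Iterable<Step> {
        private final List<Step> children = new ArrayList<>();

        StepChain add(Step s) {
            children.add(s);
            return this;
        }

        @Override
        public Iterator<Step> iterator() {
            return children.iterator();
        }

        @Override
        public void process(StringBuilder uri) {
            for (Step s : this) {
                s.process(uri);
            }
        }
    }

    // Builds an outer chain that nests an inner chain and runs it once.
    static String run() {
        StepChain inner = new StepChain()
                .add(new LeafStep("scope"))
                .add(new LeafStep("prepare"));
        StepChain outer = new StepChain()
                .add(new LeafStep("write"))
                .add(inner)
                .add(new LeafStep("dispose"));
        StringBuilder uri = new StringBuilder();
        outer.process(uri);
        return uri.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because `StepChain` is itself a `Step`, a chain can be nested inside another chain without any wrapper object. In this sketch the nesting plays the role that the CandidatesProcessor plays between the DispositionChain and the CandidateChain.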

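Let's first get familiar with the methods of ProcessorChain.

The class implements the Iterable<Processor> interface and overrides iterator() to iterate over the collection of processors it maintains. The relevant methods are as follows: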

    // Holds the chain's properties, including the processor list.
    KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    // Number of processors in this chain.
    public int size() {
        return getProcessors().size();
    }

    // Lets callers use for-each directly on the chain.
    public Iterator<Processor> iterator() {
        return getProcessors().iterator();
    }

    // The processor list is stored in kp under the key "processors".
    @SuppressWarnings("unchecked")
    public List<Processor> getProcessors() {
        return (List<Processor>) kp.get("processors");
    }
    public void setProcessors(List<Processor> processors) {
        kp.put("processors",processors);
    }
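To see the Iterable part in isolation, here is a small self-contained sketch. It makes simplifying assumptions: a `HashMap` plays the role of KeyedProperties, `String` plays the role of Processor, and the class name `IterableChainSketch` is made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified stand-in: a HashMap plays the role of KeyedProperties and
// String plays the role of Processor. Not the Heritrix implementation.
class IterableChainSketch implements Iterable<String> {

    private final Map<String, Object> kp = new HashMap<>();

    @SuppressWarnings("unchecked")
    List<String> getProcessors() {
        return (List<String>) kp.get("processors");
    }

    void setProcessors(List<String> processors) {
        kp.put("processors", processors);
    }

    int size() {
        return getProcessors().size();
    }

    @Override
    public Iterator<String> iterator() {
        return getProcessors().iterator();
    }

    // Walks the chain with for-each and records what was visited.
    static List<String> visitAll() {
        IterableChainSketch chain = new IterableChainSketch();
        chain.setProcessors(List.of("preselector", "fetchHttp", "extractorHtml"));
        List<String> seen = new ArrayList<>();
        for (String name : chain) { // for-each works because of Iterable
            seen.add(name);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(visitAll());
    }
}
```

In this sketch, calling size() or iterator() before setProcessors() throws a NullPointerException, because the map has no "processors" entry yet.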

The most important method is void process(CrawlURI curi, ChainStatusReceiver thread), which iterates over the processors and calls each one's process method on the CrawlURI object.

public void process(CrawlURI curi, ChainStatusReceiver thread) throws InterruptedException {
    assert KeyedProperties.overridesActiveFrom(curi);
    // Bean name of the processor to skip forward to after a JUMP;
    // null when no jump is pending.
    String skipToProc = null;

    ploop: for (Processor curProc : this) {
        if (skipToProc != null && !curProc.getBeanName().equals(skipToProc)) {
            // A jump is pending and this is not the target: skip it.
            continue;
        } else {
            skipToProc = null;
        }
        if (thread != null) {
            // Tell the receiver which processor is about to run.
            thread.atProcessor(curProc);
        }
        ArchiveUtils.continueCheck();

        ProcessResult pr = curProc.process(curi);
        switch (pr.getProcessStatus()) {
            case PROCEED:
                continue;
            case FINISH:
                break ploop;
            case JUMP:
                skipToProc = pr.getJumpTarget();
                continue;
        }
    }
}
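The PROCEED / FINISH / JUMP control flow is easier to follow with a toy model that can be run on its own. The sketch below copies the loop structure from the method above, but `Proc`, `Result`, and `Status` are simplified stand-ins rather than the Heritrix classes, and a `String` stands in for CrawlURI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of the loop above. Proc, Result and Status are simplified
// stand-ins, not the Heritrix classes; a String stands in for CrawlURI.
class ChainLoopSketch {

    enum Status { PROCEED, FINISH, JUMP }

    static class Result {
        final Status status;
        final String jumpTarget; // only meaningful for JUMP

        Result(Status status, String jumpTarget) {
            this.status = status;
            this.jumpTarget = jumpTarget;
        }
    }

    static class Proc {
        final String beanName;
        final Function<String, Result> body;

        Proc(String beanName, Function<String, Result> body) {
            this.beanName = beanName;
            this.body = body;
        }
    }

    // Same control flow as the method above; returns the names actually run.
    static List<String> process(List<Proc> chain, String uri) {
        List<String> visited = new ArrayList<>();
        String skipToProc = null;
        ploop:
        for (Proc cur : chain) {
            if (skipToProc != null && !cur.beanName.equals(skipToProc)) {
                continue;          // still skipping toward the jump target
            } else {
                skipToProc = null; // target reached, or no jump pending
            }
            visited.add(cur.beanName);
            Result r = cur.body.apply(uri);
            switch (r.status) {
                case PROCEED:
                    continue;
                case FINISH:
                    break ploop;
                case JUMP:
                    skipToProc = r.jumpTarget;
                    continue;
            }
        }
        return visited;
    }

    // "a" jumps to "c", so "b" is skipped; "c" finishes, so "d" never runs.
    static List<String> demo() {
        Result proceed = new Result(Status.PROCEED, null);
        List<Proc> chain = List.of(
                new Proc("a", u -> new Result(Status.JUMP, "c")),
                new Proc("b", u -> proceed),
                new Proc("c", u -> new Result(Status.FINISH, null)),
                new Proc("d", u -> proceed));
        return process(chain, "http://example.com/");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

One consequence of this loop is worth noting: a jump can only go forward. If the jump target names a processor that does not appear later in the chain, every remaining processor is skipped and the loop simply ends.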

The ChainStatusReceiver interface (the thread parameter) is implemented by ToeThread, which receives the void atProcessor(Processor proc) callback.
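The callback can be sketched as follows. This only illustrates the pattern: `StatusReceiver` and `Worker` are hypothetical names, processors are reduced to their names, and what ToeThread actually does inside atProcessor is not shown in this article. Recording the current step is an assumption made for the example.

```java
import java.util.List;

// Hypothetical receiver; ToeThread's real implementation is not shown here.
class StatusReceiverSketch {

    interface StatusReceiver {
        void atProcessor(String processorName);
    }

    // Remembers the most recent processor it was told about.
    static class Worker implements StatusReceiver {
        volatile String currentStep = "-";

        @Override
        public void atProcessor(String processorName) {
            currentStep = processorName;
        }
    }

    // Notifies the receiver before each step; a null receiver is allowed.
    static void runChain(List<String> processorNames, StatusReceiver receiver) {
        for (String name : processorNames) {
            if (receiver != null) {
                receiver.atProcessor(name);
            }
            // The processor's real work would run here.
        }
    }

    static String demo() {
        Worker worker = new Worker();
        runChain(List.of("fetchDns", "fetchHttp"), worker);
        runChain(List.of("ignored"), null); // no receiver, no callback
        return worker.currentStep;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```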

ProcessResult pr is the processing result. It wraps an enum type with three values:

public enum ProcessStatus {
    /**
     * The URI was processed normally, and no special action needs to 
     * be taken by the framework.
     */
    PROCEED,

    /**
     * The Processor believes that the ProcessorURI is invalid, or 
     * otherwise incapable of further processing at this time. The 
     * chain should skip subsequent processors, returning the URI.
     */
    FINISH,

    /**
     * The Processor has specified the next processor for the URI.  The 
     * chain should skip forward to that processor instead of the regularly
     * scheduled next processor.
     */
    JUMP,
}
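To show how a processor might hand one of these statuses back to the chain, here is a simplified result type. The accessor names mirror getProcessStatus() and getJumpTarget() used in the loop above; everything else (the shared constants, the jump factory, and the decide method with its scheme checks) is an assumption made for illustration, not the Heritrix ProcessResult class.

```java
import java.util.Objects;

// Simplified stand-in for a result type; not the Heritrix ProcessResult.
class ProcessResultSketch {

    enum ProcessStatus { PROCEED, FINISH, JUMP }

    static final class Result {
        // PROCEED and FINISH carry no data, so shared instances suffice.
        static final Result PROCEED = new Result(ProcessStatus.PROCEED, null);
        static final Result FINISH = new Result(ProcessStatus.FINISH, null);

        private final ProcessStatus status;
        private final String jumpTarget;

        private Result(ProcessStatus status, String jumpTarget) {
            this.status = status;
            this.jumpTarget = jumpTarget;
        }

        // JUMP needs a target name, so it gets a factory method.
        static Result jump(String target) {
            return new Result(ProcessStatus.JUMP, Objects.requireNonNull(target));
        }

        ProcessStatus getProcessStatus() {
            return status;
        }

        String getJumpTarget() {
            return jumpTarget;
        }
    }

    // Hypothetical processor logic: choose a result from the URI's scheme.
    static Result decide(String uri) {
        if (uri.startsWith("dns:")) {
            return Result.jump("fetchDns");
        }
        if (!uri.startsWith("http")) {
            return Result.FINISH;
        }
        return Result.PROCEED;
    }

    public static void main(String[] args) {
        System.out.println(decide("http://example.com/").getProcessStatus());
        System.out.println(decide("ftp://example.com/").getProcessStatus());
        System.out.println(decide("dns:example.com").getJumpTarget());
    }
}
```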

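ProcessorChain has three subclasses: FetchChain, DispositionChain, and CandidateChain.

None of the three overrides any method. Heritrix 3.1.0 presumably defines them only to group the processors held by a ProcessorChain logically.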
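Whatever the designers' actual motive, one practical effect of giving each grouping its own type is that code can ask for a chain by class. The sketch below illustrates that idea with a tiny registry; `Chain`, `Fetch`, `Disposition`, `Candidate`, and `Registry` are hypothetical names, and the registry is an assumption for this example, not a description of how Heritrix wires its chains.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: empty subclasses act as type tags for lookup.
class ChainTypesSketch {

    static class Chain {
        final List<String> processors;

        Chain(List<String> processors) {
            this.processors = processors;
        }
    }

    // No overrides: each subclass only gives the grouping its own type.
    static class Fetch extends Chain {
        Fetch(List<String> p) { super(p); }
    }

    static class Disposition extends Chain {
        Disposition(List<String> p) { super(p); }
    }

    static class Candidate extends Chain {
        Candidate(List<String> p) { super(p); }
    }

    // A tiny registry keyed by the chain's concrete class.
    static class Registry {
        private final Map<Class<? extends Chain>, Chain> byType = new HashMap<>();

        void register(Chain c) {
            byType.put(c.getClass(), c);
        }

        <T extends Chain> T get(Class<T> type) {
            return type.cast(byType.get(type));
        }
    }

    static Registry build() {
        Registry r = new Registry();
        r.register(new Fetch(List.of("preselector", "fetchHttp")));
        r.register(new Disposition(List.of("writer", "candidates", "disposition")));
        r.register(new Candidate(List.of("candidateScoper", "frontierPreparer")));
        return r;
    }

    public static void main(String[] args) {
        Registry r = build();
        System.out.println(r.get(Fetch.class).processors);
        System.out.println(r.get(Candidate.class).processors);
    }
}
```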

---------------------------------------------------------------------------

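This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: cnblogs (博客园), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036879.html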

Original post: https://www.cnblogs.com/chenying99/p/3036879.html