Heritrix 3.1.0 Source Code Analysis (Part 3)

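If we start only from the static logical structure of Heritrix 3.1.0, we tend to miss how the system's objects interact; if we analyze only the dynamic object structure, we lose sight of the system's logical outline.

Source code analysis therefore has to cover both the static and the dynamic view, which makes the logic and the interactions easier to understand. This article takes that approach.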

This article looks at how Spring manages the beans of Heritrix 3.1.0. The previous article already gave a first look at the Spring container's configuration file.

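We first look at how the Spring container loads the configuration file and how it is initialized. When we run the build action of a crawl job,

the void validateConfiguration() method of the CrawlJob object is called: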

/**
     * Does the assembled ApplicationContext self-validate? Any failures
     * are reported as WARNING log events in the job log. 
     * 
     * TODO: make these severe? 
     */
    public synchronized void validateConfiguration() {
        instantiateContainer();
        if(ac==null) {
            // fatal errors already encountered and reported
            return; 
        }
        ac.validate();
        HashMap<String,Errors> allErrors = ac.getAllErrors();
        for(String name : allErrors.keySet()) {
            for(Object err : allErrors.get(name).getAllErrors()) {
               LOGGER.log(Level.WARNING,err.toString());
            }
        }
    }

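The method first loads the Spring configuration file and initializes the Spring container (instantiateContainer()), then validates the container (ac.validate()) and logs every reported error as a WARNING.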
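The collect-then-report pattern of validateConfiguration() can be modeled without Spring. The following is a minimal sketch under stated assumptions: the internals of PathSharingContext.validate() are not shown in this article, so the `SelfValidating` interface, the `ValidationDemo` class and the bean names are hypothetical stand-ins, and plain strings replace Spring's `Errors` objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for a bean that can check its own configuration. */
interface SelfValidating {
    List<String> validate(); // an empty list means "no problems"
}

class ValidationDemo {
    private static final Logger LOGGER = Logger.getLogger("demo.validation");

    /** Ask every bean to validate itself; keep only beans that reported errors. */
    static Map<String, List<String>> collectErrors(Map<String, SelfValidating> beans) {
        Map<String, List<String>> allErrors = new LinkedHashMap<>();
        for (Map.Entry<String, SelfValidating> e : beans.entrySet()) {
            List<String> errors = e.getValue().validate();
            if (!errors.isEmpty()) {
                allErrors.put(e.getKey(), errors);
            }
        }
        return allErrors;
    }

    /** Log each error as a WARNING, as validateConfiguration() does; returns the count. */
    static int reportAsWarnings(Map<String, List<String>> allErrors) {
        int count = 0;
        for (Map.Entry<String, List<String>> e : allErrors.entrySet()) {
            for (String err : e.getValue()) {
                LOGGER.log(Level.WARNING, e.getKey() + ": " + err);
                count++;
            }
        }
        return count;
    }

    /** Two made-up beans: one valid, one with two configuration problems. */
    static Map<String, List<String>> demo() {
        Map<String, SelfValidating> beans = new LinkedHashMap<>();
        beans.put("goodBean", () -> Collections.<String>emptyList());
        beans.put("badBean", () -> Arrays.asList("missing url", "empty name"));
        return collectErrors(beans);
    }

    public static void main(String[] args) {
        System.out.println(reportAsWarnings(demo()) + " warning(s) logged");
    }
}
```

Because the errors are only logged, a job with validation warnings can still exist as an assembled container; that matches the "TODO: make these severe?" note in the original comment.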

/**
     * Can the configuration yield an assembled ApplicationContext? 
     */
    public synchronized void instantiateContainer() {
        checkXML(); 
        if(ac==null) {
            try {
                ac = new PathSharingContext(new String[] {"file:"+primaryConfig.getAbsolutePath()},false,null);
                ac.addApplicationListener(this);
                ac.refresh();
                getCrawlController(); // trigger NoSuchBeanDefinitionException if no CC
                getJobLogger().log(Level.INFO,"Job instantiated");
            } catch (BeansException be) {
                // Calling doTeardown() and therefore ac.close() here sometimes
                // triggers an IllegalStateException and logs stack trace from
                // within spring, even if ac.isActive(). So, just null it.
                ac = null;
                beansException(be);
            }
        }
    }

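The method above loads the configuration file and registers the CrawlJob object as a listener on the container. Note that the context is constructed with refresh set to false, so the listener is in place before refresh() is called explicitly.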

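The Spring container in Heritrix 3.1.0 is a PathSharingContext object, a wrapper provided by the system. The PathSharingContext class extends Spring's FileSystemXmlApplicationContext class, and the configuration file locations are passed in through its constructor: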

public PathSharingContext(String[] configLocations, boolean refresh, ApplicationContext parent) throws BeansException {
        super(configLocations, refresh, parent);
    }
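The second constructor argument matters: instantiateContainer() passes false, registers the listener, and only then calls refresh(). The toy model below sketches why that order is useful. `MiniContext` and `DeferredRefreshDemo` are hypothetical classes written for this illustration, not Spring's or Heritrix's implementation; the only behavior modeled is that listeners registered before a refresh receive its event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy stand-in for an application context (not Spring's implementation). */
class MiniContext {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    MiniContext(boolean refresh) {
        if (refresh) {
            refresh(); // nobody can have registered a listener yet
        }
    }

    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** Notify every listener registered so far. */
    void refresh() {
        for (Consumer<String> listener : listeners) {
            listener.accept("refreshed");
        }
    }
}

class DeferredRefreshDemo {
    /** Returns the events a listener sees for the given construction style. */
    static List<String> eventsSeen(boolean refreshInConstructor) {
        List<String> seen = new ArrayList<>();
        MiniContext ctx = new MiniContext(refreshInConstructor);
        ctx.addListener(seen::add);
        if (!refreshInConstructor) {
            ctx.refresh(); // deferred refresh, as in instantiateContainer()
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("refresh in constructor: " + eventsSeen(true));
        System.out.println("deferred refresh:       " + eventsSeen(false));
    }
}
```

With the refresh done inside the constructor, the listener is added too late and sees nothing; with the deferred refresh it receives the event.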

When we run the launch action of a crawl job, the void launch() method of the CrawlJob object is called:

/**
     * Launch a crawl into 'running' status, assembling if necessary. 
     * 
     * (Note the crawl may have been configured to start in a 'paused'
     * state.) 
     */
    public synchronized void launch() {
        if (isProfile()) {
            throw new IllegalArgumentException("Can't launch profile" + this);
        }
        
        if(isRunning()) {
            getJobLogger().log(Level.SEVERE,"Can't relaunch running job");
            return;
        } else {
            CrawlController cc = getCrawlController();
            if(cc!=null && cc.hasStarted()) {
                getJobLogger().log(Level.SEVERE,"Can't relaunch previously-launched assembled job");
                return;
            }
        }
        
        validateConfiguration();
        if(!hasValidApplicationContext()) {
            getJobLogger().log(Level.SEVERE,"Can't launch problem configuration");
            return;
        }

        //final String job = changeState(j, ACTIVE);
        
        // this temporary thread ensures all crawl-created threads
        // land in the AlertThreadGroup, to assist crawl-wide 
        // logging/alerting
        alertThreadGroup = new AlertThreadGroup(getShortName());
        alertThreadGroup.addLogger(getJobLogger());
        Thread launcher = new Thread(alertThreadGroup, getShortName()+" launchthread") {
            public void run() {
                CrawlController cc = getCrawlController();
                startContext();
                if(cc!=null) {
                    cc.requestCrawlStart();
                }
            }
        };
        getJobLogger().log(Level.INFO,"Job launched");
        scanJobLog();
        launcher.start();
        // look busy (and give startContext/crawlStart a chance)
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            // do nothing
        }
    }
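The comment in launch() says the temporary thread ensures that all crawl-created threads land in the AlertThreadGroup. That relies on a standard JVM rule: a thread created without an explicit group joins the group of the thread that creates it. Here is a minimal sketch of the rule using a plain java.lang.ThreadGroup; `ThreadGroupDemo` and the group name are made up for illustration, and join() replaces the fixed sleep of the original only to keep the example deterministic.

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadGroupDemo {
    /** Starts a launcher thread in a new group; returns the group of a worker it spawns. */
    static String groupOfWorkerStartedFromLauncher(String groupName) {
        ThreadGroup group = new ThreadGroup(groupName);
        AtomicReference<String> workerGroup = new AtomicReference<>();

        Thread launcher = new Thread(group, groupName + " launchthread") {
            @Override
            public void run() {
                // No group is given here, so the worker inherits the launcher's group.
                Thread worker = new Thread(() ->
                        workerGroup.set(Thread.currentThread().getThreadGroup().getName()));
                worker.start();
                waitFor(worker);
            }
        };
        launcher.start();
        waitFor(launcher);
        return workerGroup.get();
    }

    private static void waitFor(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(groupOfWorkerStartedFromLauncher("myjob"));
    }
}
```

Starting the crawl from inside the launcher thread therefore ties every thread created downstream to the job's group, which is what allows logging and alerting to be scoped to one crawl.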

The important call here is void startContext(), made inside the launcher thread's run() method:

/**
     * Start the context, catching and reporting any BeansExceptions.
     */
    protected synchronized void startContext() {
        try {
            ac.start(); 
            
            // job log file covering just this launch
            getJobLogger().removeHandler(currentLaunchJobLogHandler);
            File f = new File(ac.getCurrentLaunchDir(), "job.log");
            currentLaunchJobLogHandler = new FileHandler(f.getAbsolutePath(), true);
            currentLaunchJobLogHandler.setFormatter(new JobLogFormatter());
            getJobLogger().addHandler(currentLaunchJobLogHandler);
            
        } catch (BeansException be) {
            doTeardown();
            beansException(be);
        } catch (Exception e) {
            LOGGER.log(Level.SEVERE,e.getClass().getSimpleName()+": "+e.getMessage(),e);
            try {
                doTeardown();
            } catch (Exception e2) {
                e2.printStackTrace(System.err);
            }        
        }
    }
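Besides starting the context, startContext() swaps the job logger's file handler so that each launch gets its own job.log. The handler swap can be sketched with java.util.logging alone. `LaunchLogDemo`, the logger name and the temporary directories are assumptions made for this example; SimpleFormatter stands in for Heritrix's JobLogFormatter, and closing the old handler is an addition that the original method does not perform.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class LaunchLogDemo {
    private static final Logger JOB_LOGGER = Logger.getLogger("demo.job");
    private static FileHandler currentLaunchHandler; // null before the first launch

    /** Detach the previous launch's handler and attach one for launchDir/job.log. */
    static File attachLaunchLog(File launchDir) {
        try {
            JOB_LOGGER.removeHandler(currentLaunchHandler); // silently ignores null
            if (currentLaunchHandler != null) {
                currentLaunchHandler.close(); // addition: release the old file
            }
            File f = new File(launchDir, "job.log");
            currentLaunchHandler = new FileHandler(f.getAbsolutePath(), true); // append
            currentLaunchHandler.setFormatter(new SimpleFormatter());
            JOB_LOGGER.addHandler(currentLaunchHandler);
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates two launches; returns the contents of both job.log files. */
    static String[] runTwoLaunches() {
        try {
            JOB_LOGGER.setUseParentHandlers(false); // keep the console quiet
            File first = attachLaunchLog(Files.createTempDirectory("launch1").toFile());
            JOB_LOGGER.info("first launch");
            File second = attachLaunchLog(Files.createTempDirectory("launch2").toFile());
            JOB_LOGGER.info("second launch");
            JOB_LOGGER.removeHandler(currentLaunchHandler);
            currentLaunchHandler.close();
            currentLaunchHandler = null;
            return new String[] {
                new String(Files.readAllBytes(first.toPath())),
                new String(Files.readAllBytes(second.toPath()))
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String log : runTwoLaunches()) {
            System.out.print(log);
        }
    }
}
```

Each message ends up only in the file of the launch that was current when it was logged, which is the effect of the removeHandler/addHandler pair in startContext().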

This method calls the start method of the PathSharingContext object:

 @Override
    public void start() {
        initLaunchDir();
        super.start();
    }

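The method above runs the start method of every bean in the Spring container that implements the Lifecycle interface.

The Lifecycle interface declares the following methods, which define the life cycle of a bean component: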

public interface Lifecycle {

    /**
     * Start this component.
     * Should not throw an exception if the component is already running.
     * <p>In the case of a container, this will propagate the start signal
     * to all components that apply.
     */
    void start();

    /**
     * Stop this component.
     * Should not throw an exception if the component isn't started yet.
     * <p>In the case of a container, this will propagate the stop signal
     * to all components that apply.
     */
    void stop();

    /**
     * Check whether this component is currently running.
     * <p>In the case of a container, this will return <code>true</code>
     * only if <i>all</i> components that apply are currently running.
     * @return whether the component is currently running
     */
    boolean isRunning();

}

From this we can see that Heritrix 3.1.0 relies on the Spring container to manage the life cycle of its beans in one place (mainly their initialization state).
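To make the propagation concrete, here is a toy container that forwards start() to its registered beans and starts a bean's dependencies before the bean itself. It is a simplified model, not Spring's code: `MiniLifecycle`, `MiniBean`, `MiniContainer` and the explicit dependency lists are hypothetical, and the three bean names are used only as labels. The `name:...||class` line it prints imitates the trace format used in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Same shape as Spring's Lifecycle, redeclared so the sketch needs no Spring jar. */
interface MiniLifecycle {
    void start();
    void stop();
    boolean isRunning();
}

class MiniBean implements MiniLifecycle {
    private boolean running;
    public void start() { running = true; }
    public void stop() { running = false; }
    public boolean isRunning() { return running; }
}

class MiniContainer {
    private final Map<String, MiniLifecycle> beans = new LinkedHashMap<>();
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();
    final List<String> startOrder = new ArrayList<>();

    void register(String name, MiniLifecycle bean, String... dependencies) {
        beans.put(name, bean);
        dependsOn.put(name, Arrays.asList(dependencies));
    }

    /** Propagate the start signal to every registered bean. */
    void start() {
        Map<String, MiniLifecycle> pending = new LinkedHashMap<>(beans);
        for (String name : new ArrayList<>(pending.keySet())) {
            doStart(pending, name);
        }
    }

    private void doStart(Map<String, MiniLifecycle> pending, String name) {
        MiniLifecycle bean = pending.remove(name); // removal also guards against cycles
        if (bean == null) {
            return; // already handled, or not a lifecycle bean
        }
        for (String dependency : dependsOn.get(name)) {
            doStart(pending, dependency); // dependencies first
        }
        if (!bean.isRunning()) {
            bean.start();
            startOrder.add(name);
            System.out.println("name:" + name + "||" + bean.getClass().getName());
        }
    }
}

class MiniLifecycleDemo {
    static List<String> startOrder() {
        MiniContainer container = new MiniContainer();
        container.register("preconditions", new MiniBean(), "serverCache");
        container.register("serverCache", new MiniBean(), "bdb");
        container.register("bdb", new MiniBean());
        container.start();
        return container.startOrder;
    }

    public static void main(String[] args) {
        System.out.println(startOrder());
    }
}
```

In this model a bean registered first can still start last when it depends on others, so the start order follows the dependency graph rather than the registration order.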

By adding print statements, I recorded which beans in the system had their start method called:

name:scope
name:loggerModule||org.archive.crawler.reporting.CrawlerLoggerModule
name:scope||org.archive.modules.deciderules.DecideRuleSequence

name:candidateScoper
name:candidateScoper||org.archive.crawler.prefetch.CandidateScoper

name:preparer
name:preparer||org.archive.crawler.prefetch.FrontierPreparer

name:candidateProcessors
name:candidateProcessors||org.archive.modules.CandidateChain

name:preselector
name:preselector||org.archive.crawler.prefetch.MyPreselector

name:preconditions
name:bdb||org.archive.bdb.BdbModule
name:serverCache||org.archive.modules.net.BdbServerCache
name:preconditions||org.archive.crawler.prefetch.PreconditionEnforcer

name:fetchDns
name:fetchDns||org.archive.modules.fetcher.FetchDNS

name:fetchHttp
name:cookieStorage||org.archive.modules.fetcher.BdbCookieStorage
name:fetchHttp||org.archive.modules.fetcher.FetchHTTP

name:extractorHttp
name:statisticsTracker||org.archive.crawler.reporting.StatisticsTracker
name:extractorHtml||org.archive.modules.extractor.ExtractorHTML
name:extractorCss||org.archive.modules.extractor.ExtractorCSS
name:extractorJs||org.archive.modules.extractor.ExtractorJS
name:extractorSwf||org.archive.modules.extractor.ExtractorSWF
name:fetchProcessors||org.archive.modules.FetchChain
name:warcWriter||org.archive.modules.writer.MyWriterProcessor
name:candidates||org.archive.crawler.postprocessor.CandidatesProcessor
name:disposition||org.archive.crawler.postprocessor.DispositionProcessor
name:dispositionProcessors||org.archive.modules.DispositionChain
name:crawlController||org.archive.crawler.framework.CrawlController
name:uriUniqFilter||org.archive.crawler.util.BdbUriUniqFilter
name:frontier||org.archive.crawler.frontier.BdbFrontier

name:actionDirectory
name:actionDirectory||org.archive.crawler.framework.ActionDirectory

name:checkpointService
name:checkpointService||org.archive.crawler.framework.CheckpointService

--------------------------------------------------------------------------- 

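This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025410.html

Original address: https://www.cnblogs.com/chenying99/p/3025410.html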