Heritrix 3.1.0 Source Code Analysis (Part 20)

This post continues from the previous one and analyzes the processors attached to the candidate processor chain (CandidateChain candidateChain).

The CandidateChain processor chain has two processors:

org.archive.crawler.prefetch.CandidateScoper

org.archive.crawler.prefetch.FrontierPreparer

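To understand these processors, we first need to look at another abstract class, Scoper, which extends the abstract parent class Processor. It checks whether a CrawlURI caUri object falls within the crawl scope, and it holds a member variable DecideRule scope: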

    protected DecideRule scope;
    public DecideRule getScope() {
        return this.scope;
    }
    @Autowired
    public void setScope(DecideRule scope) {
        this.scope = scope;
    }

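The key methods of this class are shown below. isInScope calls the DecideResult decisionFor(CrawlURI uri) method of the DecideRule scope member and treats only ACCEPT as in scope: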

    /**
     * Schedule the given {@link CrawlURI CrawlURI} with the Frontier.
     * @param caUri The CrawlURI to be scheduled.
     * @return true if CrawlURI was accepted by crawl scope, false
     * otherwise.
     */
    protected boolean isInScope(CrawlURI caUri) {
        boolean result = false;
        //System.out.println(this.getClass().getName()+":"+"scope name:"+scope.getClass().getName());
        DecideResult dr = scope.decisionFor(caUri);
        if (dr == DecideResult.ACCEPT) {
            result = true;
            if (fileLogger != null) {
                fileLogger.info("ACCEPT " + caUri); 
            }
        } else {
            outOfScope(caUri);
        }
        return result;
    }
    
    /**
     * Called when a CrawlURI is ruled out of scope.
     * Override if you don't want logs as coming from this class.
     * @param caUri CrawlURI that is out of scope.
     */
    protected void outOfScope(CrawlURI caUri) {
        if (fileLogger != null) {
            fileLogger.info("REJECT " + caUri); 
        }
    }

Subclasses call the method above to decide whether a CrawlURI caUri object is out of scope. CandidateScoper and FrontierPreparer are both subclasses, as is Preselector, among others.
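
The scope check can be tried out in isolation. Below is a minimal sketch, not the Heritrix code: Decision and Rule are hypothetical stand-ins for DecideResult and DecideRule, and a plain String replaces CrawlURI. It only illustrates that anything other than ACCEPT is treated as out of scope.

```java
// Simplified stand-ins for illustration; these are not the Heritrix classes.
class ScopeSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    /** Stand-in for DecideRule: maps a URI string to a decision. */
    interface Rule {
        Decision decisionFor(String uri);
    }

    /** Same shape as Scoper.isInScope: only ACCEPT counts as in scope. */
    static boolean isInScope(Rule scope, String uri) {
        return scope.decisionFor(uri) == Decision.ACCEPT;
    }

    public static void main(String[] args) {
        // Hypothetical rule: accept only URIs under http://example.org/.
        Rule rule = uri -> uri.startsWith("http://example.org/")
                ? Decision.ACCEPT : Decision.REJECT;
        System.out.println(isInScope(rule, "http://example.org/a"));
        System.out.println(isInScope(rule, "http://other.test/b"));
    }
}
```

Note that a rule answering NONE is rejected just like REJECT, because the comparison is against ACCEPT only.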

CandidateScoper itself is simple: it overrides the ProcessResult innerProcessResult(CrawlURI curi) method of Processor:

    @Override
    protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException {
        if (!isInScope(curi)) {
            // Scope rejected
            curi.setFetchStatus(S_OUT_OF_SCOPE);
            return ProcessResult.FINISH;
        }
        return ProcessResult.PROCEED;
    }

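The expression !isInScope(curi) calls the method inherited from the abstract parent class Scoper to check whether the current CrawlURI curi object is out of scope.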
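
To see what returning FINISH instead of PROCEED means for the rest of the chain, here is a rough sketch under one assumption: that FINISH causes the remaining processors in the chain to be skipped. Step, Result, run and candidateChain are hypothetical names for this illustration and do not reproduce the Heritrix chain implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-in types for illustration; not the Heritrix chain implementation.
class ChainSketch {
    enum Result { PROCEED, FINISH }

    /** A named processing step (hypothetical stand-in for a Processor). */
    static final class Step {
        final String name;
        final Function<String, Result> body;

        Step(String name, Function<String, Result> body) {
            this.name = name;
            this.body = body;
        }
    }

    /** Runs steps in order, stopping after the first FINISH; returns the names that ran. */
    static List<String> run(List<Step> chain, String uri) {
        List<String> ran = new ArrayList<>();
        for (Step step : chain) {
            ran.add(step.name);
            if (step.body.apply(uri) == Result.FINISH) {
                break; // assumed meaning of FINISH: skip the remaining steps
            }
        }
        return ran;
    }

    /** Two-step chain modelled on CandidateScoper followed by FrontierPreparer. */
    static List<Step> candidateChain() {
        List<Step> chain = new ArrayList<>();
        chain.add(new Step("scoper", uri ->
                uri.startsWith("http://example.org/") ? Result.PROCEED : Result.FINISH));
        chain.add(new Step("preparer", uri -> Result.PROCEED));
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(run(candidateChain(), "http://example.org/a"));
        System.out.println(run(candidateChain(), "http://other.test/b"));
    }
}
```

Under that assumption, an out-of-scope URI never reaches the preparer step, so no scheduling values are computed for it.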

FrontierPreparer mainly sets values on the CrawlURI curi object in preparation for fetching. (I did not find it calling any method of the abstract parent class Scoper.)

    /* (non-Javadoc)
     * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
     */
    @Override
    protected void innerProcess(CrawlURI curi) {
        prepare(curi);
    }
    
    /**
     * Apply all configured policies to CrawlURI
     * 
     * @param curi CrawlURI
     */
    public void prepare(CrawlURI curi) {
        
        // set schedulingDirective
        curi.setSchedulingDirective(getSchedulingDirective(curi));
            
        // set canonicalized version
        curi.setCanonicalString(canonicalize(curi));
        
        // set queue key
        curi.setClassKey(getClassKey(curi));
        
        // set cost
        curi.setHolderCost(getCost(curi));
        
        // set URI precedence
        getUriPrecedencePolicy().uriScheduled(curi);
    }

The void prepare(CrawlURI curi) method above sets these values on the CrawlURI curi object. The methods that compute each value are as follows:

    /**
     * Calculate the coarse, original 'schedulingDirective' prioritization
     * for the given CrawlURI
     * 
     * @param curi
     * @return
     */
    protected int getSchedulingDirective(CrawlURI curi) {
        if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
            char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
            if(lastHop == 'R') {
                // refer
                return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
            } 
        }
        if (getPreferenceDepthHops() == 0) {
            return HIGH;
            // this implies seed redirects are treated as path
            // length 1, which I believe is standard.
            // curi.getPathFromSeed() can never be null here, because
            // we're processing a link extracted from curi
        } else if (getPreferenceDepthHops() > 0 && 
            curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
            return HIGH;
        } else {
            // optionally preferencing embeds up to MEDIUM
            int prefHops = getPreferenceEmbedHops(); 
            if (prefHops > 0) {
                int embedHops = curi.getTransHops();
                if (embedHops > 0 && embedHops <= prefHops
                        && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                    // number of embed hops falls within the preferenced range, and
                    // uri is not already MEDIUM -- so promote it
                    return MEDIUM;
                }
            }
            // Everything else stays as previously assigned
            // (probably NORMAL, at least for now)
            return curi.getSchedulingDirective();
        }
    }
    /**
     * Canonicalize passed CrawlURI. This method differs from
     * {@link #canonicalize(UURI)} in that it takes a look at
     * the CrawlURI context possibly overriding any canonicalization effect if
     * it could make us miss content. If canonicalization produces an URL that
     * was 'alreadyseen', but the entry in the 'alreadyseen' database did
     * nothing but redirect to the current URL, we won't get the current URL;
     * we'll think we've already seen it. Examples would be archive.org
     * redirecting to www.archive.org or the inverse, www.netarkivet.net
     * redirecting to netarkivet.net (assuming stripWWW rule enabled).
     * <p>Note, this method under circumstance sets the forceFetch flag.
     * 
     * @param cauri CrawlURI to examine.
     * @return Canonicalized <code>cauri</code>.
     */
    protected String canonicalize(CrawlURI cauri) {
        String canon = getCanonicalizationPolicy().canonicalize(cauri.getURI());
        if (cauri.isLocation()) {
            // If the via is not the same as where we're being redirected (i.e.
            // we're not being redirected back to the same page), AND the
            // canonicalization of the via is equal to the current cauri,
            // THEN forcefetch (forcefetch so there is no chance of our not crawling
            // content because the alreadyseen check thinks it has seen the URL before).
            // An example of an URL that redirects to itself is:
            // http://bridalelegance.com/images/buttons3/tuxedos-off.gif.
            // An example of an URL whose canonicalization equals its via's
            // canonicalization, and we want to fetch content at the
            // redirection (i.e. need to set forcefetch), is netarkivet.dk.
            if (!cauri.toString().equals(cauri.getVia().toString()) &&
                    getCanonicalizationPolicy().canonicalize(
                            cauri.getVia().toCustomString()).equals(canon)) {
                cauri.setForceFetch(true);
            }
        }
        return canon;
    }
    
    /**
     * @param curi CrawlURI we're to get a key for.
     * @return a String token representing a queue
     */
    public String getClassKey(CrawlURI curi) {
        assert KeyedProperties.overridesActiveFrom(curi);      
        String queueKey = getQueueAssignmentPolicy().getClassKey(curi);
        return queueKey;
    }
    
    /**
     * Return the 'cost' of a CrawlURI (how much of its associated
     * queue's budget it depletes upon attempted processing)
     * 
     * @param curi
     * @return the associated cost
     */
    protected int getCost(CrawlURI curi) {
        assert KeyedProperties.overridesActiveFrom(curi);
        
        int cost = curi.getHolderCost();
        if (cost == CrawlURI.UNCALCULATED) {
            cost = getCostAssignmentPolicy().costOf(curi);
        }
        return cost;
    }
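
The branching in getSchedulingDirective is easier to follow with concrete inputs. The sketch below restates the same decision order as a pure function. The Directive enum, the parameter names, and passing transHops in directly are simplifications made for this illustration and are not part of Heritrix; pathFromSeed is assumed to be non-null.

```java
// Illustrative restatement of the branching above; not the Heritrix class.
class DirectiveSketch {
    enum Directive { HIGH, MEDIUM, NORMAL }

    static Directive directive(String pathFromSeed, int transHops, Directive current,
                               int preferenceDepthHops, int preferenceEmbedHops) {
        // 1. A trailing 'R' hop is handled first.
        if (!pathFromSeed.isEmpty()
                && pathFromSeed.charAt(pathFromSeed.length() - 1) == 'R') {
            return preferenceDepthHops >= 0 ? Directive.HIGH : Directive.MEDIUM;
        }
        // 2. A preference depth of 0 promotes everything to HIGH.
        if (preferenceDepthHops == 0) {
            return Directive.HIGH;
        }
        // 3. URIs within the configured hop depth are HIGH.
        if (preferenceDepthHops > 0 && pathFromSeed.length() + 1 <= preferenceDepthHops) {
            return Directive.HIGH;
        }
        // 4. Embeds within the configured embed range are promoted to MEDIUM.
        if (preferenceEmbedHops > 0 && transHops > 0 && transHops <= preferenceEmbedHops
                && current == Directive.NORMAL) {
            return Directive.MEDIUM;
        }
        // 5. Everything else keeps its previous directive.
        return current;
    }

    public static void main(String[] args) {
        System.out.println(directive("LR", 0, Directive.NORMAL, -1, 1));
        System.out.println(directive("L", 0, Directive.NORMAL, 2, 0));
        System.out.println(directive("LE", 1, Directive.NORMAL, -1, 1));
        System.out.println(directive("LL", 0, Directive.NORMAL, 2, 0));
    }
}
```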

These methods rely on the corresponding policy classes. That is a larger topic, which I will leave for a later post.
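
Even without the policy classes, the forceFetch condition in canonicalize(CrawlURI) can be illustrated with a toy canonicalizer. In the sketch below, the canonicalize function (lower-casing and stripping a leading "www.") is a hypothetical stand-in for whatever the configured canonicalization policy does, and the reachedByRedirect flag stands in for cauri.isLocation(), on the assumption that it marks a URI discovered through a redirect.

```java
import java.util.Locale;

// Toy model of the forceFetch condition; the canonicalizer is a made-up stand-in.
class CanonSketch {
    /** Hypothetical canonicalization: lower-case and strip a leading "www." from the host. */
    static String canonicalize(String uri) {
        return uri.toLowerCase(Locale.ROOT).replaceFirst("^(https?://)www\\.", "$1");
    }

    /**
     * True when a redirect target differs from its via as a string but
     * canonicalizes to the same value, so an already-seen check keyed on the
     * canonical form would otherwise skip it.
     */
    static boolean shouldForceFetch(String uri, String via, boolean reachedByRedirect) {
        return reachedByRedirect
                && !uri.equals(via)
                && canonicalize(via).equals(canonicalize(uri));
    }

    public static void main(String[] args) {
        // archive.org redirecting to www.archive.org: same canonical form, so force the fetch.
        System.out.println(shouldForceFetch("http://www.archive.org/", "http://archive.org/", true));
        // A URL redirecting to itself is not force-fetched.
        System.out.println(shouldForceFetch("http://archive.org/", "http://archive.org/", true));
    }
}
```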

Preselector filters CrawlURI curi objects using configurable regular expressions. It can also block every URI or recheck the scope:

    @Override
    protected ProcessResult innerProcessResult(CrawlURI puri) {
        CrawlURI curi = (CrawlURI)puri;
        
        // Check if uris should be blocked
        if (getBlockAll()) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }

        // Check if allowed by regular expression
        String regex = getAllowByRegex();
        if (regex != null && !regex.equals("")) {
            if (!TextUtils.matches(regex, curi.toString())) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
        }

        // Check if blocked by regular expression
        regex = getBlockByRegex();
        if (regex != null && !regex.equals("")) {
            if (TextUtils.matches(regex, curi.toString())) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
        }

        // Possibly recheck scope
        if (getRecheckScope()) {
            if (!isInScope(curi)) {
                // Scope rejected
                curi.setFetchStatus(S_OUT_OF_SCOPE);
                return ProcessResult.FINISH;
            }
        }
        
        return ProcessResult.PROCEED;
    }

A sample of the corresponding configuration in crawler-beans.cxml:

 <!-- FETCH CHAIN --> 
 <!-- first, processors are declared as top-level named beans -->
 <bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
      <!-- <property name="recheckScope" value="false" /> -->
      <!-- <property name="blockAll" value="false" /> -->
      <!-- <property name="blockByRegex" value="" /> -->
      <!-- <property name="allowByRegex" value="" /> -->
 </bean>
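
To get a feel for how the blockAll, allowByRegex and blockByRegex properties interact, the checks in innerProcessResult can be mimicked with java.util.regex. This is a sketch only: the passes method is a hypothetical helper, the scope recheck is left out, and it assumes the patterns must match the whole URI string, as Pattern.matches does.

```java
import java.util.regex.Pattern;

// Mimics the order of the Preselector checks; not the Heritrix class itself.
class PreselectSketch {
    /** Returns true if the URI survives the blockAll, allowByRegex and blockByRegex checks. */
    static boolean passes(String uri, boolean blockAll, String allowByRegex, String blockByRegex) {
        if (blockAll) {
            return false; // everything is blocked
        }
        // When an allow pattern is set, the URI must match it.
        if (allowByRegex != null && !allowByRegex.isEmpty()
                && !Pattern.matches(allowByRegex, uri)) {
            return false;
        }
        // When a block pattern is set, the URI must not match it.
        if (blockByRegex != null && !blockByRegex.isEmpty()
                && Pattern.matches(blockByRegex, uri)) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(passes("http://example.org/a.html", false, ".*\\.html", ""));
        System.out.println(passes("http://example.org/a.gif", false, "", ".*\\.gif"));
    }
}
```

With both patterns left empty and blockAll false, as in the commented-out defaults above, every URI passes these checks.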

---------------------------------------------------------------------------

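This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037360.html

Original URL: https://www.cnblogs.com/chenying99/p/3037360.html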