Heritrix 3.1.0 Source Code Analysis (21)

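The abstract class Scoper discussed in the previous article holds another member variable, DecideRule scope. I have to pause the analysis of the processor classes (it will resume later) and take a detour to describe the DecideRule scope object. As mentioned before, the DecideRule scope member controls which CrawlURI caUri objects fall within the crawl scope.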

As usual, let's start with the class diagram of the DecideRule-related classes.

DecideRule is an abstract class that decides whether a CrawlURI caUri object is accepted or rejected.

    public DecideResult decisionFor(CrawlURI uri) {
        if (!getEnabled()) {
            return DecideResult.NONE;
        }
        DecideResult result = innerDecide(uri);
        if (result == DecideResult.NONE) {
            return result;
        }

        return result;
    }
    
    
    protected abstract DecideResult innerDecide(CrawlURI uri);
    
    
    public DecideResult onlyDecision(CrawlURI uri) {
        return null;
    }

    public boolean accepts(CrawlURI uri) {
        return DecideResult.ACCEPT == decisionFor(uri);
    }

The abstract method DecideResult innerDecide(CrawlURI uri) above is implemented by subclasses.
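The relationship between decisionFor, innerDecide and accepts can be sketched with a few simplified stand-in types. This is only an illustration: Result, Rule and HttpsOnlyRule are hypothetical names, URIs are plain strings instead of CrawlURI, and none of this is the real Heritrix code.

```java
// Simplified stand-ins for illustration; not the real Heritrix classes.
class DecideRuleSketch {

    enum Result { ACCEPT, NONE, REJECT }

    /** Hypothetical, stripped-down analogue of DecideRule. */
    static abstract class Rule {
        private boolean enabled = true;

        void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }

        /** Template method: a disabled rule abstains, otherwise the subclass decides. */
        final Result decisionFor(String uri) {
            if (!enabled) {
                return Result.NONE;
            }
            return innerDecide(uri);
        }

        /** Hook supplied by each concrete rule. */
        protected abstract Result innerDecide(String uri);

        /** True only when the decision is an explicit ACCEPT. */
        boolean accepts(String uri) {
            return decisionFor(uri) == Result.ACCEPT;
        }
    }

    /** Hypothetical rule: accepts https URIs and abstains on everything else. */
    static class HttpsOnlyRule extends Rule {
        @Override
        protected Result innerDecide(String uri) {
            return uri.startsWith("https://") ? Result.ACCEPT : Result.NONE;
        }
    }

    public static void main(String[] args) {
        Rule rule = new HttpsOnlyRule();
        System.out.println(rule.accepts("https://example.com/"));     // true
        System.out.println(rule.accepts("http://example.com/"));      // false
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("https://example.com/")); // NONE
    }
}
```

Note that accepts returns false both for REJECT and for NONE, so a rule that abstains is treated the same as one that rejects when it is asked directly.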

DecideResult is an enum with three values.

/**
 * The decision of a DecideRule.
 * 
 * @author pjack
 */
public enum DecideResult {

    /** Indicates the URI was accepted. */
    ACCEPT, 
    
    /** Indicates the URI was neither accepted nor rejected. */
    NONE, 
    
    /** Indicates the URI was rejected. */
    REJECT;

    
    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
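To see what invert does with each value, here is a small self-contained sketch. The enum is copied into a hypothetical wrapper class InvertSketch so the example runs without Heritrix on the classpath.

```java
// Standalone copy of the enum for illustration; InvertSketch is a hypothetical wrapper.
class InvertSketch {

    enum DecideResult {
        ACCEPT, NONE, REJECT;

        static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result; // NONE stays NONE
            }
        }
    }

    public static void main(String[] args) {
        // Print the inverted value for each of the three constants.
        for (DecideResult r : DecideResult.values()) {
            System.out.println(r + " -> " + DecideResult.invert(r));
        }
    }
}
```

ACCEPT and REJECT swap, while NONE is returned unchanged, so inverting an abstaining rule still yields an abstention.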

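Next, let's look at its important subclass DecideRuleSequence. This class holds a collection of DecideRule objects, and its DecideResult innerDecide(CrawlURI uri) method iterates over that collection, calling each element's DecideResult decisionFor(CrawlURI uri) method (the Composite pattern combined with the Iterator pattern).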

    @SuppressWarnings("unchecked")
    public List<DecideRule> getRules() {
        return (List<DecideRule>) kp.get("rules");
    }
    public void setRules(List<DecideRule> rules) {
        kp.put("rules", rules);
    }

    public DecideResult innerDecide(CrawlURI uri) {
        DecideRule decisiveRule = null;
        int decisiveRuleNumber = -1;
        DecideResult result = DecideResult.NONE;
        List<DecideRule> rules = getRules();
        int max = rules.size();
        
        for (int i = 0; i < max; i++) {
            DecideRule rule = rules.get(i);
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.finest("DecideRule #" + i + " " + 
                            rule.getClass().getName() + " returned " + r + " for url: " + uri);
                }
                if (r != DecideResult.NONE) {
                    result = r;
                    decisiveRule = rule;
                    decisiveRuleNumber = i;
                }
            }
        }

        if (fileLogger != null) {
            fileLogger.info(decisiveRuleNumber + " " + decisiveRule.getClass().getSimpleName() + " " + result + " " + uri);
        }

        return result;
    }
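The key point of the loop above is that the last rule returning something other than NONE wins. A minimal sketch of just that logic follows, assuming plain string URIs and rules modelled as java.util.function.Function; SequenceSketch, Result and the three sample rules are hypothetical, and the onlyDecision shortcut and the logging are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified model of "last non-NONE decision wins"; not the real Heritrix classes.
class SequenceSketch {

    enum Result { ACCEPT, NONE, REJECT }

    /** Runs every rule in order and keeps the most recent non-NONE decision. */
    static Result decide(List<Function<String, Result>> rules, String uri) {
        Result result = Result.NONE;
        for (Function<String, Result> rule : rules) {
            Result r = rule.apply(uri);
            if (r != Result.NONE) {
                result = r; // a later rule overrides any earlier decision
            }
        }
        return result;
    }

    /** Hypothetical rules: reject all, accept one site, then reject .exe files. */
    static List<Function<String, Result>> sampleRules() {
        List<Function<String, Result>> rules = new ArrayList<>();
        rules.add(uri -> Result.REJECT);
        rules.add(uri -> uri.startsWith("http://example.com/") ? Result.ACCEPT : Result.NONE);
        rules.add(uri -> uri.endsWith(".exe") ? Result.REJECT : Result.NONE);
        return rules;
    }

    public static void main(String[] args) {
        List<Function<String, Result>> rules = sampleRules();
        System.out.println(decide(rules, "http://example.com/a.html")); // ACCEPT
        System.out.println(decide(rules, "http://example.com/a.exe"));  // REJECT
        System.out.println(decide(rules, "http://other.org/"));         // REJECT
    }
}
```

Because every rule is consulted, the order of the list matters: moving the reject-all rule to the end would reject everything.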

At runtime, the elements of this collection can be seen in the crawler-beans.cxml configuration file.

<!-- SCOPE: rules for which discovered URIs to crawl; order is very 
      important because last decision returned other than 'NONE' wins. -->
 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <property name="rules">
   <list>
    <!-- Begin by REJECTing all... -->
    <bean class="org.archive.modules.deciderules.RejectDecideRule">
    </bean>
    <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
     <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
     <!-- <property name="alsoCheckVia" value="false" /> -->
     <!-- <property name="surtsSourceFile" value="" /> -->
     <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
     <!-- <property name="surtsSource">
           <bean class="org.archive.spring.ConfigString">
            <property name="value">
             <value>
              # example.com
              # http://www.example.edu/path1/
              # +http://(org,example,
             </value>
            </property> 
           </bean>
          </property> -->
    </bean>
    <!-- ...but REJECT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
     <!-- <property name="maxTransHops" value="2" /> -->
     <!-- <property name="maxSpeculativeHops" value="1" /> -->
    </bean>
    <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="seedsAsSurtPrefixes" value="false"/>
          <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" /> 
     <!-- <property name="surtsSource">
           <bean class="org.archive.spring.ConfigFile">
            <property name="path" value="negative-surts.txt" />
           </bean>
          </property> -->
    </bean>
    <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
          <property name="decision" value="REJECT"/>
     <!-- <property name="listLogicalOr" value="true" /> -->
     <!-- <property name="regexList">
           <list>
           </list>
          </property> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
   </list>
  </property>
 </bean>

The abstract class PredicatedDecideRule extends DecideRule.

    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        if (evaluate(uri)) {
            return getDecision();
        }
        return DecideResult.NONE;
    }

    protected abstract boolean evaluate(CrawlURI object);

The boolean evaluate(CrawlURI object) method is implemented by subclasses.
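In other words, a subclass only answers "does this URI match?", and the configured decision (for example the decision property set to REJECT in crawler-beans.cxml) is returned on a match. A rough sketch of that split follows, with hypothetical names throughout: PredicatedSketch, PredicatedRule and TooManySlashesRule are illustrative, URIs are plain strings, and the ACCEPT default is an assumption of this sketch.

```java
// Simplified stand-ins for illustration; not the real Heritrix classes.
class PredicatedSketch {

    enum Result { ACCEPT, NONE, REJECT }

    /** Hypothetical analogue of PredicatedDecideRule. */
    static abstract class PredicatedRule {
        // Default chosen for this sketch only.
        private Result decision = Result.ACCEPT;

        Result getDecision() {
            return decision;
        }

        void setDecision(Result decision) {
            this.decision = decision;
        }

        /** Returns the configured decision on a match, otherwise abstains. */
        final Result innerDecide(String uri) {
            return evaluate(uri) ? getDecision() : Result.NONE;
        }

        /** The predicate each concrete rule must supply. */
        protected abstract boolean evaluate(String uri);
    }

    /** Hypothetical rule: matches URIs containing more than maxSlashes '/' characters. */
    static class TooManySlashesRule extends PredicatedRule {
        private final int maxSlashes;

        TooManySlashesRule(int maxSlashes) {
            this.maxSlashes = maxSlashes;
            setDecision(Result.REJECT);
        }

        @Override
        protected boolean evaluate(String uri) {
            long slashes = uri.chars().filter(c -> c == '/').count();
            return slashes > maxSlashes;
        }
    }

    public static void main(String[] args) {
        PredicatedRule rule = new TooManySlashesRule(4);
        System.out.println(rule.innerDecide("http://example.com/a"));     // NONE
        System.out.println(rule.innerDecide("http://example.com/a/b/c")); // REJECT
    }
}
```

Separating the predicate from the decision lets the same matching logic be reused as either an ACCEPT rule or a REJECT rule purely through configuration.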

I won't go through the other related implementation classes one by one.

---------------------------------------------------------------------------

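This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园, 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037547.html

Original URL: https://www.cnblogs.com/chenying99/p/3037547.html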