Heritrix 3.1.0 源码解析(十)

本文要分析的是Heritrix3.1.0的Frontier组件,先熟悉一下相关的UML类图

通过浏览该图,我们可以清楚的看出Frontier组件的相关接口和类的继承和调用关系,不必我再文字描述了 

在分析BdbFrontier类的相关方法之前,有必要熟悉一下在系统环境中该对象的初始状态,以及状态的改变

这部分内容需要回顾 Heritrix 3.1.0 源码解析(六)一文,本人就不在这里重复了 

BdbFrontier类重要的有几个方法,分别是添加CrawlURI,获取CrawlURI,完成CrawlURI

void schedule(CrawlURI curi)

CrawlURI next()

void finished(CrawlURI curi) 

下面首先来分析void schedule(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面 

org.archive.crawler.frontier.BdbFrontier

    org.archive.crawler.frontier.WorkQueueFrontier

/**
     * Arrange for the given CrawlURI to be visited, if it is not
     * already enqueued/completed. 
     * 
     * Differs from superclass in that it operates in calling thread, rather 
     * than deferring operations via in-queue to managerThread. TODO: settle
     * on either defer or in-thread approach after testing. 
     *
     * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI)
     */
    @Override
    public void schedule(CrawlURI curi) {
        sheetOverlaysManager.applyOverlaysTo(curi);
        try {
            KeyedProperties.loadOverridesFrom(curi);
            if(curi.getClassKey()==null) {
                // remedial processing
                preparer.prepare(curi);
            }
            processScheduleIfUnique(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi); 
        }
    }

org.archive.crawler.frontier.BdbFrontier

    org.archive.crawler.frontier.WorkQueueFrontier

 /**
     * Arrange for the given CrawlURI to be visited, if it is not
     * already scheduled/completed.
     *
     * @see org.archive.crawler.framework.Frontier#schedule(org.archive.modules.CrawlURI)
     */
    protected void processScheduleIfUnique(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        assert KeyedProperties.overridesActiveFrom(curi); 
        
        // Canonicalization may set forceFetch flag.  See
        // #canonicalization(CrawlURI) javadoc for circumstance.
        String canon = curi.getCanonicalString();
        if (curi.forceFetch()) {
            uriUniqFilter.addForce(canon, curi);
        } else {
            uriUniqFilter.add(canon, curi);
        }
    }

void processScheduleIfUnique(CrawlURI curi)方法调用了BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法

在分析BdbUriUniqFilter类成员的void add(String key, CrawlURI value)方法前,首先有必要查看一下该对象的状态

org.archive.crawler.util.BdbUriUniqFilter类在它的初始化方法里面初始化下面成员变量,前者是BDB数据库,后者是BDB数据库的数据对象(在BdbUriUniqFilter类方法里面我还没找发现该变量的作用)

protected transient Database alreadySeen = null;
protected transient DatabaseEntry value = null;

public void start() {
        if(isRunning()) {
            return; 
        }
        boolean isRecovery = (recoveryCheckpoint != null);
        try {
            BdbModule.BdbConfig config = getDatabaseConfig();
            config.setAllowCreate(!isRecovery);
            initialize(bdb.openDatabase(DB_NAME, config, isRecovery));
        } catch (DatabaseException e) {
            throw new IllegalStateException(e);
        }
        if(isRecovery) {
            JSONObject json = recoveryCheckpoint.loadJson(beanName);
            try {
                count.set(json.getLong("count"));
            } catch (JSONException e) {
                throw new RuntimeException(e);
            }           
        }
        isRunning = true; 
    }

 进一步调用 void initialize(Database db)

/**
     * Method shared by constructors.
     * @param env Environment to use.
     * @throws DatabaseException
     */
    protected void initialize(Database db) throws DatabaseException {
        open(db);
    }

初始化数据库Database alreadySeen,void open(final Database db)

protected void open(final Database db)
    throws DatabaseException {
        this.alreadySeen = db;
        this.value = new DatabaseEntry("".getBytes());
    }

另外在BdbFrontier对象的初始化方法里面我们发现执行了BdbUriUniqFilter类成员的void setDestination(CrawlUriReceiver receiver)方法,这里是设置BdbFrontier对象本身,在BdbUriUniqFilter对象的相关方法里面调用了BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口),这里类似于回调接口

org.archive.crawler.frontier.BdbFrontier

     org.archive.crawler.frontier.WorkQueueFrontier

public void start() {
        if(isRunning()) {
            return; 
        }
        uriUniqFilter.setDestination(this);
        super.start();
        try {
            initInternalQueues();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }

    }

BdbUriUniqFilter对象的状态分析完毕,现在我们再来分析上面uriUniqFilter.add(canon, curi)方法,在BdbUriUniqFilter的父类里面

org.archive.crawler.util.BdbUriUniqFilter

     org.archive.crawler.util.SetBasedUriUniqFilter

public void add(String key, CrawlURI value) {
        profileLog(key);
        if (setAdd(key)) {
            this.receiver.receive(value);
            if (setCount() % 50000 == 0) {
                LOGGER.log(Level.FINE, "count: " + setCount() + " totalDups: "
                        + duplicateCount + " recentDups: "
                        + (duplicateCount - duplicatesAtLastSample));
                duplicatesAtLastSample = duplicateCount;
            }
        } else {
            duplicateCount++;
        }
    }

进一步调用BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法

org.archive.crawler.util.BdbUriUniqFilter 

protected boolean setAdd(CharSequence uri) {
        DatabaseEntry key = new DatabaseEntry();
        LongBinding.longToEntry(createKey(uri), key);
        long started = 0;
        
        OperationStatus status = null;
        try {
            if (logger.isLoggable(Level.INFO)) {
                started = System.currentTimeMillis();
            }
            status = alreadySeen.putNoOverwrite(null, key, ZERO_LENGTH_ENTRY);
            if (logger.isLoggable(Level.INFO)) {
                aggregatedLookupTime +=
                    (System.currentTimeMillis() - started);
            }
        } catch (DatabaseException e) {
            logger.severe(e.getMessage());
        }
        if (status == OperationStatus.SUCCESS) {
            count.incrementAndGet();
            if (logger.isLoggable(Level.FINE)) {
                final int logAt = 10000;
                if (count.get() > 0 && ((count.get() % logAt) == 0)) {
                    logger.fine("Average lookup " +
                        (aggregatedLookupTime / logAt) + "ms.");
                    aggregatedLookupTime = 0;
                }
            }
        }
        if(status == OperationStatus.KEYEXIST) {
            return false; // not added
        } else {
            return true;
        }
    }

 这里有意思的是,我们可以看到Heritrix是怎么实现URL排重的

**
     * Create fingerprint.
     * Pubic access so test code can access createKey.
     * @param uri URI to fingerprint.
     * @return Fingerprint of passed <code>url</code>.
     */
    public static long createKey(CharSequence uri) {
        String url = uri.toString();
        int index = url.indexOf(COLON_SLASH_SLASH);
        if (index > 0) {
            index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
        }
        CharSequence hostPlusScheme = (index == -1)? url: url.subSequence(0, index);
        long tmp = FPGenerator.std24.fp(hostPlusScheme);
        return tmp | (FPGenerator.std40.fp(url) >>> 24);
    }

这里需要注意的是Database alreadySeen数据库并没用真正存储URL链接地址,而是URL链接的Fingerprint值

BdbUriUniqFilter本身的boolean setAdd(CharSequence uri)方法的真正意义在于判断URL链接是否添加过,除此之外,大概没有其他含义

接下来我们看到void add(String key, CrawlURI value)方法后面调用了this.receiver.receive(value)方法

我们前面分析BdbUriUniqFilter初始化状态的时候已经提到了,BdbUriUniqFilter对象是通过回调BdbFrontier对象的void receive(CrawlURI item)方法(BdbFrontier类实现了CrawlUriReceiver接口)

BdbFrontier类的父类的父类AbstractFrontier我们发现该方法void receive(CrawlURI curi)

 org.archive.crawler.frontier.BdbFrontier

       org.archive.crawler.frontier.AbstractFrontier

/**
     * Accept the given CrawlURI for scheduling, as it has
     * passed the alreadyIncluded filter. 
     * 
     * Choose a per-classKey queue and enqueue it. If this
     * item has made an unready queue ready, place that 
     * queue on the readyClassQueues queue. 
     * @param caUri CrawlURI.
     */
    public void receive(CrawlURI curi) {
        sheetOverlaysManager.applyOverlaysTo(curi);
        // prefer doing asap if already in manager thread
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processScheduleAlways(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi); 
        }
    }

继续调用BdbFrontier类的void processScheduleAlways(CrawlURI curi)方法,在它的父类WorkQueueFrontier里面

 org.archive.crawler.frontier.BdbFrontier

        org.archive.crawler.frontier.WorkQueueFrontier 

/**
     * Accept the given CrawlURI for scheduling, as it has
     * passed the alreadyIncluded filter. 
     * 
     * Choose a per-classKey queue and enqueue it. If this
     * item has made an unready queue ready, place that 
     * queue on the readyClassQueues queue. 
     * @param caUri CrawlURI.
     */
    protected void processScheduleAlways(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        assert KeyedProperties.overridesActiveFrom(curi); 
        
        prepForFrontier(curi);
        sendToQueue(curi);
    }

void prepForFrontier(CrawlURI curi)方法是设置curi的序号

void sendToQueue(CrawlURI curi)方法添加CrawlURI curi到工作队列里面(该方法是由BdbFrontier类的void schedule(CrawlURI curi)发起的调用的方法里面的实质性方法),在BdbFrontier类的父类WorkQueueFrontier里面

 org.archive.crawler.frontier.BdbFrontier

          org.archive.crawler.frontier.WorkQueueFrontier

/**
     * Send a CrawlURI to the appropriate subqueue.
     * 
     * @param curi
     */
    protected void sendToQueue(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        
        WorkQueue wq = getQueueFor(curi.getClassKey());
        synchronized(wq) {
            int originalPrecedence = wq.getPrecedence();
            wq.enqueue(this, curi);
            // always take budgeting values from current curi
            // (whose overlay settings should be active here)
            wq.setSessionBudget(getBalanceReplenishAmount());
            wq.setTotalBudget(getQueueTotalBudget());
            
            if(!wq.isRetired()) {
                incrementQueuedUriCount();
                int currentPrecedence = wq.getPrecedence();
                if(!wq.isManaged() || currentPrecedence < originalPrecedence) {
                    // queue newly filled or bumped up in precedence; ensure enqueuing
                    // at precedence level (perhaps duplicate; if so that's handled elsewhere)
                    deactivateQueue(wq);
                }
            }
        }
        // Update recovery log.
        doJournalAdded(curi);
        wq.makeDirty();
        largestQueues.update(wq.getClassKey(), wq.getCount());
    }

void sendToQueue(CrawlURI curi)方法首先是根据CrawlURI curi的ClassKey得到工作队列,这里是BdbWorkQueue(后面还设置了BdbWorkQueue对象的相关属性,涉及到Heritrix3.1.0工作队列的调度,后文再分析),

然后调用它的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,在BdbWorkQueue类的父类WorkQueue里面

org.archive.crawler.frontier.BdbWorkQueue

         org.archive.crawler.frontier.WorkQueue

/**
     * Add the given CrawlURI, noting its addition in running count. (It
     * should not already be present.)
     * 
     * @param frontier Work queues manager.
     * @param curi CrawlURI to insert.
     */
    protected synchronized long enqueue(final WorkQueueFrontier frontier,
        CrawlURI curi) {
        try {
            insert(frontier, curi, false);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        count++;
        enqueueCount++;
        return count;
    }

继续调用BdbWorkQueue对象的void insert(final WorkQueueFrontier frontier, CrawlURI curi,boolean overwriteIfPresent)方法,也在在BdbWorkQueue类的父类WorkQueue里面

org.archive.crawler.frontier.BdbWorkQueue

      org.archive.crawler.frontier.WorkQueue

/**
     * Insert the given curi, whether it is already present or not. 
     * @param frontier WorkQueueFrontier.
     * @param curi CrawlURI to insert.
     * @throws IOException
     */
    private void insert(final WorkQueueFrontier frontier, CrawlURI curi,
            boolean overwriteIfPresent)
        throws IOException {
        insertItem(frontier, curi, overwriteIfPresent);
        lastQueued = curi.toString();
    }

进一步调用BdbWorkQueue对象的void insertItem(final WorkQueueFrontier frontier, final CrawlURI curi, boolean overwriteIfPresent)方法

org.archive.crawler.frontier.BdbWorkQueue

protected void insertItem(final WorkQueueFrontier frontier,
            final CrawlURI curi, boolean overwriteIfPresent) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
            queues.put(curi, overwriteIfPresent);
            if (LOGGER.isLoggable(Level.FINE)) {
                LOGGER.fine("Inserted into " + getPrefixClassKey(this.origin) +
                    " (count " + Long.toString(getCount())+ "): " +
                        curi.toString());
            }
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

我们从这里可以看到,最终是调用BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,将CrawlURI curi放入待采集队列

现在总结一下:

BdbFrontier对象的void schedule(CrawlURI curi)方法首先通过调用BdbUriUniqFilter对象void add(String key, CrawlURI value)方法,

里面先判断当前CrawlURI curi是否已经添加过,然后回调BdbFrontier对象的void receive(CrawlURI curi)方法,

而BdbFrontier对象的void receive(CrawlURI curi)方法:根据当前CrawlURI curi的ClassKey获取相应的BdbWorkQueue对象,然后调用BdbWorkQueue对象的long enqueue(final WorkQueueFrontier frontier,CrawlURI curi)方法,最后调用将CrawlURI curi添加到BdbFrontier的成员变量BdbMultipleWorkQueues pendingUris的void put(CrawlURI curi, boolean overwriteIfPresent) 方法,归入pendingUris队列

---------------------------------------------------------------------------

本系列Heritrix 3.1.0 源码解析系本人原创

转载请注明出处 博客园 刺猬的温驯

本文链接 http://www.cnblogs.com/chenying99/archive/2013/04/16/3023659.html

原文地址:https://www.cnblogs.com/chenying99/p/3023659.html