Heritrix 3.1.0 Source Code Analysis (Part 15)

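This post analyzes the scheduling mechanism of the WorkQueue (specifically BdbWorkQueue) in Heritrix 3.1.0. It is one of the more complex parts of the system, so what follows is only a tentative analysis, and this post may be revised.

In Heritrix 3.1.0 Source Code Analysis (Part 6) I touched on the initialization of the BdbFrontier object. Let's review it here.

We saw that the initialization method void start() in the WorkQueueFrontier class goes on to call void initInternalQueues().

void initInternalQueues() in turn calls void initOtherQueues() and void initAllQueues(). Both are abstract in the parent class and implemented by the subclass BdbFrontier.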

@Override
    protected void initOtherQueues() throws DatabaseException {
        boolean recycle = (recoveryCheckpoint != null);
        
        // tiny risk of OutOfMemoryError: if giant number of snoozed
        // queues all wake-to-ready at once
        readyClassQueues = new LinkedBlockingQueue<String>();

        inactiveQueuesByPrecedence = new ConcurrentSkipListMap<Integer,Queue<String>>();
        
        retiredQueues = bdb.getStoredQueue("retiredQueues", String.class, recycle);

        // primary snoozed queues
        snoozedClassQueues = new DelayQueue<DelayedWorkQueue>();
        // just in case: overflow for extreme situations
        snoozedOverflow = bdb.getStoredMap(
                "snoozedOverflow", Long.class, DelayedWorkQueue.class, true, false);
            
        this.futureUris = bdb.getStoredMap(
                "futureUris", Long.class, CrawlURI.class, true, recoveryCheckpoint!=null);
        
        // initialize master map in which other queues live
        this.pendingUris = createMultipleWorkQueues();
    }

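The method above mainly initializes the queues. A brief explanation of each:

readyClassQueues holds the keys of the queues that are ready to be crawled; [Queue type]

inactiveQueuesByPrecedence maps each precedence to the queue of inactive queues at that precedence (that queue stores classKeys); [Map type]

retiredQueues holds the keys of URL queues that will no longer be activated; [Queue type]

snoozedClassQueues holds all snoozed URL queues, wrapped as DelayedWorkQueue objects (each carrying the classKey) and ordered by wake time; [Queue type]

snoozedOverflow maps a wake time to a snoozed queue that overflowed from snoozedClassQueues (again a DelayedWorkQueue carrying the classKey); [Map type]

futureUris maps a scheduled time to a CrawlURI object. [Map type]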

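Note the type of snoozedClassQueues: DelayQueue<DelayedWorkQueue>. A DelayQueue holds objects that implement the Delayed interface, and an element can only be taken from the queue once its delay has expired. The queue is ordered: the head is the element whose delay expired furthest in the past, that is, the one with the earliest wake time.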
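The DelayQueue behaviour described here can be observed with a small, self-contained sketch. SnoozedKey is a hypothetical stand-in for DelayedWorkQueue, and the keys and delays are made up for illustration; this is not Heritrix code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for DelayedWorkQueue: a key plus an absolute wake time. */
class SnoozedKey implements Delayed {
    final String classKey;
    final long wakeTime; // epoch milliseconds

    SnoozedKey(String classKey, long wakeTime) {
        this.classKey = classKey;
        this.wakeTime = wakeTime;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // remaining time until wake-up; zero or negative means "expired"
        return unit.convert(wakeTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        if (this == other) {
            return 0;
        }
        SnoozedKey o = (SnoozedKey) other;
        int byTime = Long.compare(wakeTime, o.wakeTime);
        return byTime != 0 ? byTime : classKey.compareTo(o.classKey);
    }
}

class DelayQueueSketch {
    /** Records what an immediate poll() sees, then the order in which take() releases keys. */
    static List<String> releaseOrder() {
        long now = System.currentTimeMillis();
        DelayQueue<SnoozedKey> snoozed = new DelayQueue<>();
        snoozed.add(new SnoozedKey("slow.example", now + 600));
        snoozed.add(new SnoozedKey("fast.example", now + 300));

        List<String> order = new ArrayList<>();
        // poll() does not block and returns null while no element has expired
        SnoozedKey early = snoozed.poll();
        order.add(early == null ? "<none ready>" : early.classKey);
        try {
            order.add(snoozed.take().classKey); // blocks until the head expires
            order.add(snoozed.take().classKey);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(releaseOrder());
    }
}
```

The element added second is released first because its wake time is earlier; insertion order plays no role.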

The source of the DelayedWorkQueue class is as follows

/**
 * A named WorkQueue wrapped with a wake time, perhaps referenced only
 * by name. 
 * 
 * @contributor gojomo
 */
class DelayedWorkQueue implements Delayed, Serializable {
    private static final long serialVersionUID = 1L;

    public String classKey;
    public long wakeTime;
    
    /**
     * Reference to the WorkQueue, perhaps saving a deserialization
     * from allQueues.
     */
    protected transient WorkQueue workQueue;
    
    public DelayedWorkQueue(WorkQueue queue) {
        this.classKey = queue.getClassKey();
        this.wakeTime = queue.getWakeTime();
        this.workQueue = queue;
    }
    
    // TODO: consider if this should be method on WorkQueueFrontier
    public WorkQueue getWorkQueue(WorkQueueFrontier wqf) {
        if (workQueue == null) {
            // This is a recently deserialized DelayedWorkQueue instance
            WorkQueue result = wqf.getQueueFor(classKey);
            this.workQueue = result;
        }
        return workQueue;
    }

    public long getDelay(TimeUnit unit) {
        return unit.convert(
                wakeTime - System.currentTimeMillis(),
                TimeUnit.MILLISECONDS);
    }
    
    public String getClassKey() {
        return classKey;
    }
    
    public long getWakeTime() {
        return wakeTime;
    }
    
    public void setWakeTime(long time) {
        this.wakeTime = time;
    }
    
    public int compareTo(Delayed obj) {
        if (this == obj) {
            return 0; // for exact identity only
        }
        DelayedWorkQueue other = (DelayedWorkQueue) obj;
        if (wakeTime > other.getWakeTime()) {
            return 1;
        }
        if (wakeTime < other.getWakeTime()) {
            return -1;
        }
        // at this point, the ordering is arbitrary, but still
        // must be consistent/stable over time
        return this.classKey.compareTo(other.getClassKey());        
    }
    
}

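The class must implement long getDelay(TimeUnit unit) and int compareTo(Delayed obj), which the DelayQueue uses for expiry and ordering. As we can see, DelayedWorkQueue wraps a WorkQueue and sorts by the wake time taken from that WorkQueue, using classKey to break ties.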
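The transient workQueue field together with getWorkQueue(wqf) forms a lazy re-lookup: after deserialization only the key survives, and the queue is fetched again by name. A minimal sketch of that pattern follows, assuming plain Java serialization and a HashMap as stand-ins (the binding Heritrix actually uses for its BDB-backed maps may differ; QueueRef and TransientRefSketch are hypothetical names).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical analogue of DelayedWorkQueue: the key is persisted, the queue reference is not. */
class QueueRef implements Serializable {
    private static final long serialVersionUID = 1L;

    final String classKey;
    transient List<String> queue; // skipped by serialization, so null after a round trip

    QueueRef(String classKey, List<String> queue) {
        this.classKey = classKey;
        this.queue = queue;
    }

    /** Re-resolves the queue by key if this instance was recently deserialized. */
    List<String> resolve(Map<String, List<String>> allQueues) {
        if (queue == null) {
            queue = allQueues.get(classKey);
        }
        return queue;
    }
}

class TransientRefSketch {
    /** Serializes and deserializes the reference in memory. */
    static QueueRef roundTrip(QueueRef ref) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(ref);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (QueueRef) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> allQueues = new HashMap<>();
        List<String> queue = new ArrayList<>();
        allQueues.put("example.org", queue);

        QueueRef copy = roundTrip(new QueueRef("example.org", queue));
        System.out.println("queue after round trip: " + copy.queue);
        System.out.println("resolved same queue: " + (copy.resolve(allQueues) == queue));
    }
}
```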

@Override
    protected void initAllQueues() throws DatabaseException {
        boolean isRecovery = (recoveryCheckpoint != null);
        this.allQueues = bdb.getObjectCache("allqueues", isRecovery, WorkQueue.class, BdbWorkQueue.class);
        // remaining code omitted
    }

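The method above mainly initializes the ObjectIdentityCache<WorkQueue> allQueues field. It can be thought of as a cache that looks up BdbWorkQueue instances by classKey and creates them on a miss, so it behaves much like a BdbWorkQueue factory.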

Next, let's look at the methods related to BdbFrontier's void schedule(CrawlURI curi) method

/**
     * Send a CrawlURI to the appropriate subqueue.
     * 
     * @param curi
     */
    protected void sendToQueue(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;
        
        WorkQueue wq = getQueueFor(curi.getClassKey());
        synchronized(wq) {
            int originalPrecedence = wq.getPrecedence();
            wq.enqueue(this, curi);
            // always take budgeting values from current curi
            // (whose overlay settings should be active here)
            wq.setSessionBudget(getBalanceReplenishAmount());
            wq.setTotalBudget(getQueueTotalBudget());
            
            if(!wq.isRetired()) {
                incrementQueuedUriCount();
                int currentPrecedence = wq.getPrecedence();
                if(!wq.isManaged() || currentPrecedence < originalPrecedence) {
                    // queue newly filled or bumped up in precedence; ensure enqueuing
                    // at precedence level (perhaps duplicate; if so that's handled elsewhere)
                    deactivateQueue(wq);
                }
            }
        }
        // Update recovery log.
        doJournalAdded(curi);
        wq.makeDirty();
        largestQueues.update(wq.getClassKey(), wq.getCount());
    }

First, the BdbWorkQueue for the classKey is fetched from ObjectIdentityCache<WorkQueue> allQueues. The WorkQueue getQueueFor(final String classKey) method is in the BdbFrontier class

/**
     * Return the work queue for the given classKey, or null
     * if no such queue exists.
     * 
     * @param classKey key to look for
     * @return the found WorkQueue
     */
    protected WorkQueue getQueueFor(final String classKey) {      
        WorkQueue wq = allQueues.getOrUse(
                classKey,
                new Supplier<WorkQueue>() {
                    public BdbWorkQueue get() {
                        String qKey = new String(classKey); // ensure private minimal key
                        BdbWorkQueue q = new BdbWorkQueue(qKey, BdbFrontier.this);
                        q.setTotalBudget(getQueueTotalBudget()); 
                        getQueuePrecedencePolicy().queueCreated(q);
                        return q;
                    }});
        return wq;
    }

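When the BdbWorkQueue for a classKey is created, its long totalBudget field and PrecedenceProvider precedenceProvider field are set at the same time. Note that although the Javadoc mentions returning null, the supplier passed to getOrUse creates the queue when none exists for the key.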
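The get-or-create behaviour of getQueueFor can be sketched with standard library types. This is only an analogy: ConcurrentHashMap.computeIfAbsent and java.util.function.Supplier stand in for ObjectIdentityCache.getOrUse and the Supplier type used by Heritrix, and SimpleQueue, the counter, and the budget value are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class QueueCacheSketch {
    /** Hypothetical, much simplified work queue. */
    static final class SimpleQueue {
        final String classKey;
        long totalBudget;

        SimpleQueue(String classKey) {
            this.classKey = classKey;
        }
    }

    private final ConcurrentMap<String, SimpleQueue> allQueues = new ConcurrentHashMap<>();

    /** Counts how often the supplier actually ran (for illustration only). */
    final AtomicInteger created = new AtomicInteger();

    /** Returns the cached queue, or stores and returns the supplied one on a miss. */
    SimpleQueue getOrUse(String key, Supplier<SimpleQueue> supplier) {
        return allQueues.computeIfAbsent(key, k -> supplier.get());
    }

    SimpleQueue getQueueFor(String classKey) {
        return getOrUse(classKey, () -> {
            created.incrementAndGet();
            SimpleQueue q = new SimpleQueue(classKey);
            q.totalBudget = -1; // assumed placeholder for a configured budget
            return q;
        });
    }

    public static void main(String[] args) {
        QueueCacheSketch cache = new QueueCacheSketch();
        SimpleQueue first = cache.getQueueFor("example.org");
        SimpleQueue second = cache.getQueueFor("example.org");
        System.out.println("same instance: " + (first == second));
        System.out.println("queues created: " + cache.created.get());
    }
}
```

The point of the pattern is that per-queue setup (budget, precedence policy) runs once, at creation, and every later lookup for the same key returns the same instance.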

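Returning to void sendToQueue(CrawlURI curi): the rest of the method first locks the WorkQueue wq object (to keep multiple threads from writing at the same time), enqueues the CrawlURI into the BDB database, and sets the int sessionBudget and long totalBudget properties.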

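If the queue is not retired, the method then checks whether it is already managed, that is, within its lifecycle. If it is not yet managed, or if this enqueue raised its precedence (currentPrecedence < originalPrecedence, where a smaller value means a higher precedence), the queue is placed among the inactive queues at its precedence level. If necessary, highestPrecedenceWaiting is also reset to the smallest precedence value among the inactive queues.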
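The condition can be isolated as a small predicate. The sketch below mirrors only the boolean logic of sendToQueue; the class and method names are hypothetical, and the real method of course works on WorkQueue objects rather than on plain flags.

```java
class EnqueueDecisionSketch {
    /**
     * Decides whether a queue should be (re)entered into the inactive queues
     * after a CrawlURI was enqueued. A smaller precedence value means a
     * higher precedence.
     */
    static boolean shouldDeactivate(boolean retired, boolean managed,
                                    int originalPrecedence, int currentPrecedence) {
        if (retired) {
            return false; // retired queues are left alone
        }
        // newly filled queue, or queue whose precedence was bumped up
        return !managed || currentPrecedence < originalPrecedence;
    }

    public static void main(String[] args) {
        System.out.println(shouldDeactivate(false, false, 5, 5)); // new, unmanaged queue
        System.out.println(shouldDeactivate(false, true, 5, 5));  // managed, unchanged
        System.out.println(shouldDeactivate(false, true, 5, 3));  // managed, bumped up
        System.out.println(shouldDeactivate(true, false, 5, 3));  // retired
    }
}
```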

    /**
     * Put the given queue on the inactiveQueues queue
     * @param wq
     */
    protected void deactivateQueue(WorkQueue wq) {
        int precedence = wq.getPrecedence();

        synchronized(wq) {
            wq.noteDeactivated(); // sets active = false and isManaged = true
            inProcessQueues.remove(wq); // remove the queue from the in-process queues
            if(wq.getCount()==0) {
                System.err.println("deactivate empty queue?");
            }

            synchronized (getInactiveQueuesByPrecedence()) {
                getInactiveQueuesForPrecedence(precedence).add(wq.getClassKey());
                if(wq.getPrecedence() < highestPrecedenceWaiting ) {
                    highestPrecedenceWaiting = wq.getPrecedence();
                }
            }

            if(logger.isLoggable(Level.FINE)) {
                logger.log(Level.FINE,
                        "queue deactivated to p" + precedence 
                        + ": " + wq.getClassKey());
            }
        }
    }

getInactiveQueuesByPrecedence() returns the Map from precedence to the queue of inactive queues at that precedence (that queue stores classKeys); [Map type]

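getInactiveQueuesForPrecedence(precedence) returns the inactive queue for the given precedence, creating it if it does not exist. The classKey of the queue being deactivated is then added to that inactive queue.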

 /**
     * Returns the inactive queue for the given precedence (created on demand).
     * Get the queue of inactive uri-queue names at the given precedence. 
     * 
     * @param precedence
     * @return queue of inacti
     */
    protected Queue<String> getInactiveQueuesForPrecedence(int precedence) {
        // map from precedence to the queue of inactive queue keys
        Map<Integer,Queue<String>> inactiveQueuesByPrecedence = 
            getInactiveQueuesByPrecedence();
        Queue<String> candidate = inactiveQueuesByPrecedence.get(precedence);
        if(candidate==null) {
            candidate = createInactiveQueueForPrecedence(precedence);
            inactiveQueuesByPrecedence.put(precedence,candidate);
        }
        return candidate;
    }

The related methods are in its subclass BdbFrontier

/* (non-Javadoc)
     * Create the inactive queue for the given precedence
     * @see org.archive.crawler.frontier.WorkQueueFrontier#createInactiveQueueForPrecedence(int)
     */
    @Override
    Queue<String> createInactiveQueueForPrecedence(int precedence) {
        return createInactiveQueueForPrecedence(precedence, false);
    }
    
    /** 
     * The inactiveQueues store the keys of all inactive URL queues;
     * Optionally reuse prior data, for use when resuming from a checkpoint
     */
    Queue<String> createInactiveQueueForPrecedence(int precedence, boolean usePriorData) {
        return bdb.getStoredQueue("inactiveQueues-"+precedence, String.class, usePriorData);
    }
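To make the bookkeeping concrete, here is a minimal in-memory sketch of the per-precedence inactive queues. It assumes ConcurrentLinkedQueue in place of the BDB-backed stored queue and an initial highestPrecedenceWaiting of Integer.MAX_VALUE; nextToActivate() is a hypothetical helper that only shows why a sorted map is convenient, and it is not Heritrix's activation logic.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListMap;

class InactiveQueuesSketch {
    // precedence -> keys of inactive queues at that precedence, sorted by precedence
    private final ConcurrentSkipListMap<Integer, Queue<String>> inactiveByPrecedence =
            new ConcurrentSkipListMap<>();

    // smallest precedence value seen among waiting queues (assumed initial value)
    private int highestPrecedenceWaiting = Integer.MAX_VALUE;

    /** Get-or-create the bucket for the precedence, then record the key. */
    synchronized void deactivate(String classKey, int precedence) {
        inactiveByPrecedence
                .computeIfAbsent(precedence, p -> new ConcurrentLinkedQueue<>())
                .add(classKey);
        if (precedence < highestPrecedenceWaiting) {
            highestPrecedenceWaiting = precedence;
        }
    }

    /** Hypothetical helper: take a key from the best (smallest) non-empty precedence. */
    synchronized String nextToActivate() {
        for (Map.Entry<Integer, Queue<String>> entry : inactiveByPrecedence.entrySet()) {
            String key = entry.getValue().poll();
            if (key != null) {
                return key;
            }
        }
        return null;
    }

    synchronized int getHighestPrecedenceWaiting() {
        return highestPrecedenceWaiting;
    }

    public static void main(String[] args) {
        InactiveQueuesSketch sketch = new InactiveQueuesSketch();
        sketch.deactivate("b.example", 5);
        sketch.deactivate("a.example", 1);
        sketch.deactivate("c.example", 5);
        System.out.println(sketch.getHighestPrecedenceWaiting());
        System.out.println(sketch.nextToActivate());
        System.out.println(sketch.nextToActivate());
    }
}
```

Because the map is sorted by key, iterating it visits the smallest precedence value first, and within one precedence the keys leave in the order they were added.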

---------------------------------------------------------------------------

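This Heritrix 3.1.0 Source Code Analysis series is my own original work

Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/21/3033437.html

Original address: https://www.cnblogs.com/chenying99/p/3033437.html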