Heritrix 3.1.0 Source Code Analysis (25)

In Heritrix 3.1.0 Source Code Analysis (23), we looked at how Heritrix 3.1.0 extends the HttpClient component's HttpConnection object and its management interface, HttpConnectionManager.

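The HttpConnection object creates the socket connection, but at that point nothing has been written to the output stream and nothing has been read from the input stream. How does HttpClient implement that part, and how does Heritrix 3.1.0 extend it?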

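When we use HttpClient to request a web page, we create a GetMethod or a PostMethod, depending on whether the request is a GET or a POST. (Other request methods exist as well, but browsers do not currently support them.)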

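These request classes implement a common interface, HttpMethod, which declares the methods every request type must provide. The interface declares quite a few methods; to make it easier to follow, they can be grouped logically into a Request-related part and a Response-related part. The important ones are listed below: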

public interface HttpMethod {
    // ---------------------------------------------------------------- Queries
    // Response-related part
    boolean validate();

    int getStatusCode();
   
    byte[] getResponseBody() throws IOException;

    String getResponseBodyAsString() throws IOException;

    InputStream getResponseBodyAsStream() throws IOException;

    int execute(HttpState state, HttpConnection connection)
        throws HttpException, IOException;

    void releaseConnection();

    boolean getDoAuthentication();

    void setDoAuthentication(boolean doAuthentication);

    public HttpMethodParams getParams();

    public void setParams(final HttpMethodParams params);

    public AuthState getHostAuthState();

    public AuthState getProxyAuthState();

    boolean isRequestSent();
}

When we execute a request, what actually gets called is the execute method of the class implementing this interface.

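The abstract class HttpMethodBase implements this interface. It provides the methods shared by all of its subclasses (that is, by all request methods), mainly the handling of the socket output and input streams. The most important of these is execute: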

/**
     * Executes this method using the specified <code>HttpConnection</code> and
     * <code>HttpState</code>. 
     *
     * @param state {@link HttpState state} information to associate with this
     *        request. Must be non-null.
     * @param conn the {@link HttpConnection connection} to used to execute
     *        this HTTP method. Must be non-null.
     *
     * @return the integer status code if one was obtained, or <tt>-1</tt>
     *
     * @throws IOException if an I/O (transport) error occurs
     * @throws HttpException  if a protocol exception occurs.
     */
    public int execute(HttpState state, HttpConnection conn)
        throws HttpException, IOException {
                
        LOG.trace("enter HttpMethodBase.execute(HttpState, HttpConnection)");

        // this is our connection now, assign it to a local variable so 
        // that it can be released later
        this.responseConnection = conn;

        checkExecuteConditions(state, conn);
        this.statusLine = null;
        this.connectionCloseForced = false;

        conn.setLastResponseInputStream(null);

        // determine the effective protocol version
        if (this.effectiveVersion == null) {
            this.effectiveVersion = this.params.getVersion(); 
        }
        // Socket output stream
        writeRequest(state, conn);
        this.requestSent = true;
        // Socket input stream
        readResponse(state, conn);
        // the method has successfully executed
        used = true; 

        return statusLine.getStatusCode();
    }

In the method above, writeRequest(state, conn) writes the request to the output stream, and readResponse(state, conn) reads the response from the input stream.
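The ordering inside execute can be illustrated with a small, self-contained sketch. Everything here is hypothetical: SketchConnection, SketchMethodBase and SketchGet are stand-ins invented for illustration, not HttpClient classes, and in-memory strings take the place of the real socket streams.

```java
/** Hypothetical stand-in for a connection: records what is written and replays a canned response. */
class SketchConnection {
    final StringBuilder written = new StringBuilder(); // plays the role of the socket output stream
    final String cannedResponse;                        // plays the role of the socket input stream

    SketchConnection(String cannedResponse) {
        this.cannedResponse = cannedResponse;
    }
}

/** Hypothetical base class that mirrors the ordering in execute(): write, mark sent, read, mark used. */
abstract class SketchMethodBase {
    private final String path;
    private boolean requestSent = false;
    private boolean used = false;
    private int statusCode = -1;

    SketchMethodBase(String path) {
        this.path = path;
    }

    /** Name of the request method, e.g. "GET". */
    abstract String getName();

    final int execute(SketchConnection conn) {
        writeRequest(conn);   // output side: assemble and send the request
        requestSent = true;
        readResponse(conn);   // input side: parse the response
        used = true;
        return statusCode;
    }

    void writeRequest(SketchConnection conn) {
        // Request line followed by an empty header section.
        conn.written.append(getName()).append(' ').append(path).append(" HTTP/1.0\r\n\r\n");
    }

    void readResponse(SketchConnection conn) {
        // Only the status line is parsed here, e.g. "HTTP/1.0 200 OK".
        String statusLine = conn.cannedResponse.split("\r\n", 2)[0];
        statusCode = Integer.parseInt(statusLine.split(" ")[1]);
    }

    boolean isRequestSent() {
        return requestSent;
    }

    boolean hasBeenUsed() {
        return used;
    }
}

/** Hypothetical GET method: only contributes its name. */
class SketchGet extends SketchMethodBase {
    SketchGet(String path) {
        super(path);
    }

    @Override
    String getName() {
        return "GET";
    }
}
```

Calling `new SketchGet("/index.html").execute(conn)` first appends the request text to `conn.written` and only then parses the status line, which is the same write-then-read order as the real method.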

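Writing the request in writeRequest(state, conn) is essentially a matter of assembling data. This is the entry point where Heritrix 3.1.0 hooks in: it rewrites the HttpMethodBase class to add its own logic, including how cookies and form parameters are written. (I will come back to this part when analyzing Heritrix 3.1.0's custom cookie and form handling.)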

Besides the shared logic described above, writeRequest also calls boolean writeRequestBody(HttpState state, HttpConnection conn), which is usually implemented by subclasses.
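The split between shared request writing and a subclass-supplied body can be sketched as follows. This is a minimal illustration under stated assumptions: BodyHookMethodBase, BodyHookGet and BodyHookPost are hypothetical names, a StringBuilder stands in for the socket output stream, and the addRequestHeaders hook is an invention of this sketch rather than a description of HttpClient's internals.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical base: writes the request line and headers, then delegates the body to subclasses. */
abstract class BodyHookMethodBase {
    private final String path;

    BodyHookMethodBase(String path) {
        this.path = path;
    }

    abstract String getName();

    /** Hook for subclass-specific headers (an assumption of this sketch). Default: none. */
    void addRequestHeaders(StringBuilder out) {
    }

    /** Hook for the body. Default: nothing to write; returns true to signal the body is complete. */
    boolean writeRequestBody(StringBuilder out) {
        return true;
    }

    /** Shared logic: request line, headers, blank line, then the subclass body hook. */
    final void writeRequest(StringBuilder out) {
        out.append(getName()).append(' ').append(path).append(" HTTP/1.0\r\n");
        addRequestHeaders(out);
        out.append("\r\n");
        writeRequestBody(out);
    }
}

/** Hypothetical GET: relies entirely on the defaults, so no body is written. */
class BodyHookGet extends BodyHookMethodBase {
    BodyHookGet(String path) {
        super(path);
    }

    @Override
    String getName() {
        return "GET";
    }
}

/** Hypothetical POST: supplies form data as the body plus the headers describing it. */
class BodyHookPost extends BodyHookMethodBase {
    private final String form;

    BodyHookPost(String path, String form) {
        super(path);
        this.form = form;
    }

    @Override
    String getName() {
        return "POST";
    }

    @Override
    void addRequestHeaders(StringBuilder out) {
        int length = form.getBytes(StandardCharsets.US_ASCII).length;
        out.append("Content-Type: application/x-www-form-urlencoded\r\n");
        out.append("Content-Length: ").append(length).append("\r\n");
    }

    @Override
    boolean writeRequestBody(StringBuilder out) {
        out.append(form);
        return true;
    }
}
```

With this layout the base class never needs to know whether a request carries a body; the GET and POST variants differ only in the hooks they override.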

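The subclasses of HttpMethodBase supply the implementations specific to each request method. Here I only analyze the two classes that Heritrix 3.1.0 defines itself, HttpRecorderGetMethod and HttpRecorderPostMethod.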

public class HttpRecorderGetMethod extends GetMethod {
    
    protected static Logger logger =
        Logger.getLogger(HttpRecorderGetMethod.class.getName());
    
    /**
     * Instance of http recorder method.
     */
    protected HttpRecorderMethod httpRecorderMethod = null;
    

    public HttpRecorderGetMethod(String uri, Recorder recorder) {
        super(uri);
        this.httpRecorderMethod = new HttpRecorderMethod(recorder);
    }

    protected void readResponseBody(HttpState state, HttpConnection connection)
    throws IOException, HttpException {
        // We're about to read the body.  Mark transition in http recorder.
        this.httpRecorderMethod.markContentBegin(connection);
        super.readResponseBody(state, connection);
    }

    protected boolean shouldCloseConnection(HttpConnection conn) {
        // Always close connection after each request. As best I can tell, this
        // is superfluous -- we've set our client to be HTTP/1.0.  Doing this
        // out of paranoia.
        return true;
    }

    public int execute(HttpState state, HttpConnection conn)
    throws HttpException, IOException {
        // Save off the connection so we can close it on our way out in case
        // httpclient fails to (We're not supposed to have access to the
        // underlying connection object; am only violating contract because
        // see cases where httpclient is skipping out w/o cleaning up
        // after itself).
        this.httpRecorderMethod.setConnection(conn);
        return super.execute(state, conn);
    }
    
    protected void addProxyConnectionHeader(HttpState state, HttpConnection conn)
            throws IOException, HttpException {
        super.addProxyConnectionHeader(state, conn);
        this.httpRecorderMethod.handleAddProxyConnectionHeader(this);
    }
}

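Besides the URL string, the constructor takes a Recorder recorder object, which is used to initialize the member HttpRecorderMethod httpRecorderMethod. That object has two members of its own: a Recorder httpRecorder and an HttpConnection connection. Each overridden method in HttpRecorderGetMethod calls the parent method of the same name and, in addition, the corresponding method on httpRecorderMethod. Those helper methods either store the HttpConnection in the connection member or call back into the Recorder httpRecorder object (preparation for reading the input stream).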
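The "call the helper, then call super" pattern can be sketched in isolation. All classes below are hypothetical stand-ins (SketchRecorder for Recorder, SketchRecorderMethod for HttpRecorderMethod, SketchParentGet for GetMethod), and a plain Object takes the place of HttpConnection; the real Heritrix classes do more than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder: only logs the events it is told about. */
class SketchRecorder {
    final List<String> events = new ArrayList<>();

    void markContentBegin() {
        events.add("content-begin");
    }
}

/** Hypothetical helper with the two members described above: a recorder and a connection. */
class SketchRecorderMethod {
    private final SketchRecorder recorder;
    private Object connection;

    SketchRecorderMethod(SketchRecorder recorder) {
        this.recorder = recorder;
    }

    void setConnection(Object connection) {
        this.connection = connection;
    }

    Object getConnection() {
        return connection;
    }

    /** Calls back into the recorder just before the body is read. */
    void markContentBegin(Object connection) {
        recorder.markContentBegin();
    }
}

/** Hypothetical parent standing in for GetMethod; it logs into a trace list to expose call order. */
class SketchParentGet {
    private final List<String> trace;

    SketchParentGet(List<String> trace) {
        this.trace = trace;
    }

    int execute(Object conn) {
        readResponseBody(conn);
        return 200; // fixed status, for illustration only
    }

    void readResponseBody(Object conn) {
        trace.add("parent-read-body");
    }
}

/** Hypothetical recording subclass: each override calls the helper, then the parent method. */
class SketchRecordingGet extends SketchParentGet {
    final SketchRecorderMethod helper;

    SketchRecordingGet(SketchRecorder recorder) {
        // The recorder's event list doubles as the parent's trace so the order is visible in one place.
        super(recorder.events);
        this.helper = new SketchRecorderMethod(recorder);
    }

    @Override
    int execute(Object conn) {
        helper.setConnection(conn); // remember the connection for later cleanup
        return super.execute(conn);
    }

    @Override
    void readResponseBody(Object conn) {
        helper.markContentBegin(conn); // mark the header/body transition first
        super.readResponseBody(conn);
    }
}
```

After `execute` returns, the recorder's event list shows the content-begin mark ahead of the parent's body read, and the helper still holds the connection that was passed in.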

HttpRecorderPostMethod extends PostMethod, and its logic closely mirrors that of HttpRecorderGetMethod, so I will not go through it here.

---------------------------------------------------------------------------

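This Heritrix 3.1.0 Source Code Analysis series is my own original work.

When reposting, please credit the source: cnblogs (博客园), 刺猬的温驯

Article link: http://www.cnblogs.com/chenying99/archive/2013/04/28/3048387.html

Original address: https://www.cnblogs.com/chenying99/p/3048387.html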