Demystifying How HttpClient Handles the Character Encoding of Fetched URLs!

Oracle Linux Web - 51CTO技术博客 (51CTO Technical Blog)
2011-08-04 21:11:47
Tags: URL, HttpClient
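HttpClient is a small project under the Apache foundation's Jakarta Commons project. It wraps the functionality needed to download content from remote addresses; the latest version at the time of writing is 3.0. Project page: http://jakarta.apache.org/commons/httpclient
I recently used HttpClient while writing a spider and noticed something curious: some URLs serve pages encoded in utf-8 and others in gbk. HttpClient decodes most of them correctly, but for some pages it does not, and the fetched text comes back garbled. The method I was calling is httpMethod.getResponseBodyAsString();.
Analyzing further, I found that it decodes the body correctly whenever the HTTP response headers contain a charset declaration. For example: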
HTTP/1.1 200 OK
Connection: close
Content-Type: text/html; charset=utf-8
Set-Cookie: _session_id=066875c3c0530c06c0204b96db403560; domain=javaeye.com; path=/
Vary: Accept-Encoding
Cache-Control: no-cache
Content-Encoding: gzip
Content-Length: 8512
Date: Fri, 16 Mar 2007 09:02:52 GMT
Server: lighttpd/1.4.13
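When the headers carry no charset declaration, the result is garbled. The documentation shows that the encoding can be specified explicitly, e.g. HttpClientParams.setContentCharset("gbk");, and once it is set, pages in that encoding are decoded correctly. That is where the problem starts: encodings differ from URL to URL, so a spider cannot hard-code one, and you only find out whether the encoding was right after the page has been fetched. So I studied the HttpClient source code more closely and found the order in which it picks a charset: first the charset in the HTTP headers; if the headers have none, the contentCharset of HttpClientParams; and if no encoding has been set there either, ISO-8859-1.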
/**    
   * Returns the character set from the Content-Type header.    
   *     
   * @param contentheader The content header.    
   * @return String The character set.    
   */    
protected String getContentCharSet(Header contentheader) {     
      LOG.trace("enter getContentCharSet( Header contentheader )");     
      String charset = null;     
      if (contentheader != null) {     
          HeaderElement values[] = contentheader.getElements();     
          // I expect only one header element to be there     
          // No more. no less     
          if (values.length == 1) {     
              NameValuePair param = values[0].getParameterByName("charset");     
              if (param != null) {     
                  // If I get anything "funny"      
                  // UnsupportedEncondingException will result     
                  charset = param.getValue();     
              }     
          }     
      }     
      if (charset == null) {     
          charset = getParams().getContentCharset();     
          if (LOG.isDebugEnabled()) {     
              LOG.debug("Default charset used: " + charset);     
          }     
      }     
      return charset;     
}     
    
    
/**      
* Returns the default charset to be used for writing content body,       
* when no charset explicitly specified.      
* @return The charset      
*/       
public String getContentCharset() {        
     String charset = (String) getParameter(HTTP_CONTENT_CHARSET);        
     if (charset == null) {        
         LOG.warn("Default content charset not configured, using ISO-8859-1");        
         charset = "ISO-8859-1";        
     }        
     return charset;        
}       
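To see the lookup order in one place, here is a minimal sketch in plain Java that mirrors the two methods above without depending on HttpClient. The class and method names (CharsetResolution, charsetFromContentType, resolve) are hypothetical and exist only for illustration, and the header parsing is deliberately simpler than the library's HeaderElement handling.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of HttpClient. */
final class CharsetResolution {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = part.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Header charset first, then the configured default, then ISO-8859-1. */
    static String resolve(String contentType, String configuredDefault) {
        String charset = charsetFromContentType(contentType);
        if (charset == null) {
            charset = configuredDefault;
        }
        if (charset == null) {
            charset = "ISO-8859-1";
        }
        return charset;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=utf-8", null));
        System.out.println(resolve("text/html", "gbk"));
        System.out.println(resolve("text/html", null));
    }
}
```

The three calls in main correspond to the three cases described earlier: a header that declares a charset, a header without one but with a configured default, and neither.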
This wretched ISO-8859-1 default has tripped up so many people (Tomcat also defaults to ISO-8859-1 when processing submitted data)!!
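Why the ISO-8859-1 fallback produces garbled text can be shown with a few lines of standard Java. This is only an illustrative sketch: MojibakeDemo and its method names are made up, and the sample string is an assumption. Because ISO-8859-1 maps every byte to exactly one character, the original bytes can still be recovered from the garbled string and decoded again once the real charset is known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; the names are for illustration only. */
final class MojibakeDemo {

    /** Decodes a body with the ISO-8859-1 fallback, whatever its real encoding is. */
    static String decodeWithFallback(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes from an ISO-8859-1 decoded string and decodes them again. */
    static String redecode(String garbled, Charset actual) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] utf8Body = original.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeWithFallback(utf8Body);
        String recovered = redecode(garbled, StandardCharsets.UTF_8);

        System.out.println("garbled length: " + garbled.length());
        System.out.println("recovered equals original: " + recovered.equals(original));
    }
}
```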
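After thinking it over, I decided to wrap HttpClient in one more layer. The approach is as follows:
1. Do not set a charset on HttpClientParams up front.
2. After executeMethod, check whether the HTTP response headers contain a charset.
3. If a charset is present, return httpMethod.getResponseBodyAsString();.
4. If no charset is present, first call httpMethod.getResponseBodyAsString(); to get the HTML, then parse the charset from the meta tag in the HTML head, e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
5. Set the charset parsed from the meta tag as the contentCharset of HttpClientParams.
6. Call httpMethod.getResponseBodyAsString() again and return its value.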
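
The steps above could be sketched roughly as follows. Note that this is a variation, not the exact procedure described: instead of updating contentCharset and calling getResponseBodyAsString() a second time, it assumes the raw body bytes (for instance from httpMethod.getResponseBody()) and the Content-Type header value are already in hand, and decodes the bytes directly. CharsetSniffingDecoder and its regular expressions are hypothetical, and a regex is only a rough stand-in for a real HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper logic for illustration; not part of HttpClient. */
final class CharsetSniffingDecoder {

    // charset=... in a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:+-]+)", Pattern.CASE_INSENSITIVE);

    // charset=... anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:+-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first captured charset name, or null if the pattern does not match. */
    static String find(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Maps a charset name to a Charset, using the fallback for null, illegal or unsupported names. */
    static Charset toCharset(String name, Charset fallback) {
        try {
            return (name != null && Charset.isSupported(name)) ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            return fallback; // syntactically illegal charset name
        }
    }

    /** Header charset first; otherwise the meta charset; otherwise ISO-8859-1. */
    static String decode(byte[] body, String contentTypeHeader) {
        String name = find(HEADER_CHARSET, contentTypeHeader);
        if (name == null) {
            // A provisional ISO-8859-1 decode keeps the ASCII markup readable for sniffing.
            String provisional = new String(body, StandardCharsets.ISO_8859_1);
            name = find(META_CHARSET, provisional);
        }
        return new String(body, toCharset(name, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head><body>\u4e2d\u6587</body></html>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        // The header has no charset, so the meta tag decides.
        System.out.println(decode(body, "text/html").equals(html));
    }
}
```

Decoding the bytes directly avoids depending on how the library caches the response body between calls, at the cost of stepping outside getResponseBodyAsString().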

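With this approach in place, the pages fetched from those URLs were never garbled again. Sweet!
Of the steps above, only the fourth is slightly troublesome, though a third-party HTML parser can also be used to extract the charset from the meta tag!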