Understanding encoding and decoding in Tomcat

http://www.xuebuyuan.com/1287083.html

***********************************

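Garbled-text (mojibake) problems are a frequent headache. They are hard to pin down because so many layers are involved: the problem can arise in the container, in the MVC framework, in the database, or in the response.

This article analyzes how Tomcat decodes an incoming request.

Taking "http://localhost:8080/测试?网络=编程" as an example, decoding in Tomcat can be broken down into the following parts:

1. pathInfo, i.e. the "测试" part

2. queryParameter, i.e. the "网络=编程" part

3. http header, i.e. the HTTP headers sent by the browser

4. requestBody, i.e. the HTTP body, such as the body of a POST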
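To make the four parts concrete, here is a minimal sketch of what the path and query of the example URL might look like on the wire. It assumes the client percent-encodes as UTF-8 (browsers can differ), and the class and method names are made up for illustration. Note that URLEncoder implements form encoding, which is close enough for these characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RequestPartsDemo {
    // Percent-encode one URL component as UTF-8 (an assumption about the client).
    static String encodeUtf8(String component) {
        try {
            return URLEncoder.encode(component, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String path = encodeUtf8("\u6d4b\u8bd5");   // "测试", the pathInfo part
        String name = encodeUtf8("\u7f51\u7edc");   // "网络", the parameter name
        String value = encodeUtf8("\u7f16\u7a0b");  // "编程", the parameter value
        // The request line the server would have to decode.
        System.out.println("GET /" + path + "?" + name + "=" + value + " HTTP/1.1");
    }
}
```

Tomcat only sees these bytes; which charset it uses to turn them back into characters is what the rest of the analysis is about.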

1. pathInfo. The process method of Http11Processor calls InternalInputBuffer to parse the request URL (inputBuffer.parseRequestLine) and the request headers (inputBuffer.parseHeaders), but no decoding happens here.

public void process(Socket theSocket)
        throws IOException {
        ...
                inputBuffer.parseRequestLine();
                request.setStartTime(System.currentTimeMillis());
                keptAlive = true;
                if (disableUploadTimeout) {
                    socket.setSoTimeout(soTimeout);
                } else {
                    socket.setSoTimeout(timeout);
                }
                // Set this every time in case limit has been changed via JMX
                request.getMimeHeaders().setLimit(endpoint.getMaxHeaderCount());
                inputBuffer.parseHeaders();
        ...
    }

The actual decoding happens in convertURI of CoyoteAdapter:

protected void convertURI(MessageBytes uri, Request request) 
        throws Exception {

        ByteChunk bc = uri.getByteChunk();
        int length = bc.getLength();
        CharChunk cc = uri.getCharChunk();
        cc.allocate(length, -1);

        String enc = connector.getURIEncoding();
        if (enc != null) {
            B2CConverter conv = request.getURIConverter();
            try {
                if (conv == null) {
                    conv = new B2CConverter(enc);
                    request.setURIConverter(conv);
                }
            } catch (IOException e) {
                // Ignore
                log.error("Invalid URI encoding; using HTTP default");
                connector.setURIEncoding(null);
            }
            if (conv != null) {
                try {
                    conv.convert(bc, cc);
                    uri.setChars(cc.getBuffer(), cc.getStart(), 
                                 cc.getLength());
                    return;
                } catch (IOException e) {
                    log.error("Invalid URI character encoding; trying ascii");
                    cc.recycle();
                }
            }
        }

        // Default encoding: fast conversion
        byte[] bbuf = bc.getBuffer();
        char[] cbuf = cc.getBuffer();
        int start = bc.getStart();
        for (int i = 0; i < length; i++) {
            cbuf[i] = (char) (bbuf[i + start] & 0xff);
        }
        uri.setChars(cbuf, 0, length);

    }

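The decoding here uses the connector's URIEncoding, so pathInfo decoding can be changed by setting URIEncoding on the Connector in server.xml. If URIEncoding is not set, or the conversion fails, the code falls through to the byte-to-char fast path, which is equivalent to decoding as ISO-8859-1.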
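The effect of the fast-conversion fallback in convertURI can be reproduced outside Tomcat. The sketch below copies the byte-to-char loop into a hypothetical fastConvert helper and compares it with the JDK's ISO-8859-1 and UTF-8 decoders. It assumes the URI bytes have already been percent-decoded and were sent as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class UriDecodeDemo {
    // Same idea as the fallback loop in convertURI: one byte becomes one char.
    static String fastConvert(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xff);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String original = "\u6d4b\u8bd5"; // "测试"
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // 6 bytes

        String fallback = fastConvert(wire);
        String asLatin1 = new String(wire, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(wire, StandardCharsets.UTF_8);

        // The fallback produces the same string as an ISO-8859-1 decode.
        System.out.println("fallback == ISO-8859-1: " + fallback.equals(asLatin1));
        // Only a UTF-8 decode gives back the original two characters.
        System.out.println("UTF-8 round trip ok:    " + asUtf8.equals(original));
        System.out.println("fallback length:        " + fallback.length());
    }
}
```

With the fallback, the two Chinese characters turn into six unrelated Latin-1 characters, which is the typical symptom of a missing or wrong URIEncoding.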

2. queryParameter. Several settings influence this part. First, let us find where query parameters are decoded: a call to request.getParameter eventually reaches the handleQueryParameters method of Coyote's Parameters class, where queryStringEncoding is used.

public void handleQueryParameters() {
        if( didQueryParameters ) return;

        didQueryParameters=true;

        if( queryMB==null || queryMB.isNull() )
            return;
        
        if(log.isDebugEnabled()) {
            log.debug("Decoding query " + decodedQuery + " " +
                    queryStringEncoding);
        }

        try {
            decodedQuery.duplicate( queryMB );
        } catch (IOException e) {
            // Can't happen, as decodedQuery can't overflow
            e.printStackTrace();
        }
        processParameters( decodedQuery, queryStringEncoding );
    }

What determines queryStringEncoding? It is set in several places. The first is the service method of CoyoteAdapter and another is FormAuthenticator; both use connector.getURIEncoding().

public void service(org.apache.coyote.Request req, 
    	                org.apache.coyote.Response res)
        throws Exception {

        if (request == null) {

            ...

            // Set query string encoding
            req.getParameters().setQueryStringEncoding
                (connector.getURIEncoding());
	}
}

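So far this is the same as pathInfo, but that is not the whole story: one more place makes things confusing. request.getParameter actually calls parseParameters first and only then handleQueryParameters, and parseParameters is the third place that sets queryStringEncoding. getCharacterEncoding first looks for the charEncoding set on the request, then for the charset in the Content-Type request header, and returns null if neither is found. As the parseParameters code below shows, queryStringEncoding is touched only when useBodyEncodingForURI=true is set in server.xml. In that case, if an encoding was found (e.g. UTF-8), queryStringEncoding follows it; if none was found, queryStringEncoding becomes the default encoding, ISO-8859-1, overriding URIEncoding.

In summary, queryStringEncoding is resolved as follows. Without useBodyEncodingForURI, it follows URIEncoding. With useBodyEncodingForURI=true and an encoding available from the request or its Content-Type, it follows that encoding. With useBodyEncodingForURI=true and no such encoding, it is the default ISO-8859-1.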

protected void parseParameters() {

        parametersParsed = true;

        Parameters parameters = coyoteRequest.getParameters();
        // Set this every time in case limit has been changed via JMX
        parameters.setLimit(getConnector().getMaxParameterCount());

        // getCharacterEncoding() may have been overridden to search for
        // hidden form field containing request encoding
        String enc = getCharacterEncoding();

        boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
        if (enc != null) {
            parameters.setEncoding(enc);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding(enc);
            }
        } else {
            parameters.setEncoding
                (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding
                    (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            }
        }

}
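The branches of parseParameters can be condensed into two small functions. This is only a sketch of the decision logic as read from the snippets in this article, not Tomcat code: the class, the method names, and the treatment of an unset URIEncoding as ISO-8859-1 are assumptions made for illustration.

```java
public class ParameterEncodingSketch {
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Encoding applied to query-string parameters.
    // requestEncoding is what getCharacterEncoding() would return (may be null).
    static String queryEncoding(String uriEncoding, String requestEncoding,
                                boolean useBodyEncodingForURI) {
        if (useBodyEncodingForURI) {
            // parseParameters overrides whatever service() set earlier.
            return requestEncoding != null ? requestEncoding : DEFAULT_ENCODING;
        }
        // Otherwise the value set from connector.getURIEncoding() stays in effect.
        // Assumption: an unset URIEncoding behaves like the default.
        return uriEncoding != null ? uriEncoding : DEFAULT_ENCODING;
    }

    // Encoding applied to parameters read from the request body.
    static String bodyEncoding(String requestEncoding) {
        return requestEncoding != null ? requestEncoding : DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        System.out.println(queryEncoding("UTF-8", null, false));
        System.out.println(queryEncoding("UTF-8", null, true));
        System.out.println(queryEncoding("UTF-8", "GBK", true));
        System.out.println(bodyEncoding(null));
    }
}
```

The second call in main is the surprising case: URIEncoding is UTF-8, yet turning on useBodyEncodingForURI without sending a charset makes the query string fall back to ISO-8859-1.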

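3. http header. Headers are parsed in parseHeaders of InternalInputBuffer, which eventually calls toStringInternal of ByteChunk. That method uses DEFAULT_CHARSET, which is ISO-8859-1, so the charset used to decode HTTP headers cannot be changed through configuration.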

public String toStringInternal() {
        if (charset == null) {
            charset = DEFAULT_CHARSET;
        }
        // new String(byte[], int, int, Charset) takes a defensive copy of the
        // entire byte array. This is expensive if only a small subset of the
        // bytes will be used. The code below is from Apache Harmony.
        CharBuffer cb;
        cb = charset.decode(ByteBuffer.wrap(buff, start, end-start));
        return new String(cb.array(), cb.arrayOffset(), cb.length());
    }
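Because ISO-8859-1 maps every byte to exactly one char, a header value decoded this way still carries the original bytes. If a client is known to send UTF-8 in a header (an assumption that must hold for the specific client), application code can re-interpret the value. A minimal sketch with a hypothetical helper name:

```java
import java.nio.charset.StandardCharsets;

public class HeaderRecodeDemo {
    // Take a value that was decoded as ISO-8859-1 and re-read its bytes as UTF-8.
    static String recodeAsUtf8(String headerValue) {
        byte[] rawBytes = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u7f51\u7edc"; // "网络", what the client meant to send
        // What the application would see after the container decodes as ISO-8859-1.
        String seen = new String(sent.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.ISO_8859_1);
        System.out.println("garbled:   " + !seen.equals(sent));
        System.out.println("recovered: " + recodeAsUtf8(seen).equals(sent));
    }
}
```

This trick only works when the wrong decoding step was ISO-8859-1; other charsets may drop or replace bytes and cannot be reversed this way.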

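4. requestBody. POST parameters go through the same parameters object seen above in the queryStringEncoding analysis (parameters.setEncoding in parseParameters). In other words, for a POST request the encoding taken from the request or its Content-Type has priority, and otherwise the default ISO-8859-1 is used.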
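The difference between the two outcomes can be simulated with the JDK's URLDecoder, which decodes form-encoded values in much the same way. The sketch assumes a form value posted as UTF-8; the class and method names are hypothetical, and this is not how Tomcat itself is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormBodyDemo {
    // Decode one form-encoded value with the given charset name.
    static String decodeValue(String encodedValue, String charsetName) {
        try {
            return URLDecoder.decode(encodedValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "\u7f16\u7a0b"; // "编程"
        String posted = URLEncoder.encode(value, "UTF-8"); // as sent in the body

        // Content-Type carries charset=UTF-8: the value is restored.
        System.out.println(decodeValue(posted, "UTF-8").equals(value));
        // No charset known, default ISO-8859-1: the value is garbled.
        System.out.println(decodeValue(posted, "ISO-8859-1").equals(value));
    }
}
```

In a servlet application the usual way to get the first outcome is to call request.setCharacterEncoding before any parameter is read, since getCharacterEncoding consults that value first.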

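That largely completes the analysis of encoding in Tomcat. Encoding problems touch many other places, though. For the database, you can change the database encoding or specify the encoding on the JDBC connection. Some frameworks have their own behavior: ResponseBody in Spring MVC, for example, hard-codes ISO-8859-1, so you can switch to ResponseEntity or write the output directly with response.getWriter. In short, find out where the problem actually is before applying a fix.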