Encoding problems in Java

An encoding problem recently cost me some time, so I am writing it down here.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * 
 *Module:          ZipStrUtil.java
 *Description:    Compresses and decompresses strings
 *Company:       
 *Author:           vigar
 *Date:             May 6, 2012
 */
public class ZipStrUtil {
    public static void main(String[] args) throws IOException {
        // Input string; compression only pays off once the string exceeds a certain length
        String str = "00aa美女a";
        System.out.println("\nOriginal string ------->" + str);

        String ys = compress(str);
        System.out.println("Compressed string ----->" + ys);
        String jy = unCompress(ys);

        /* byte[] midArray=ys.getBytes();
           String midStr=new String(midArray);

           unCompress(midStr);*/

        System.out.println("\nDecompressed string --->" + jy);
        System.out.println("Decompressed string length --->"+jy.length());
        outputFormat(jy);
        // Check the round trip
        if(str.equals(jy)){
            System.out.println("After compressing and then decompressing, the string is identical to the original");
        }
    }

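    /**
     * Compresses a string.
     * @param str
     *            the string to compress
     * @return    the compressed bytes, carried in a String via ISO-8859-1
     * @throws IOException
     */
    public static String compress(String str) throws IOException {
        if (null == str || str.length() <= 0) {
            return str;
        }
        // Collects the compressed bytes in memory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // GZIP output stream with the default buffer size
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        // Encode with an explicit charset (it must match the one unCompress decodes with) and write all bytes
        gzip.write(str.getBytes("GBK"));
        gzip.close();
        // ISO-8859-1 maps every byte value to exactly one char, so the compressed bytes survive the conversion
        return out.toString("ISO-8859-1");
    }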
    
    /**
     * Decompresses a string.
     * @param str
     *            the compressed string produced by compress
     * @return    the decompressed string
     * @throws IOException
     */
    public static String unCompress(String str) throws IOException {
        if (null == str || str.length() <= 0) {
            return str;
        }
        // Collects the decompressed bytes in memory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Recover the compressed bytes with the same charset that compress used to build the string
        ByteArrayInputStream in = new ByteArrayInputStream(str
                .getBytes("ISO-8859-1"));
        // GZIP input stream with the default buffer size
        GZIPInputStream gzip = new GZIPInputStream(in);
        byte[] buffer = new byte[256];
        int n = 0;
        while ((n = gzip.read(buffer)) >= 0) {// read decompressed data into the buffer
            // append the n bytes just read to the in-memory output stream
            out.write(buffer, 0, n);
        }
        // Decode the decompressed bytes into a string using the named charset
        return out.toString("gbk");
    }
    
    
    /**
     * Prints the contents of the byte[] behind a string. Written to
     * investigate an anomaly in data sent through MQ.
     * @param str
     *            the string whose bytes are printed
     */
    public static void outputFormat(String str)
    {
        byte[] origStr=str.getBytes();
        for(int i =0;i<origStr.length;i++)
        {
            System.out.print(origStr[i]+" ");
        }
        System.out.println("   end");       
    }
}

This runs correctly.

Enabling the commented-out block in main, however, produces the following error:

Exception in thread "main" java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at com.boco.fmhandler.wl.adapter.ZipStrUtil.unCompress(ZipStrUtil.java:82)
at com.boco.fmhandler.wl.adapter.ZipStrUtil.main(ZipStrUtil.java:29)
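
The failure can be reproduced without any compression at all. The sketch below is not the original program: it takes only the first three bytes of a GZIP stream (the magic number 0x1f 0x8b followed by 0x08 for deflate) and sends them on the same detour as the commented-out block. US-ASCII is used here as a stand-in for a default charset that cannot represent the character U+008B, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GzipHeaderRoundTrip {
    // First three bytes of a GZIP stream: magic number 0x1f 0x8b, then 0x08 (deflate)
    static final byte[] HEADER = {0x1f, (byte) 0x8b, 0x08};

    // Mimics the commented-out block: String -> bytes -> String through one charset,
    // then back to bytes through ISO-8859-1, as unCompress does
    static byte[] detour(byte[] compressed, Charset viaCharset) {
        String ys = new String(compressed, StandardCharsets.ISO_8859_1); // what compress returns
        byte[] midArray = ys.getBytes(viaCharset);
        String midStr = new String(midArray, viaCharset);
        return midStr.getBytes(StandardCharsets.ISO_8859_1);             // what unCompress reads
    }

    public static void main(String[] args) {
        byte[] safe = detour(HEADER, StandardCharsets.ISO_8859_1);
        byte[] lossy = detour(HEADER, StandardCharsets.US_ASCII); // stand-in for an unsuitable default
        System.out.println(Arrays.toString(safe));  // prints [31, -117, 8]
        System.out.println(Arrays.toString(lossy)); // prints [31, 63, 8]
    }
}
```

In the lossy case the second magic byte has turned into 63, the code of '?', so GZIPInputStream no longer recognises the header.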

Changing the code as follows makes it run correctly:

   byte[] midArray=ys.getBytes("ISO-8859-1");
   String midStr=new String(midArray,"ISO-8859-1");
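
Carrying compressed bytes inside an ISO-8859-1 string only works as long as every step along the way uses that same charset. A different technique, which the original code does not use, is to Base64-encode the compressed bytes so that the resulting string is plain ASCII. The following is a minimal sketch under these assumptions: Java 8 or later (for java.util.Base64), UTF-8 for the text itself, and hypothetical class and method names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class Base64ZipSketch {
    // GZIP the UTF-8 bytes of the text, then Base64-encode the compressed bytes
    static String compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    // Base64-decode, gunzip, and decode the result as UTF-8
    static String unCompress(String encoded) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = gzip.read(buffer)) >= 0) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String str = "00aa\u7f8e\u5973a"; // same text as the original test string
        String ys = compress(str);
        // The kind of detour that broke the ISO-8859-1 version is harmless for ASCII-only text
        String midStr = new String(ys.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
        System.out.println(str.equals(unCompress(midStr))); // prints true
    }
}
```

The trade-off is size: Base64 output is about one third larger than the compressed bytes it encodes.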

Digging deeper:

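It is well known that String.getBytes() returns the bytes of a string. What is easy to overlook is that the no-argument version encodes with the platform's default charset, which the JVM takes from the operating system locale (JDK 18 and later default to UTF-8 instead). If you ignore this, a system that runs fine on one platform can fail in unexpected ways once it is moved to another machine.
On a Chinese-locale operating system, getBytes() returns bytes in GBK or GB2312, where each Chinese character takes two bytes. On an English-locale platform the default is typically ISO-8859-1, where every character is encoded as a single byte; characters the charset cannot represent, such as Chinese ones, are replaced with '?'.
Java supports many encodings, but inside the JVM strings are always Unicode (a sequence of UTF-16 code units), whatever the platform encoding is; you can confirm this by decompiling a class file.
So, to avoid this kind of problem, always call String.getBytes(String charset) in your code and state explicitly which encoding you want.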
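
To find out which default a given machine is actually using, the JVM can simply be asked. The sketch below is illustrative only (the class and method names are hypothetical); it prints the default charset and checks that the no-argument getBytes() yields exactly the bytes obtained by naming that default explicitly. No expected output is shown because it depends on the environment.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetCheck {
    // True when the no-argument getBytes() yields the same bytes as naming the default charset explicitly
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String s = "Hello!\u4f60\u597d\uff01"; // "Hello!" followed by three full-width characters
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("getBytes() length: " + s.getBytes().length);
        System.out.println("same as getBytes(defaultCharset())? " + usesDefaultCharset(s));
    }
}
```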
The small example below makes the point concrete.

public class TestCharset
{
    public static void main(String[] args)
    {
        new TestCharset().execute();
    }

    private void execute()
    {
        try
        {
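            String s = "Hello!你好！";
            byte[] bytes = s.getBytes();
            // Print the string itself as well, matching the output shown below
            System.out.println("String: " + s);
            System.out.println("bytes lenght is:" + bytes.length);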
        } catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Running on Windows:

String: Hello!你好！
bytes lenght is:12

Running on Linux, where the result clearly depends on the LANG environment variable:

[sg@101/udp ~]$ javac -encoding GBK TestCharset.java 
[sg@101/udp ~]$ export LANG=C;
[sg@101/udp ~]$ java TestCharset
String: Hello!???
bytes lenght is:9
[sg@101/udp ~]$ export LANG=zh_CN;
[sg@101/udp ~]$ java TestCharset
String: Hello!你好！
bytes lenght is:12

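After changing getBytes() to byte[] bytes = s.getBytes("GBK"); the results are as follows. The byte count no longer depends on the environment variable: the byte array always holds the same content, encoded in the specified charset. (Under LANG=C the string is still printed as ???, because console output continues to go through the default charset.)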

[sg@101/udp ~]$ javac -encoding GBK TestCharset.java 
[sg@101/udp ~]$ export LANG=C;
[sg@101/udp ~]$ java TestCharset
String: Hello!???
bytes lenght is:12
[sg@101/udp ~]$ export LANG=zh_CN;
[sg@101/udp ~]$ java TestCharset
String: Hello!你好！
bytes lenght is:12
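
The same effect can be had without charset-name strings. The sketch below assumes Java 7 or later and uses the StandardCharsets constants together with getBytes(Charset), which rules out misspelled charset names and throws no checked exception. The class and method names are hypothetical, and the string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetLengths {
    static final String S = "Hello!\u4f60\u597d\uff01"; // "Hello!" followed by three full-width characters

    // Number of bytes S occupies in the given charset, independent of the platform default
    static int length(Charset cs) {
        return S.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + length(StandardCharsets.UTF_8));      // 6 + 3 * 3 = 15
        System.out.println("UTF-16BE:   " + length(StandardCharsets.UTF_16BE));   // 9 * 2 = 18
        System.out.println("ISO-8859-1: " + length(StandardCharsets.ISO_8859_1)); // 9, the three full-width characters become '?'
        // Decoding with the same charset that was used for encoding restores the string
        String back = new String(S.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(S.equals(back)); // prints true
    }
}
```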

Reference article

http://www.360doc.com/content/08/1015/09/61497_1765862.shtml