An Introduction to String Memory Pitfalls

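When String methods are used for text analysis or heavy string processing, they can hurt memory performance, leading to excessive memory usage or even an OutOfMemoryError (OOM).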

I. The memory footprint of a String object

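In general, a Java object is laid out in the virtual machine as follows (the sizes given are typical of a 32-bit JVM):
•Object header: 8 bytes (holds the object's class information, ID, and its state in the virtual machine)
•Java primitive data: fields of types such as int, float, and char
•Reference: 4 bytes
•Padding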

Definition of String:

JDK6:
private final char value[];
private final int offset;
private final int count;
private int hash;

An empty String in JDK6 occupies 40 bytes.

JDK7:
private final char value[];
private int hash;
private transient int hash32;

An empty String in JDK7 also occupies 40 bytes.

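How the memory footprint of a JDK6 String is calculated:
First compute the space taken by an empty char array. In Java an array is also an object, so it has an object header too. An array therefore occupies the object header plus the 4-byte field that stores the array length, that is 8 + 4 = 12 bytes, which becomes 16 bytes after padding.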

So an empty String occupies:

object header (8 bytes) + char array (16 bytes) + 3 ints (3 × 4 = 12 bytes) + 1 reference to the char array (4 bytes) = 40 bytes.

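The space occupied by an actual String can therefore be computed as:

8 * (int) ((8 + 12 + 2*n + 4 + 12 + 7) / 8) = 8 * (int) ((2*n + 43) / 8)

where n is the length of the string. Adding 7 before the integer division by 8 rounds the total up to a multiple of 8 bytes.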
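The formula can be checked with a small sketch. The class and method names below are hypothetical, and the constants assume the 32-bit JDK6 layout described above (8-byte object header, 4-byte references, 8-byte alignment); other JVMs will give different numbers.

```java
// Hypothetical helper: estimates the bytes retained by a JDK6 String
// that owns its own char[] of length n (32-bit layout assumed).
class StringSizeEstimate {
    static long estimate(int n) {
        long raw = 8        // String object header
                 + 12       // offset, count, hash (3 ints)
                 + 4        // reference to the char[]
                 + 12       // char[] header (8) + length field (4)
                 + 2L * n;  // 2 bytes per char
        // Adding 7 before the integer division rounds up to a multiple of 8.
        return 8 * ((raw + 7) / 8);
    }

    public static void main(String[] args) {
        System.out.println(estimate(0));        // 40
        System.out.println(estimate(33475740)); // 66951520, roughly 67 MB
    }
}
```

The second call uses the character count of the test file from the example in section II, which is where the figure of about 67 MB comes from.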

II. An example:

1. substring

package demo;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class TestBigString
{
    private String strsub;
    private String strempty = new String();

    public static void main(String[] args) throws Exception
    {
        TestBigString obj = new TestBigString();
        // Keep only the first character of the large string.
        obj.strsub = obj.readString().substring(0,1);
        // Keep the process alive so that the heap can be inspected.
        Thread.sleep(30*60*1000);
    }

    // Reads the whole file into one String (line terminators are dropped).
    private String readString() throws Exception
    {
        BufferedReader bis = null;
        try
        {
            bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File("d:\\teststring.txt"))));
            StringBuilder sb = new StringBuilder();
            String line = null;
            while((line = bis.readLine()) != null)
            {
                sb.append(line);
            }
            System.out.println(sb.length());
            return sb.toString();
        }
        finally
        {
            if (bis != null)
            {
                bis.close();
            }
        }
    }
}

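The file "d:\teststring.txt" contains 33475740 characters and is 35 MB in size.

Running the code above on JDK6 shows that although strsub is only substring(0,1) and its count is indeed 1, it retains close to 67 MB of memory.

Running the same code on JDK7, however, the strsub object takes only 40 bytes.

Why?

Let's look at the JDK source code:

JDK6: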

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

// Package private constructor which shares value array for speed.
String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}

JDK7:

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > value.length) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    int subLen = endIndex - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return ((beginIndex == 0) && (endIndex == value.length)) ? this
            : new String(value, beginIndex, subLen);
}

public String(char value[], int offset, int count) {
    if (offset < 0) {
        throw new StringIndexOutOfBoundsException(offset);
    }
    if (count < 0) {
        throw new StringIndexOutOfBoundsException(count);
    }
    // Note: offset or count might be near -1>>>1.
    if (offset > value.length - count) {
        throw new StringIndexOutOfBoundsException(offset + count);
    }
    this.value = Arrays.copyOfRange(value, offset, offset+count);
}

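As the source shows, the String returned by String.substring() in JDK6 still holds a reference to the original String's char array (value), so that array cannot be released. This is what causes the unexpectedly high memory consumption.

JDK6 was actually designed this way to save memory: these Strings all reuse the original String's data, and the new String produced by substring is identified only by int values such as offset and count.

In the example above, however, where a few small Strings are cut out of a huge String and kept for later use, this design leaves a large amount of redundant data in memory. The conclusions for operations that extract Strings via String.split() or String.substring() are therefore:

•For applications that extract a small number of strings from a large text, String.substring() wastes a great deal of memory.
•For applications that extract a fair number of strings from ordinary text, where the total length of the extracted strings is close to the length of the original text, the JDK6 design of String.substring() shares the original text and so saves memory.

Since the root cause of the high memory usage is that the results of String.substring() carry the original String's data with them, one way to reduce the waste is to drop that data. For example, call new String again to construct a String that contains only the extracted characters, which can be done with String.toCharArray():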

String newString = new String(smallString.toCharArray());
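As a minimal sketch of this workaround, the helper below copies a substring into a String with its own, exactly sized char array. The class name SubstringCopy and the method compact are hypothetical names chosen for illustration. On JDK7 and later the extra copy is unnecessary, because substring already copies.

```java
// Hypothetical helper illustrating the toCharArray() workaround.
class SubstringCopy {
    // Returns a new String backed by its own char[] of length s.length().
    static String compact(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        // Build a large string to stand in for the file contents.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            sb.append('x');
        }
        String big = sb.toString();

        // Copy the one-character substring so it no longer shares big's array.
        String small = compact(big.substring(0, 1));
        big = null; // the large char[] can now be collected, even on JDK6

        System.out.println(small);
    }
}
```

The cost is one extra small allocation per extracted string, which is usually negligible next to the large buffer that would otherwise stay reachable.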

2. Similarly, consider the split method

public class TestBigString
{
    private String strsub;
    private String strempty = new String();
    private String[] strSplit;

    public static void main(String[] args) throws Exception
    {
        TestBigString obj = new TestBigString();
        obj.strsub = obj.readString().substring(0,1);
        // Split into at most 5 pieces around the literal "Address:".
        obj.strSplit = obj.readString().split("Address:",5);
        Thread.sleep(30*60*1000);
    }

    // readString() is the same as in the substring example above.
}

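On JDK6, every String element of the resulting array retains as much memory as the original string (67 MB).

On JDK7, every String element of the resulting array takes only the memory it actually needs.

The reason:

JDK6 source code: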

// java.lang.String
public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}

// java.util.regex.Pattern (excerpt)
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            // ... (the rest of the method is not shown)

// java.lang.String: subSequence delegates to substring,
// so on JDK6 every piece shares the original char array.
public CharSequence subSequence(int beginIndex, int endIndex) {
    return this.substring(beginIndex, endIndex);
}

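III. Other notes:

1. String a1 = "Hello"; // a string literal; the JVM has already interned it into the constant pool by default.
When such a string is created, the JVM checks whether an identical string already exists in its internal pool. If it does, no new string is constructed and the existing instance is returned; if it does not, new memory is allocated for the newly created string.
String a2 = new String("Hello"); // creates a brand-new String object every time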
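The difference between the two forms can be observed by comparing references. This is a small sketch; the class name InternDemo and the helper sameInstance are hypothetical.

```java
// Hypothetical demo: literals share one pooled instance, new String(...) does not.
class InternDemo {
    // True only if both references point to the same object.
    static boolean sameInstance(String x, String y) {
        return x == y;
    }

    public static void main(String[] args) {
        String a1 = "Hello";
        String a2 = new String("Hello");

        System.out.println(sameInstance(a1, "Hello"));     // true: same pooled literal
        System.out.println(sameInstance(a1, a2));          // false: a2 is a new object
        System.out.println(a1.equals(a2));                 // true: same contents
        System.out.println(sameInstance(a1, a2.intern())); // true: intern() returns the pooled instance
    }
}
```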

2. When concatenating constant strings, prefer +, because the compiler usually folds the whole expression into a single constant.

public String constractStr()
{
    return "str1" + "str2" + "str3";
}

The corresponding bytecode:

Code:
0: ldc #24; //String str1str2str3 -- push the string constant onto the stack
2: areturn
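One observable consequence of this folding, sketched below with hypothetical class and method names: the folded result is the same pooled instance as the literal "str1str2str3", whereas concatenating the same values at run time produces a new object.

```java
// Hypothetical demo of compile-time constant folding versus run-time concatenation.
class ConstantFolding {
    // All operands are constants, so the compiler emits one string constant.
    static String folded() {
        return "str1" + "str2" + "str3";
    }

    // Operands are parameters, so the result is built at run time.
    static String dynamic(String a, String b, String c) {
        return a + b + c;
    }

    public static void main(String[] args) {
        System.out.println(folded() == "str1str2str3");                        // true
        System.out.println(dynamic("str1", "str2", "str3") == "str1str2str3"); // false
        System.out.println(dynamic("str1", "str2", "str3").equals(folded()));  // true
    }
}
```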

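3. When concatenating dynamic strings, prefer the append method of StringBuffer or StringBuilder, which avoids constructing too many temporary String objects (for a single expression, the javac compiler already applies this optimization to String concatenation automatically):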

public String constractStr(String str1, String str2, String str3)
{
    return str1 + str2 + str3;
}

The corresponding bytecode (since JDK 1.5 the concatenation is converted into calls to StringBuilder.append):

Code:
0:   new     #24; //class java/lang/StringBuilder
3:   dup
4:   aload_1
5:   invokestatic    #26; //Method java/lang/String.valueOf:(Ljava/lang/Object;)Ljava/lang/String;
8:   invokespecial   #32; //Method java/lang/StringBuilder."<init>":(Ljava/lang/String;)V
11:  aload_2
12:  invokevirtual   #35; //Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
15:  aload_3
16:  invokevirtual   #35; //Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;  -- call StringBuilder's append method
19:  invokevirtual   #39; //Method java/lang/StringBuilder.toString:()Ljava/lang/String;
22:  areturn     -- return the reference
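The automatic conversion applies to each concatenation expression separately, so inside a loop every += still produces a new intermediate String. The sketch below contrasts the two approaches; the class and method names are hypothetical, and the exact bytecode for the + version depends on the compiler version.

```java
// Hypothetical demo: one explicit StringBuilder versus += inside a loop.
class JoinDemo {
    // Reuses a single StringBuilder for the whole loop.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    // For comparison: each iteration creates a new intermediate String.
    static String joinWithPlus(String[] parts) {
        String s = "";
        for (String p : parts) {
            s += p;
        }
        return s;
    }

    public static void main(String[] args) {
        String[] parts = {"str1", "str2", "str3"};
        System.out.println(joinWithBuilder(parts)); // str1str2str3
        System.out.println(joinWithPlus(parts));    // str1str2str3
    }
}
```

Both methods return the same text. The difference is only in how many temporary objects are allocated along the way, which matters when the loop runs many times.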