Java basics ----> How the split method of String works

  This post covers how to use the split method of the String class and how it works internally.

What split does

The Javadoc for split says:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
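
To make the leading-substring rule from the first quoted paragraph concrete, here is a small sketch. The class name LeadingEmptyDemo and the sample inputs are made up for illustration, and the zero-width case assumes Java 8 or later, which is where this rule appears in the Javadoc.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    // Thin wrapper (hypothetical helper) that simply delegates to String.split.
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // "-" matches one character at index 0 (positive width),
        // so the result starts with an empty string.
        System.out.println(Arrays.toString(split("-a-b", "-")));  // [, a, b]

        // "" matches with zero width at index 0,
        // so no empty leading string is produced.
        System.out.println(Arrays.toString(split("abc", "")));    // [a, b, c]
    }
}
```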

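The way split works can be roughly broken into the following steps:

1. Scan for each match of regex and add the text between the previous position and the match to a list. This is the core of split.
2. If no match is found, return a one-element array containing the string itself.
3. Decide whether to add the remaining content to the list.
4. Decide whether to remove empty strings from the end of the list.
5. Convert the list to an array and return it.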

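The limit argument of split falls into one of the following cases:

1. limit < 0, e.g. limit = -1
2. limit = 0 (the default when no limit is passed)
3. limit > 0, e.g. limit = 3
4. limit > size (larger than the number of pieces), e.g. limit = 20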

How split works

We will use the following example to analyze how split works.

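import java.text.MessageFormat;
import java.util.Arrays;

public class SplitLimitDemo {
    public static void main(String[] args) {
        new SplitLimitDemo().test();
    }

    public void test() {
        String string = "linux---abc-linux-";
        splitStringWithLimit(string, -1);
        splitStringWithLimit(string, 0);
        splitStringWithLimit(string, 3);
        splitStringWithLimit(string, 20);
    }

    // Splits on "-" with the given limit and prints the pieces and their count.
    public void splitStringWithLimit(String string, int limit) {
        String[] arrays = string.split("-", limit);
        String result = MessageFormat.format("arrays={0}, length={1}", Arrays.toString(arrays), arrays.length);
        System.out.println(result);
    }
}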

// arrays=[linux, , , abc, linux, ], length=6
// arrays=[linux, , , abc, linux], length=5
// arrays=[linux, , -abc-linux-], length=3
// arrays=[linux, , , abc, linux, ], length=6

I. Step one, which has two branches

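1. If regex is a single character that is not one of the regex metacharacters .$|()[{^?*+\ or a two-character string that starts with a backslash and whose second character is not one of 0-9, a-z, A-Z, the following check passes. In both cases the character must also not be a surrogate.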
    
if (((regex.value.length == 1 &&
        ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
        (regex.length() == 2 &&
        regex.charAt(0) == '\\' &&
        (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
        ((ch-'a')|('z'-ch)) < 0 &&
        ((ch-'A')|('Z'-ch)) < 0)) &&
    (ch < Character.MIN_HIGH_SURROGATE ||
        ch > Character.MAX_LOW_SURROGATE))
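When the check passes, indexOf is used to locate each occurrence of the delimiter, with two index variables: off is where the current search starts (0 on the first pass) and next is where the delimiter was found. After each hit, the text between off and next is added to the list, and off is updated to next + 1, ready for the next search.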
{
    int off = 0;
    int next = 0;
    boolean limited = limit > 0;
    ArrayList<String> list = new ArrayList<>();
    while ((next = indexOf(ch, off)) != -1) {
        if (!limited || list.size() < limit - 1) {
            list.add(substring(off, next));
            off = next + 1;
        } else {    // last one
            //assert (list.size() == limit - 1);
            list.add(substring(off, value.length));
            off = value.length;
            break;
        }
    }
    // If no match was found, return this
    if (off == 0)
        return new String[]{this};

    // Add remaining segment
    if (!limited || list.size() < limit)
        list.add(substring(off, value.length));

    // Construct result
    int resultSize = list.size();
    if (limit == 0) {
        while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
            resultSize--;
        }
    }
    String[] result = new String[resultSize];
    return list.subList(0, resultSize).toArray(result);
}
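
To see which delimiters take this fast path, the condition quoted above can be restated with ordinary comparisons. This is only a sketch: isFastPath and FastPathCheck are hypothetical names, the helper uses public String methods instead of the internal value field, and it reflects the condition as quoted in this post rather than any particular JDK release.

```java
public class FastPathCheck {
    // Hypothetical helper: re-states the quoted condition in a more readable form.
    static boolean isFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            ch = regex.charAt(0);
            // A single character qualifies only if it is NOT a regex metacharacter.
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false;
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            ch = regex.charAt(1);
            // An escaped character qualifies only if it is not an ASCII digit or letter,
            // because sequences such as \d or \w have a special meaning.
            boolean alphanumeric = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (alphanumeric) {
                return false;
            }
        } else {
            return false;
        }
        // Surrogate code units are left to the regex engine.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isFastPath("-"));    // true
        System.out.println(isFastPath("."));    // false
        System.out.println(isFastPath("\\."));  // true
        System.out.println(isFastPath("\\d"));  // false
        System.out.println(isFastPath("--"));   // false
    }
}
```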
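2. If regex does not satisfy the check above, for example when it is a string longer than two characters, split delegates to the regex engine: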
return Pattern.compile(regex).split(this, limit);

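Pattern.split uses a Matcher to locate each match of regex. It maintains an index variable that plays the role of off above, while the m.start() reported by the matcher plays the role of next. After each match, the text between index and m.start() is added to the list, and index is then updated to m.end(), ready for the next search.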

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
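
The interplay of index, m.start() and m.end() can be traced with a stripped-down loop. This is a sketch under simplifying assumptions: MatcherTrace and piecesBeforeMatches are made-up names, only the unlimited case is handled, the zero-width check is left out, and the remaining segment after the last match is deliberately not added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherTrace {
    // Collects only the pieces that sit in front of each match.
    static List<String> piecesBeforeMatches(String input, String regex) {
        List<String> pieces = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0;
        while (m.find()) {
            // Text between the end of the previous match and the start of this one.
            pieces.add(input.substring(index, m.start()));
            // The next piece starts right after the current match.
            index = m.end();
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "-+" treats each run of dashes as a single delimiter.
        System.out.println(piecesBeforeMatches("linux---abc-linux-", "-+"));
        // [linux, abc, linux]
    }
}
```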

II. Step two

If off is still 0, regex was never found, and split returns a one-element array containing the string itself.
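
A quick check of the no-match case, as a sketch with made-up names (NoMatchDemo, splitOnDash). Note that an empty input also ends up here, so it yields one empty string rather than an empty array.

```java
import java.util.Arrays;

public class NoMatchDemo {
    // Hypothetical helper: splits on "-" with the default limit of 0.
    static String[] splitOnDash(String input) {
        return input.split("-");
    }

    public static void main(String[] args) {
        // No "-" in the input: the whole string comes back as the only element.
        System.out.println(Arrays.toString(splitOnDash("abc")));  // [abc]

        // The empty string contains no "-" either, so the result has length 1.
        System.out.println(splitOnDash("").length);               // 1
    }
}
```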
 

III. Step three

If limit <= 0, or the list has not yet reached the limit we set, the remaining content (from just after the last match to the end of the string) is added to the list.
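
The effect of this step shows up when the same input is split with different limits. The sketch below uses made-up names (RemainingSegmentDemo, split) and a sample input chosen for illustration.

```java
import java.util.Arrays;

public class RemainingSegmentDemo {
    // Hypothetical helper: splits on "-" with an explicit limit.
    static String[] split(String input, int limit) {
        return input.split("-", limit);
    }

    public static void main(String[] args) {
        // limit <= 0: the tail "c" after the last "-" is added as the remaining segment.
        System.out.println(Arrays.toString(split("a-b-c", -1)));  // [a, b, c]

        // limit = 5: the list holds 2 pieces after the loop, fewer than 5, so "c" is added.
        System.out.println(Arrays.toString(split("a-b-c", 5)));   // [a, b, c]

        // limit = 2: the loop already stored "b-c" as the last piece,
        // so the list is full and nothing more is added.
        System.out.println(Arrays.toString(split("a-b-c", 2)));   // [a, b-c]
    }
}
```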
 

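IV. Step four

This step only applies when limit is 0. In that case the list is walked from the end toward the front and trailing empty strings are dropped. Empty strings in the middle are not removed.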
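
The difference between limit = 0 and a negative limit is easiest to see on an input that ends with delimiters. This is a small sketch; TrailingEmptyDemo and its split helper are made-up names.

```java
import java.util.Arrays;

public class TrailingEmptyDemo {
    // Hypothetical helper: splits on "-" with an explicit limit.
    static String[] split(String input, int limit) {
        return input.split("-", limit);
    }

    public static void main(String[] args) {
        // limit = 0: the two trailing empty strings are dropped, the middle one stays.
        System.out.println(Arrays.toString(split("a--b--", 0)));   // [a, , b]

        // limit = -1: nothing is dropped.
        System.out.println(split("a--b--", -1).length);            // 5

        // An input made only of delimiters collapses to an empty array.
        System.out.println(split("---", 0).length);                // 0
    }
}
```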
 

V. Step five

The first resultSize elements of the list are converted with toArray, and that array is returned.
  

Links

Original article: https://www.cnblogs.com/huhx/p/baseusejavastringsplit.html