Java Notes (30): Regular Expressions

Overview

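A regular expression is an expression that follows a defined set of rules.

It is used specifically for working with strings.

Characteristics:

Specific symbols stand for operations that would otherwise have to be written out as code, which shortens what you write.

Learning regular expressions therefore means learning how to use these special symbols.

Advantage:

Complex string operations become much simpler to express.

Drawback:

The more symbols a pattern uses, the longer it gets and the harder it is to read.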

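Rules

Only a few simple rules are listed below; for the full details, consult the API documentation for java.util.regex.Pattern.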

Greedy quantifiers
X?    X, once or not at all
X*    X, zero or more times
X+    X, one or more times
X{n}    X, exactly n times
X{n,}    X, at least n times
X{n,m}    X, at least n but not more than m times
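
The quantifier table can be checked with a small sketch. The class name QuantifierDemo and the helper fullMatch are hypothetical names chosen for illustration; the only library call used is Pattern.matches, which tests the whole input against the pattern.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true only when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a?", ""));        // true: zero occurrences allowed
        System.out.println(fullMatch("a*", "aaa"));     // true: zero or more
        System.out.println(fullMatch("a+", ""));        // false: needs at least one
        System.out.println(fullMatch("a{2}", "aaa"));   // false: exactly two required
        System.out.println(fullMatch("a{2,}", "aaa"));  // true: at least two
        System.out.println(fullMatch("a{1,2}", "aaa")); // false: at most two
    }
}
```

Because the whole input has to match, "a{2}" rejects "aaa" even though the first two characters would satisfy it.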

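Character classes
[abc]    a, b, or c (simple class)
[^abc]    Any character except a, b, or c (negation)
[a-zA-Z]    a through z or A through Z, inclusive (range)
[a-d[m-p]]    a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)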
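
As a rough illustration of union, intersection, and subtraction, the sketch below tests single characters against some of the classes listed above. CharClassDemo and matchesOne are hypothetical names; String.matches requires the whole string to match, so each input here is one character long.

```java
class CharClassDemo {
    // True when the single-character string s belongs to the given character class.
    static boolean matchesOne(String charClass, String s) {
        return s.matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(matchesOne("[^abc]", "d"));        // true
        System.out.println(matchesOne("[^abc]", "a"));        // false
        System.out.println(matchesOne("[a-d[m-p]]", "n"));    // true: union
        System.out.println(matchesOne("[a-d[m-p]]", "e"));    // false
        System.out.println(matchesOne("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(matchesOne("[a-z&&[^bc]]", "b"));  // false: b is subtracted
    }
}
```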

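Predefined character classes
.    Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]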
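
One detail is easy to miss with the predefined classes: inside a Java string literal the backslash itself has to be escaped, so the regex \d is written "\\d" in source code. The sketch below shows this; PredefinedClassDemo and isAccountNumber are hypothetical names, and the rule it checks (5 to 15 digits, not starting with 0) is only an assumed example.

```java
class PredefinedClassDemo {
    // Assumed rule for illustration: 5 to 15 digits, first digit not 0.
    static boolean isAccountNumber(String s) {
        return s.matches("[1-9]\\d{4,14}");
    }

    // Splits on any single whitespace character (space, tab, newline, ...).
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s");
    }

    public static void main(String[] args) {
        System.out.println(isAccountNumber("12345"));  // true
        System.out.println(isAccountNumber("01234"));  // false: starts with 0
        System.out.println(isAccountNumber("12a45"));  // false: contains a non-digit
        System.out.println(splitOnWhitespace("a b\tc").length); // 3
    }
}
```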

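Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1        ((A)(B(C)))
2        (A)
3        (B(C))
4        (C)
Group zero always stands for the entire expression.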
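
The numbering can be observed directly through Matcher.group(int) and Matcher.groupCount(). The following is a minimal sketch; GroupNumberDemo and groups are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns group 0..groupCount() for a full match, or an empty array if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: ABC
        // 1: ABC
        // 2: A
        // 3: BC
        // 4: C
    }
}
```

Note that groupCount() reports 4 here; group zero is not included in the count.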

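Pattern and Matcher

The java.util.regex package defines the classes used for regular-expression operations.

Pattern: the compiled representation of a regular expression; it also encapsulates the various matching modes.

Matcher: the matching engine, created from a Pattern. One Pattern object can be matched against many strings, so it can produce many Matcher objects.

A typical invocation sequence is:

    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();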
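
Since one Pattern can serve many inputs, a common arrangement is to compile it once and create a fresh Matcher per string. The sketch below assumes that arrangement; PatternReuseDemo, A_STAR_B, and isMatch are hypothetical names.

```java
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once and shared; each call below creates its own Matcher.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    static boolean isMatch(String input) {
        return A_STAR_B.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aaaaab")); // true
        System.out.println(isMatch("b"));      // true: a* may match nothing
        System.out.println(isMatch("aabc"));   // false: trailing c is not allowed
    }
}
```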

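Common operations

Many of the string operations offered by the String class are built on Pattern and Matcher. They are more convenient to call, but each one does only a single job.

In the String class

Matching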
boolean matches(String regex)
Tells whether or not this string matches the given regular expression.

Splitting
String[] split(String regex)
Splits this string around matches of the given regular expression.

Replacing
String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular expression with the given replacement.
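
The three String methods above might be used as in the following sketch. StringRegexDemo and its method names are hypothetical, and the inputs are made-up sample data.

```java
import java.util.Arrays;

class StringRegexDemo {
    // matches: the whole string must consist of digits.
    static boolean isDigits(String s) {
        return s.matches("\\d+");
    }

    // split: one or more spaces act as a single separator.
    static String[] splitOnSpaces(String s) {
        return s.split(" +");
    }

    // replaceAll: every run of five or more digits becomes "#".
    static String maskLongNumbers(String s) {
        return s.replaceAll("\\d{5,}", "#");
    }

    public static void main(String[] args) {
        System.out.println(isDigits("12345"));                                     // true
        System.out.println(Arrays.toString(splitOnSpaces("zhangsan   lisi wangwu"))); // [zhangsan, lisi, wangwu]
        System.out.println(maskLongNumbers("tel13912345678x12"));                  // tel#x12
    }
}
```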

In the Matcher class

boolean matches()
Attempts to match the entire region against the pattern.

String replaceAll(String replacement)
Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.

boolean find()
Attempts to find the next subsequence of the input sequence that matches the pattern.

String group()
Returns the input subsequence matched by the previous match.
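
Unlike matches(), which tests the whole input, find() walks through the input and stops at each matching subsequence, and group() then returns what was just matched. A minimal sketch, with the hypothetical names FindDemo and threeLetterWords and a made-up sample sentence:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects every word of exactly three lowercase letters.
    static List<String> threeLetterWords(String text) {
        // \b is a word boundary, so longer words are not cut into pieces.
        Matcher m = Pattern.compile("\\b[a-z]{3}\\b").matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {          // advance to the next match
            words.add(m.group());   // text of the match just found
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(threeLetterWords("ming tian jiu yao fang jia le"));
        // [jiu, yao, jia]
    }
}
```

Calling group() before a successful find() or matches() throws IllegalStateException, which is why the call sits inside the loop.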

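Examples

Removing repeated characters

    String str = "我我...我我我...我我要..要要要要.要学学.学.学学学..编编编编.编..编编编.程.程程程.程程程程.程程";

    // Remove every "." character (the dot must be escaped, and the backslash doubled in a Java string)
    str = str.replaceAll("\\.+", "");

    // Collapse each run of a repeated character into one: \1 refers back to group 1
    str = str.replaceAll("(.)\\1+", "$1");

    // Prints 我要学编程 ("I want to learn programming")
    System.out.println(str);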

Sorting IP addresses

    // Requires: import java.util.TreeSet;
    String str = "192.168.0.1 10.10.10.10 2.2.2.2 255.255.255.255";

    // Prefix every segment with two zeros
    str = str.replaceAll("(\\d+)", "00$1");

    // Keep only the last three digits of each segment so all segments are the same width
    str = str.replaceAll("0+(\\d{3})", "$1");

    // Split into individual addresses
    String[] arr = str.split(" ");

    // Put them in a TreeSet; with equal-width segments, string order equals numeric order
    TreeSet<String> ts = new TreeSet<String>();
    for (String s : arr)
    {
        ts.add(s);
    }

    // Strip the padding zeros; 0*(\\d+) keeps a lone 0 and leaves inner zeros (as in 101) alone
    for (String s : ts)
    {
        s = s.replaceAll("0*(\\d+)", "$1");
        System.out.println(s);
    }

Web crawler

    // Requires: java.io.BufferedReader, java.io.InputStreamReader, java.net.URL,
    // java.net.URLConnection, java.util.regex.Matcher, java.util.regex.Pattern
    public static void getMail() throws Exception
    {
        // A URL resource on the network
        URL url = new URL("http://www.cnblogs.com/feng-c-x/p/3300060.htm");

        // Open the connection
        URLConnection conn = url.openConnection();

        // Regex for an e-mail address (backslashes doubled for the Java string literal)
        String regex = "\\w+@\\w+(\\.\\w+){1,3}";

        // Compile it into a Pattern object
        Pattern p = Pattern.compile(regex);

        // Wrap the input stream in a reader; try-with-resources closes it when done
        try (BufferedReader bufr =
                 new BufferedReader(new InputStreamReader(conn.getInputStream())))
        {
            String line = null;

            // Scan each line and print every e-mail address found
            while ((line = bufr.readLine()) != null)
            {
                Matcher m = p.matcher(line);
                while (m.find())
                {
                    System.out.println(m.group());
                }
            }
        }
    }