The Road to Python, Part 21: Regular Expressions

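Regular expressions are used to match text. A regular expression is built from two kinds of characters: ordinary characters and metacharacters.

In Python, the re module provides the regular-expression operations.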

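Characters:

    .  matches any character except a newline
    \w matches a letter, digit, or underscore (for Unicode string patterns also other word characters such as Chinese characters)
    \W matches any character that \w does not match
    \s matches any whitespace character; in ASCII mode equivalent to [ \t\n\r\f\v]
    \S matches any non-whitespace character; in ASCII mode equivalent to [^ \t\n\r\f\v]
    \d matches a digit
    \D matches a non-digit; in ASCII mode equivalent to [^0-9]
    \b matches the start or end of a word
    ^  matches the start of the string
    $  matches the end of the string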
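
The character classes listed above can be tried out with a short sketch like the one below. The sample string is made up for illustration and is not from the original notes.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
text = "user_42 paid 3 yuan\ttoday"

digits = re.findall(r"\d+", text)        # runs of digits
words = re.findall(r"\w+", text)         # runs of word characters
spaces = re.findall(r"\s", text)         # individual whitespace characters
bounded = re.findall(r"\bpaid\b", text)  # "paid" as a whole word

print(digits)   # ['42', '3']
print(words)    # ['user_42', 'paid', '3', 'yuan', 'today']
print(spaces)   # [' ', ' ', ' ', '\t']
print(bounded)  # ['paid']
```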

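[] Character set. Inside a character set most metacharacters lose their special meaning, although a few keep it. For example, [bc] matches b or c; [a-z] matches any character in the range a to z; in [^1-9] the ^ negates the set, so it matches any character except 1 to 9.

() Grouping.

| Alternation ("or"). For example, ([*/]|\*\*) matches a multiplication, division, or power operator.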
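
One detail worth seeing in action: alternatives are tried from left to right, so with the operator pattern written as ([*/]|\*\*) the single-character set matches first and ** is split into two * matches. The sketch below compares that with a variant that puts the longer alternative first; the expression string is an invented example.

```python
import re

expr = "2**3*4/5"  # hypothetical input

# Alternatives are tried left to right, so the one-character set wins first.
as_written = re.findall(r"([*/]|\*\*)", expr)
# Putting the longer alternative first lets "**" match as a single operator.
longest_first = re.findall(r"(\*\*|[*/])", expr)

print(as_written)     # ['*', '*', '*', '/']
print(longest_first)  # ['**', '*', '/']
```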

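Quantifiers:

    *     repeat zero or more times
    +     repeat one or more times
    ?     repeat zero or one time
    {n}   repeat exactly n times
    {n,}  repeat n or more times
    {n,m} repeat n to m times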
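
As a rough illustration of the quantifiers above, the following sketch checks which of a few made-up sample strings each pattern matches in full. `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]  # hypothetical test strings
patterns = [r"ab*c", r"ab+c", r"ab?c", r"ab{2}c", r"ab{2,}c", r"ab{1,2}c"]

# For each pattern, keep the samples that it matches in full.
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
# ab*c     ['ac', 'abc', 'abbc', 'abbbc']
# ab+c     ['abc', 'abbc', 'abbbc']
# ab?c     ['ac', 'abc']
# ab{2}c   ['abbc']
# ab{2,}c  ['abbc', 'abbbc']
# ab{1,2}c ['abc', 'abbc']
```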

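Backslash (\)

Followed by a metacharacter, it removes that metacharacter's special meaning (for example, \. matches a literal dot).
Followed by certain ordinary characters, it gives them a special meaning (for example, \d and \w).
Followed by a number, it refers to the string matched by the group with that number. For example, in "(alex)(eric)com\2" each pair of parentheses is one group, so after com the text must contain another eric.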
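
A minimal sketch of the three backslash roles, using the backreference pattern (alex)(eric)com\2 from the text. The test strings are invented for illustration.

```python
import re

# \2 must repeat whatever group 2 matched, here "eric".
pattern = r"(alex)(eric)com\2"
hit = re.match(pattern, "alexericcomeric")
miss = re.match(pattern, "alexericcomalex")

print(hit.group())  # alexericcomeric
print(miss)         # None

# Escaping a metacharacter: \. matches only a literal dot.
dots = re.findall(r"\.", "a.b.c")
print(dots)         # ['.', '.']
```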

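Greedy and non-greedy matching

Take *, which repeats the preceding character zero or more times: a greedy match takes as many repetitions as possible.

Adding ? after * (or after any other quantifier) makes the match non-greedy, so it takes as few repetitions as possible.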
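
The difference is easiest to see on a string where both a long and a short match are possible. This is a small sketch with an invented sample string.

```python
import re

html = "<b>one</b><b>two</b>"  # hypothetical input

# Greedy: .* runs to the last </b> it can reach.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the first </b>.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```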

match

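# match: matches from the start of the string; returns a match object on success, None otherwise
 
 match(pattern, string, flags=0)
 # pattern: the regular expression
 # string : the string to match against
 # flags  : matching mode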
     X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
     I  IGNORECASE  Perform case-insensitive matching.
     M  MULTILINE   "^" matches the beginning of lines (after a newline)
                    as well as the string.
                    "$" matches the end of lines (before a newline) as well
                    as the end of the string.
     S  DOTALL      "." matches any character at all, including the newline.
 
     A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                    match the corresponding ASCII character categories
                    (rather than the whole Unicode categories, which is the
                    default).
                    For bytes patterns, this flag is the only available
                    behaviour and needn't be specified.
      
     L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
     U  UNICODE     For compatibility only. Ignored for string patterns (it
                    is the default), and forbidden for bytes patterns.
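
To make the flags above more concrete, here is a small sketch of re.I, re.M, re.S, and re.X. The sample text and the date pattern are assumptions made up for this example.

```python
import re

text = "Hello\nhello world"  # hypothetical two-line input

case_sensitive = re.match(r"hello", text)        # None: "H" is not "h"
ignore_case = re.match(r"hello", text, re.I)     # matches "Hello"
line_starts = re.findall(r"^hello", text, re.I | re.M)  # ^ at every line start
dot_default = re.match(r"Hello.hello", text)     # None: . does not match \n
dot_all = re.match(r"Hello.hello", text, re.S)   # . also matches \n

# re.X lets a pattern span several lines and carry comments.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)

print(ignore_case.group())             # Hello
print(line_starts)                     # ['Hello', 'hello']
print(date.match("2024-05").groups())  # ('2024', '05')
```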

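Code examples

import re

origin = "hello alex bcd abcd lge acd 19"

# Without groups
r = re.match(r"h\w+", origin)
print(r.group())     # the whole matched text
print(r.groups())    # the text matched by each group in the pattern (empty here)
print(r.groupdict()) # the text matched by each named group (empty here)

# With groups
# Why use groups? To extract specific parts of a successful match (the whole pattern must match first; the parts are then taken from that match)

r = re.match(r"h(\w+).*(?P<name>\d)$", origin)
print(r.group())     # the whole matched text
print(r.groups())    # the text matched by each group in the pattern
print(r.groupdict()) # the text matched by every group that was given a key (name)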

search

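# search: scans the whole string and returns the first match; returns None if nothing matches
# search(pattern, string, flags=0)

Code examples

import re

origin = "hello alex bcd abcd lge acd 19"

# Without groups
r = re.search(r"a\w+", origin)
print(r.group())     # the whole matched text
print(r.groups())    # the text matched by each group in the pattern (empty here)
print(r.groupdict()) # the text matched by each named group (empty here)

# With groups
r = re.search(r"a(\w+).*(?P<name>\d)$", origin)
print(r.group())     # the whole matched text
print(r.groups())    # the text matched by each group in the pattern
print(r.groupdict()) # the text matched by every group that was given a key (name)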
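
Because both functions return None when nothing matches, calling .group() on the result without checking will raise AttributeError. The sketch below contrasts match and search on the same string and shows one way to guard against None. The helper name `first_a_run` is hypothetical.

```python
import re

origin = "hello alex bcd abcd lge acd 19"

at_start = re.match(r"a\w+", origin)   # None: the string starts with "h"
anywhere = re.search(r"a\w+", origin)  # finds the first occurrence, "alex"

def first_a_run(text):
    """Return the first run matching a\\w+, or None if there is none."""
    m = re.search(r"a\w+", text)
    return m.group() if m else None

print(at_start)               # None
print(anywhere.group())       # alex
print(first_a_run("xyz 19"))  # None
```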

findall

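# findall: returns a list of all non-overlapping matches. If the pattern has one group, each item in the list is the string matched by that group; if it has several groups, each item is a tuple.
# Empty matches are included in the result
# findall(pattern, string, flags=0)

Code examples

import re

origin = "hello alex bcd abcd lge acd 19"

# Without groups
r = re.findall(r"a\w+", origin)
print(r)

# With groups
r = re.findall(r"a((\w*)c)(d)", origin)
print(r)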
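
How the shape of the findall result changes with the number of groups can be checked with a sketch like this, using the same sample string as the examples above.

```python
import re

origin = "hello alex bcd abcd lge acd 19"

no_group = re.findall(r"a\w+", origin)             # whole matches
one_group = re.findall(r"a(\w+)", origin)          # only the group's text
many_groups = re.findall(r"a((\w*)c)(d)", origin)  # one tuple per match

print(no_group)     # ['alex', 'abcd', 'acd']
print(one_group)    # ['lex', 'bcd', 'cd']
print(many_groups)  # [('bc', 'b', 'd'), ('c', '', 'd')]
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group ((\w*)c) comes first in each tuple.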

sub

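# sub: replaces the parts of the string that match the pattern
 
sub(pattern, repl, string, count=0, flags=0)
# pattern: the regular expression
# repl   : the replacement string, or a callable
# string : the string to match against
# count  : maximum number of replacements (0 means replace all)
# flags  : matching mode

Code examples

# Groups make no difference here

import re

origin = "hello alex bcd alex lge alex acd 19"
r = re.sub(r"a\w+", "999", origin, count=2)
print(r)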
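
The repl argument can also refer back to groups or be a callable. The following is a minimal sketch of both forms; the replacement formats are invented for illustration.

```python
import re

origin = "hello alex bcd alex lge alex acd 19"

# \1 in the replacement inserts the text of group 1.
tagged = re.sub(r"a(\w+)", r"<\1>", origin)

# A callable receives the match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), origin)

print(tagged)   # hello <lex> bcd <lex> lge <lex> <cd> 19
print(doubled)  # hello alex bcd alex lge alex acd 38
```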

split

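# split: splits the string at the places where the pattern matches
 
split(pattern, string, maxsplit=0, flags=0)
# pattern : the regular expression
# string  : the string to match against
# maxsplit: maximum number of splits (0 means no limit)
# flags   : matching mode

Code examples

import re

# Without groups
origin = "hello alex bcd alex lge alex acd 19"
r = re.split("alex", origin, maxsplit=1)
print(r)

# With groups: the text matched by each group is also included in the result
origin = "hello alex bcd alex lge alex acd 19"
r1 = re.split("(alex)", origin, maxsplit=1)
print(r1)
r2 = re.split("(al(ex))", origin, maxsplit=1)
print(r2)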

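Additional notes

re.match(pattern, string, flags): the flags control how matching is done. re.I makes matching case-insensitive, re.L makes matching locale-dependent (in Python 3 it can only be used with bytes patterns), re.M enables multi-line matching, and re.S makes . match every character including the newline.

re.search(pattern, string, flags) takes the same flags with the same meanings.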

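After a successful match, match and search return a match object whose methods give access to the result: group() returns the matched string, start() returns the position where the match starts, end() returns the position where it ends, and span() returns a tuple (start, end).
The default argument of group() is 0. group(n) returns the string matched by group number n; group(n, m) returns a tuple of the strings matched by groups n and m.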
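
A short sketch of these match-object methods follows. The input string and the pattern are made up for the example.

```python
import re

# Hypothetical input; the pattern has two groups.
m = re.search(r"(\w+)@(\w+)\.com", "mail alex@example.com now")

print(m.group())      # alex@example.com   (same as m.group(0))
print(m.group(1))     # alex
print(m.group(1, 2))  # ('alex', 'example')
print(m.start())      # 5
print(m.end())        # 21
print(m.span())       # (5, 21)
```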

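sub(pattern, new_string, old_string, count=0) performs replacement. count limits how many replacements are made; with the default of 0, every match is replaced. sub returns only the resulting string. To also get the number of replacements, use subn, which returns a tuple of the resulting string and a number saying how many replacements were made.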
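
The difference between the two return values can be seen in a quick sketch:

```python
import re

origin = "hello alex bcd alex lge alex acd 19"

limited = re.sub(r"a\w+", "999", origin, count=2)  # a plain string
counted = re.subn(r"a\w+", "999", origin)          # (string, replacements)

print(limited)  # hello 999 bcd 999 lge alex acd 19
print(counted)  # ('hello 999 bcd 999 lge 999 999 19', 4)
```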

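compile turns a pattern into a regular expression object. Expressions that are used often can be compiled once into such an object, which improves efficiency somewhat.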
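
A minimal sketch of compiling once and reusing the object; the list of lines is invented for illustration. The compiled object offers the same operations (match, search, findall, sub, split) as methods, without the pattern argument.

```python
import re

# Compile once, reuse many times.
word_a = re.compile(r"a\w+")

lines = ["hello alex", "bcd abcd", "lge", "acd 19"]  # hypothetical data
hits = [word_a.findall(line) for line in lines]
replaced = word_a.sub("999", "hello alex")

print(hits)      # [['alex'], ['abcd'], [], ['acd']]
print(replaced)  # hello 999
```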

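When a group is constrained on both sides, the ? used for non-greedy matching has no visible effect. Take the regular expression a(\d+)b and the string a23b: findall returns 23 and search matches a23b. Writing a(\d+?)b does not change this, because the group still has to extend until the b that follows it.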
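
This can be confirmed with a short sketch. The last line drops the trailing b to show that the non-greedy ? does take effect once nothing after the group forces it to grow.

```python
import re

s = "a23b"

greedy = re.findall(r"a(\d+)b", s)
lazy_bounded = re.findall(r"a(\d+?)b", s)  # b must follow, so both digits are needed
whole = re.search(r"a(\d+?)b", s).group()
lazy_open = re.findall(r"a(\d+?)", s)      # nothing follows, so one digit is enough

print(greedy)        # ['23']
print(lazy_bounded)  # ['23']
print(whole)         # a23b
print(lazy_open)     # ['2']
```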

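In Python string literals, sequences such as \a have a special meaning. For example,
f=open('d:\abc.txt','r') fails because \a is an escape sequence, so the path is not what it appears to be. The string needs an r prefix: f=open(r'd:\abc.txt','r')
Putting r in front of a string makes it a raw string.
In many cases findall gives priority to the contents of a group. Adding ?: at the start of the group removes that priority. For example, 'www\.(?:baidu)\.com' returns www.baidu.com instead of only baidu.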
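
Both points can be checked without touching the file system. In this sketch the path is only measured, not opened, and the URL string is the one used in the text.

```python
import re

plain = "d:\abc.txt"  # \a is read as one character (the bell character)
raw = r"d:\abc.txt"   # raw string: backslash and "a" stay separate

print(len(plain), len(raw))  # 9 10
print("\\" in plain)         # False: no backslash survived
print("\\" in raw)           # True

# Capturing group: findall returns only the group's text.
capturing = re.findall(r"www\.(baidu)\.com", "www.baidu.com")
# Non-capturing group (?:...): findall returns the whole match.
non_capturing = re.findall(r"www\.(?:baidu)\.com", "www.baidu.com")

print(capturing)      # ['baidu']
print(non_capturing)  # ['www.baidu.com']
```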

Commonly used regular expressions

IP address:
^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$
Mobile phone number:
^1[3458][0-9]\d{8}$
Email address:
[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+
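
The patterns above can be exercised as in the sketch below. The constant names and the test values are made up for this example. The phone pattern is written with [3458], since a | inside a character set is a literal character rather than "or", and the email pattern is applied with fullmatch because it has no ^ and $ anchors of its own.

```python
import re

# Hypothetical names for the three patterns.
IP_RE = re.compile(
    r"^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$"
)
PHONE_RE = re.compile(r"^1[3458][0-9]\d{8}$")
EMAIL_RE = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+")

print(bool(IP_RE.match("192.168.1.1")))          # True
print(bool(IP_RE.match("256.1.1.1")))            # False
print(bool(PHONE_RE.match("13812345678")))       # True
print(bool(PHONE_RE.match("12812345678")))       # False
print(bool(EMAIL_RE.fullmatch("alex@example.com")))  # True
print(bool(EMAIL_RE.fullmatch("alex@example")))      # False
```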

Class notes

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re

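# When using a module, read its source file
# Matching is greedy by default
# The most commonly used re functions; match and findall are used most often
# re.match()        matches at the start of the string
# re.search()       scans the whole string and matches the first substring that fits the pattern
# re.findall()      puts everything that matches into a list
# #re.finditer()    differs from findall the way xrange differs from range (Python 2): results appear only when iterated
# re.split()
# re.sub()

# match, search, and findall can be used in two ways: simple matching and group matching
# match and search return an object with many methods; the most basic are group(), groups(), and groupdict()
# match, search, and findall all take three arguments: the regular expression, the source string, and the flags
# Why use groups?
# Groups pull values out of text that has already matched; one value per pair of parentheses, no matter how deeply nested
# Simple matching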

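origin = 'hello alex bcd alev lge alec acd 19'
r = re.match(r'(h)(\w+)', origin)
print(r.group())  # the whole matched text
print(r.groups()) # the text matched by each group in the pattern

r = re.match(r'(?P<n1>h)(?P<n2>\w+)', origin)
print(r.groupdict()) # the text matched by every group that was given a key (name)
# (?P<key>...) is the fixed syntax for a named group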

# search works like match, except that match only matches at the start and search finds the first occurrence anywhere
origin = 'hello alex bcd alev lge alec acd 19'
r = re.search(r'(a)(\w+)', origin)
print(r.group())
print(r.groups())

r = re.search(r'(?P<n1>a)(?P<n2>\w+)', origin)
print(r.groupdict())

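print(re.findall(r'\d+\w\d+', 'a2b3c4d5'))
# Output: ['2b3', '4d5']. After 2b3 is matched, matching resumes from c and finds 4d5
# While nothing matches, the engine moves forward one character at a time; once something matches, that text is consumed and the search continues after it
print(re.findall('', 'a2b3c4d5'))
# Note the output of this line: ['', '', '', '', '', '', '', '', '']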


# Splitting
origin = 'hello alex bcd abcd lge acd 19'
print(re.split(r'a\w+', origin, maxsplit=1))
# Output: ['hello ', ' bcd abcd lge acd 19']
print(re.split(r'(a\w+)', origin, maxsplit=1))
# Output: ['hello ', 'alex', ' bcd abcd lge acd 19']
print(re.split(r'a(\w+)', origin, maxsplit=1))
# Output: ['hello ', 'lex', ' bcd abcd lge acd 19']

#1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))
origin = '1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))'
print(re.findall(r'\([^()]+\)', origin))
# This extracts the innermost parentheses
# Output: ['(-40.0/5)', '(9-2*5/3+7/3*99/4*2998+10*568/14)', '(-4*3)', '(16-3*2)']

# Use re.split() to write a calculator

re.sub()
# Replacement
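
The notes mention re.finditer() without an example. As a rough sketch, it can walk over the innermost parentheses of the expression above one match at a time, giving access to each match's position as well as its text.

```python
import re

origin = '1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))'

# finditer yields match objects lazily instead of building a list of strings.
inner = [(m.start(), m.group()) for m in re.finditer(r'\([^()]+\)', origin)]

for start, text in inner:
    print(start, text)
# The first line printed is: 12 (-40.0/5)
```

A calculator built on this idea could evaluate each innermost group, substitute the result back into the string, and repeat until no parentheses remain.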

Three things help to ease the weariness of life: hope, sleep, and laughter. (Kant)