Python Regular Expressions

Regular Expressions

Regular expressions in Python are provided by the re module, which must be imported before use.

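Syntax:

import re  # import the module

# Compile the pattern into a regex object. ^ anchors the match at the start of the string and [0-9] matches any single digit from 0 to 9, so the pattern matches any string whose first character is a digit.
p = re.compile("^[0-9]")

# Match a string against the compiled pattern. If the match succeeds, m is a match object; otherwise m is None.
m = p.match('14534Abc')

if m:  # not None, so the pattern matched
    # m.group() returns the matched text, here '1', because only the character 1 was matched
    print(m.group())
else:
    print("doesn't match.")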

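The compile step and the match step above can also be merged into a single line:

m = re.match("^[0-9]", '14534Abc')

The result is the same. The difference is:

The first form compiles (parses) the pattern ahead of time, so nothing has to be compiled again when matching.
The second, shorthand form has to resolve the pattern on every call. The re module caches recently compiled patterns, so this is usually a cache lookup rather than a full recompile, but it is still extra work per call.
So if you need to pick out every line that starts with a digit from a 50,000-line file, compile the pattern first and then match; it will be somewhat faster.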
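
As a rough sketch of the compile-once approach described above, the following filters a handful of lines with one precompiled pattern. The sample data and the names `digit_start` and `lines` are made up for illustration; a real script would iterate over the file instead.

```python
import re

# Compile once, then reuse the same object for every line.
digit_start = re.compile(r"^[0-9]")

# Hypothetical sample data standing in for the lines of a large file.
lines = ["14534Abc", "abc123", "9 lives", ""]

matched = [line for line in lines if digit_start.match(line)]
print(matched)
```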

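Regular expression metacharacters:

Character matching:

.       : any single character except a newline
[]      : any character in the given set
[^]     : any character not in the given set

Quantifiers:

*       : any number of times (0, 1, or more)
.*      : any character, any number of times
?       : 0 or 1 time
+       : 1 or more times
{m}     : the preceding item exactly m times
{m,n}   : the preceding item at least m and at most n times
{m,}    : the preceding item at least m times
{,n}    : the preceding item at most n times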
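
A small sketch of how the quantifiers behave, using re.fullmatch so that the whole string has to fit the pattern. The test strings and the name `results` are arbitrary examples.

```python
import re

results = [
    bool(re.fullmatch(r"ab*", "a")),          # * allows zero b's
    bool(re.fullmatch(r"ab+", "a")),          # + needs at least one b
    bool(re.fullmatch(r"ab?", "abb")),        # ? allows at most one b
    bool(re.fullmatch(r"ab{2}", "abb")),      # exactly two b's
    bool(re.fullmatch(r"ab{2,3}", "abbbb")),  # two or three b's, not four
]
print(results)
```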

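Anchors:

^       : matches the start of the string
$       : matches the end of the string

Grouping and references:

()             : grouping; the text matched inside the parentheses is captured by the regex engine
Backreferences : \1  \2  \3 ...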
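
To illustrate grouping together with a backreference, here is a minimal sketch that finds a word typed twice in a row. The sentence and the name `doubled` are invented for the example.

```python
import re

# (\w+) captures a word; \1 must then repeat exactly the same text.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(doubled.group())   # the repeated pair
print(doubled.group(1))  # the captured word
```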

Alternation:

a|b     : a or b
C|cat   : C or cat
(C|c)at : Cat or cat
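
The difference between the last two patterns is easy to check. In this sketch the word list and the names `loose` and `grouped` are illustrative only.

```python
import re

words = ["Cat", "cat", "C", "at"]

# Without parentheses the alternation splits the whole pattern: "C" or "cat".
loose = [w for w in words if re.fullmatch(r"C|cat", w)]

# With parentheses only the first letter alternates: "Cat" or "cat".
grouped = [w for w in words if re.fullmatch(r"(C|c)at", w)]

print(loose)
print(grouped)
```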

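Escape sequences:

\w      : matches a word character (letter, digit, or underscore)
\W      : matches a non-word character
\s      : matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
\S      : matches any non-whitespace character
\d      : matches any digit, equivalent to [0-9] for ASCII text
\D      : matches any non-digit
\A      : matches the start of the string
\Z      : matches the end of the string (in Perl-style engines it also matches just before a trailing newline; Python's re matches only at the very end)
\z      : matches the end of the string (Perl-style; older Python versions do not accept it, use \Z instead)
\G      : matches the position where the previous match finished (Perl-style; not supported by Python's re module)
\b      : matches a word boundary, i.e. the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B      : matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n      : matches a newline
\t      : matches a tab
\1...\9 : matches the text captured by the nth group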
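
A quick sketch of a few of these escapes in action. Raw strings (the r prefix) keep the backslashes intact; the sample text and variable names are assumptions made for the demo.

```python
import re

text = "never verb"

# Start offsets of 'er' at a word boundary versus inside a word.
at_boundary = [m.start() for m in re.finditer(r"er\b", text)]
inside_word = [m.start() for m in re.finditer(r"er\B", text)]

digits = re.findall(r"\d+", "a1b22c333")
words = re.findall(r"\w+", "hi, there_1!")
fields = re.split(r"\s+", "a b\tc\nd")

print(at_boundary, inside_word, digits, words, fields)
```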

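Six common regular expression operations:

1. re.match(pattern, string, flags=0)

Matches the pattern at the start of the string and returns a single match.

pattern: the regular expression
string:  the string to match against
flags:   flags that control how the pattern is matched

import re

obj = re.match(r'\d+', '957evescn')
if obj:
    print(obj.group())

# Output
957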

# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
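
The flags listed above are passed as the third argument. The following is a small sketch of four of them; the sample text and the result names are chosen only for illustration.

```python
import re

text = "Hello\nworld"

# re.I: case-insensitive matching.
ci = re.search(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: '.' also matches the newline.
across = re.search(r"Hello.world", text, re.S)

# re.X: whitespace and comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+     # separator
    (\w+)   # second word
""", text, re.X)

print(ci.group(), line_starts, verbose.groups())
```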

2. re.search(pattern, string, flags=0)

Scans the whole string and returns the first match.

import re

obj = re.search(r'\d+', 'gmkk957evescn')
if obj:
    print(obj.group())

# Output
957
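
To make the contrast with re.match concrete, here is a minimal side-by-side on the same string. The names `from_start` and `anywhere` are just labels for this sketch.

```python
import re

text = "gmkk957evescn"

# match only tries at position 0, where the string has letters.
from_start = re.match(r"\d+", text)

# search scans forward until it finds the digits.
anywhere = re.search(r"\d+", text)

print(from_start)
print(anywhere.group())
```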

3. group and groups

import re

a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group())

print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))

print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).groups())

# Output
123abc456

123abc456
123
abc
456

('123', 'abc', '456')
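
The same results can be obtained from a single search call, since the match object can be kept and queried repeatedly. A short sketch, with variable names chosen for the example:

```python
import re

m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")

# group() accepts several indexes at once and returns a tuple.
whole, head, letters, tail = m.group(0, 1, 2, 3)

# span(n) gives the start and end offsets of group n.
letters_span = m.span(2)

print(whole, head, letters, tail, letters_span)
```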

4. re.findall(pattern, string, flags=0)

Finds every match in the string and returns them as a list.

import re

obj = re.findall(r'\D+', 'evescn666gmkk')
print(obj)

# Output
['evescn', 'gmkk']
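
One detail worth knowing: when the pattern contains groups, findall returns the groups rather than the whole match. A small sketch with made-up input:

```python
import re

text = "a=1, b=22, c=x"

# No groups: a list of whole matches.
numbers = re.findall(r"\d+", text)

# Two groups: a list of (group 1, group 2) tuples.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(numbers)
print(pairs)
```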

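5. re.sub(pattern, repl, string, count=0, flags=0)

Replaces the text matched by the pattern.

import re

content = "123abc456"
new_content = re.sub(r'\d+', 'sb', content)
# new_content = re.sub(r'\d+', 'sb', content, 1)  # replace only the first match
print(new_content)

# Output
sbabcsb

This is more powerful than str.replace, which only replaces fixed substrings.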
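
Two things str.replace cannot do are sketched below: reusing captured groups in the replacement, and computing the replacement with a function. The inputs and the names `swapped` and `doubled` are illustrative.

```python
import re

# Backreferences in the replacement: turn "key=value" into "value=key".
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "a=1 b=2")

# A callable replacement receives each match object: double every number.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(swapped)
print(doubled)
```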

6. re.split(pattern, string, maxsplit=0, flags=0)

Splits the string into a list, using each match of the pattern as a separator.

import re

content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
new_content = re.split(r'\*', content)
# new_content = re.split(r'\*', content, 1)
print(new_content)

###### Output (the second line is from the commented-out maxsplit=1 variant)
["'1 - 2 ", ' ((60-30+1', '(9-2', '5/3+7/3', '99/4', '2998+10', '568/14))-(-4', '3)/(16-3', "2) )'"]
["'1 - 2 ", " ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"]
######

content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
new_content = re.split(r'[\+\-\*/]+', content)
# new_content = re.split(r'[\+\-\*/]+', content, 1)
print(new_content)

###### Output (the second line is from the commented-out maxsplit=1 variant)
["'1 ", ' 2 ', ' ((60', '30', '1', '(9', '2', '5', '3', '7', '3', '99', '4', '2998', '10', '568', '14))', '(', '4', '3)', '(16', '3', "2) )'"]
["'1 ", " 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"]
######

inpp = '1-2*((60-30 +(-40-5)*(9-2*5/3 + 7 /3*99/4*2998 +10 * 568/14 )) - (-4*3)/ (16-3*2))'
inpp = re.sub(r'\s*', '', inpp)  # strip all whitespace
print(inpp)

# Split once on the first innermost parenthesized "number operator number" expression; the capture group keeps its contents in the result
new_content = re.split(r'\(([\+\-\*/]?\d+[\+\-\*/]?\d+){1}\)', inpp, 1)
print(new_content)

###### Output
1-2*((60-30+(-40-5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))
['1-2*((60-30+', '-40-5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))']
######
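
The last split relies on a general rule: text captured by a group in the separator pattern is kept in the result list. A reduced sketch of that behavior, on a shorter expression invented for the example:

```python
import re

expr = "60-30+1*9"

# Without a group the operators are dropped.
operands = re.split(r"[+\-*/]", expr)

# With a capturing group the operators are kept as separate items.
tokens = re.split(r"([+\-*/])", expr)

print(operands)
print(tokens)
```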

A few common regex examples:

Matching a mobile phone number

import re

phone_str = "my name is evescn, and my phone number is 18111555666"

m = re.search(r"(1)([358]\d{9})", phone_str)
if m:
    print(m.group())

# Output
18111555666

Matching an IPv4 address

ip_addr = "inet 172.19.133.212 brd 172.19.143.255"

m = re.search(r"(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}", ip_addr)

print(m.group())

# Output
172.19.133.212
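
re.search finds an address somewhere inside a longer line. If the goal is instead to validate that a whole string is an address, one possible variation is to reuse the same pattern with fullmatch. This is only a sketch; the names `octet`, `ipv4`, and `is_ipv4` are made up here.

```python
import re

# Same octet pattern as above, factored out so it is written only once.
octet = r"(25[0-5]|2[0-4]\d|[0-1]?\d?\d)"
ipv4 = re.compile(rf"{octet}(\.{octet}){{3}}")

def is_ipv4(text):
    """Return True if the whole string looks like a dotted IPv4 address."""
    return ipv4.fullmatch(text) is not None

print(is_ipv4("172.19.133.212"))
print(is_ipv4("256.1.1.1"))
print(is_ipv4("1.2.3"))
```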

Matching contact details with groups

contactInfo = 'Evescn, ChengDu: 028-8888888'

# numbered groups
match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
"""
>>> match.group(1)
  'Evescn'
>>> match.group(2)
  'ChengDu'
>>> match.group(3)
  '028-8888888'
"""

# named groups
match = re.search(r'(?P<name>\w+), (?P<addr>\w+): (?P<phone>\S+)', contactInfo)
"""
>>> print(match.group('name'))
  Evescn
>>> print(match.group('addr'))
  ChengDu
>>> print(match.group('phone'))
  028-8888888
"""

Matching an email address

email = "evescn.gmkk@163.com   http://blog.evescn.com"

m = re.search(r"[0-9.a-z]{0,26}@[0-9.a-z]{0,20}\.[0-9a-z]{0,8}", email)
print(m.group())

# Output
evescn.gmkk@163.com

Reposted from: http://www.cnblogs.com/alex3714/articles/5143440.html