Python: the re module (regular expressions)

Using the re module:

1. Use the compile() function to compile a regular expression into a Pattern object, for example: pattern = re.compile(r'\d+')

2. Use the attributes and methods of the Pattern object to search the text; a successful search yields a Match object

match method: tries to match only at the start position (index 0 unless pos is given), matches once, returns None on failure ----------> match(string[, pos[, endpos]])

m = pattern.match('one12twothree34four', 3, 10) # start matching at index 3, i.e. at the character '1'; returns a Match object, or None if nothing matches

# -*- coding: utf-8 -*-

import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I makes the match case-insensitive
m = pattern.match("hello world wide web python")

print(m)  # <re.Match object; span=(0, 11), match='hello world'> (Python 3.6 and earlier show _sre.SRE_Match)
print(m.group(), type(m.group()))  # hello world <class 'str'>
print(m.group(1)) # hello
print(m.group(2)) # world
print(m.span(), type(m.span()))  # (0, 11) <class 'tuple'>
print(m.groups(), type(m.groups()))  # ('hello', 'world') <class 'tuple'>
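
The pos and endpos arguments of match can be sketched as follows. This is only a small illustration that reuses the digit pattern and sample string from the description above; the variable names m_start and m_pos are made up for this example.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

# Without pos, match() is attempted at index 0, where 'o' is not a digit.
m_start = pattern.match(text)
print(m_start)  # None

# With pos=3 the attempt happens at index 3, where the character '1' sits.
m_pos = pattern.match(text, 3, 10)
if m_pos is not None:  # always check for None before using the result
    print(m_pos.group())  # 12
    print(m_pos.span())   # (3, 5), indices are relative to the whole string
```

Checking for None first avoids an AttributeError when nothing matches.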

search method: scans the string from any position for the first match, matches once, returns None on failure ----------> search(string[, pos[, endpos]]), used the same way as match
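
A minimal sketch of how search differs from match, assuming the same digit pattern and sample string as in the match description; the names first and second are only for illustration.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

# match() only tries index 0, so it fails on a string that starts with letters.
print(pattern.match(text))  # None

# search() scans forward and stops at the first match it finds.
first = pattern.search(text)
print(first.group(), first.span())  # 12 (3, 5)

# pos moves the starting point of the scan, so the next number is found.
second = pattern.search(text, 5)
print(second.group(), second.span())  # 34 (13, 15)
```
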
findall method: finds all matches, returns a list, returns an empty list if nothing matches ----------> findall(string[, pos[, endpos]])

# -*- coding: utf-8 -*-

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')  # match runs of digits
rel1 = pattern.findall('hello 123 world 456 ')
print(rel1)   # ['123', '456']

rel2 = pattern.findall('one12two23s34f45f56s78e89t10', 10, 20)  # restrict the search to indices 10 through 19
print(rel2)  # ['34', '45', '56']

# re.compile() takes a matching rule and returns a Pattern instance,
# which is then used to match strings against that rule
pattern2 = re.compile(r'\d+\.\d*')
# pattern2.findall() finds every substring that matches the rule
result = pattern2.findall("123.141593, 'bigcat', 232312, 3.15")
# findall returns all matching substrings as a list
print(result)  # ['123.141593', '3.15']

finditer method: finds all matches, returns an iterator that yields Match objects ----------> finditer(string[, pos[, endpos]])

# -*- coding: utf-8 -*-

import re

'''finditer is similar to findall, but yields Match objects'''

pattern = re.compile(r'\d+')
resl = pattern.finditer('hello-123-world-456-python-789')

print(resl)  # <callable_iterator object at 0x0000022A886FD470>
print(type(resl))  # <class 'callable_iterator'>    # an iterator object
for m in resl:  # m is a Match object; see the match example above
    print(m.group())  # prints 123, 456 and 789 in turn
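
Because finditer yields Match objects rather than plain strings, the position of each hit is available as well. A small sketch on the same sample string; the list name found is made up for this example.

```python
import re

pattern = re.compile(r'\d+')
text = 'hello-123-world-456-python-789'

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(found)  # [('123', (6, 9)), ('456', (16, 19)), ('789', (27, 30))]
```

This is the main reason to pick finditer over findall: findall only returns the matched strings.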

split method: splits the string, returns a list ----------> split(string[, maxsplit])

# -*- coding: utf-8 -*-

import re

'''split cuts the string wherever the pattern matches and returns a list'''
p = re.compile(r'[\s,;]+')  # \s already covers tab and newline
print(p.split('  a  ,    bwf  ;; c '))   # ['', 'a', 'bwf', 'c', '']
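
The optional maxsplit argument limits how many cuts are made. A rough sketch, assuming a separator pattern of whitespace, commas and semicolons; the sample string and the names parts_all, parts_one and parts_two are invented for illustration.

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a, b;; c  d'

# With no limit, every run of separators produces a cut.
parts_all = p.split(text)
print(parts_all)  # ['a', 'b', 'c', 'd']

# maxsplit=1 cuts once and leaves the rest of the string untouched.
parts_one = p.split(text, maxsplit=1)
print(parts_one)  # ['a', 'b;; c  d']

# maxsplit=2 cuts twice.
parts_two = p.split(text, maxsplit=2)
print(parts_two)  # ['a', 'b', 'c  d']
```
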

sub method: replaces matches ----------> sub(repl, string[, count])

# -*- coding: utf-8 -*-

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 1236 hello 456'
print(p.sub('hello world', s))  # hello world hello world
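
The two groups in the pattern above can also be referenced from the replacement. The sketch below shows three standard uses of sub: a backreference, the count argument, and a function as repl. The variable names swapped, once and upper are chosen only for this example.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 1236 hello 456'

# \1 and \2 refer to the first and second captured group; here they are swapped.
swapped = p.sub(r'\2 \1', s)
print(swapped)  # 1236 hello 456 hello

# count=1 replaces only the first match.
once = p.sub('hello world', s, count=1)
print(once)  # hello world hello 456

# repl may also be a function that receives each Match object.
upper = p.sub(lambda m: m.group(1).upper() + ' ' + m.group(2), s)
print(upper)  # HELLO 1236 HELLO 456
```
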

3. Use the attributes and methods of the Match object to get information

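match.group() # the entire match; equivalent to match.group(0)

match.groups() # a tuple of all captured subgroups

match.start() # start index of the match

match.end() # end index of the match (exclusive)

match.span() # the (start, end) tuple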
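
A short sketch of these methods on a two-group pattern. start, end and span also accept a group number, which is shown in the last lines. The pattern and sample string follow the earlier match example.

```python
import re

m = re.match(r'([a-z]+) ([a-z]+)', 'hello world wide web')

print(m.group())   # hello world
print(m.group(0))  # hello world (same as group())
print(m.groups())  # ('hello', 'world')
print(m.start(), m.end())  # 0 11
print(m.span())    # (0, 11)

# Passing a group number gives the position of that group alone.
print(m.start(2), m.end(2))  # 6 11
print(m.span(2))   # (6, 11)
```
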

4. Matching Chinese characters

The Unicode range for Chinese characters is mainly [\u4e00-\u9fa5]. It does not include full-width Chinese punctuation, but it is enough in most cases.

# -*- coding: utf-8 -*-

import re

title = '你好，python ， 你好，世界 hello world'
pa = re.compile(r'[\u4e00-\u9fa5]+')
t = pa.findall(title)
print(t)   # ['你好', '你好', '世界']
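
To see why the full-width comma splits the result above, here is a small check of single characters against the same range. The name han is only for illustration.

```python
import re

han = re.compile(r'[\u4e00-\u9fa5]')

# A Chinese character inside the range matches.
print(han.fullmatch('好') is not None)  # True

# The full-width comma is U+FF0C, outside the range, so it does not match.
print(hex(ord('，')))                   # 0xff0c
print(han.fullmatch('，') is not None)  # False
```
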

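5. Greedy vs. non-greedy matching: quantifiers in Python are greedy by default

Greedy matching: while the overall match still succeeds, match as much as possible (for example *)

Non-greedy matching: while the overall match still succeeds, match as little as possible (add ? after the quantifier, for example *?)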

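# -*- coding: utf-8 -*-

import re

s = 'abbbbbbdsddbbbb'

res = re.findall(r'ab*', s)  # * matches the preceding character zero or more times
print(res)  # ['abbbbbb']  'a' alone already matches, but greedy matching keeps consuming every following 'b'

res2 = re.findall(r'ab*?', s)
print(res2)  # ['a']  once 'a' matches, non-greedy matching stops with zero 'b' characters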
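
A common place where the difference matters is text with repeated delimiters. The sketch below uses a made-up HTML-like string and the illustrative names greedy and lazy.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string, so everything is one match.
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r'<.*?>', html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```
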

Keep going, one step at a time. Stick with it and keep cheering yourself on.