Python Basics: Regular Expressions, Part 2

Regex functions

Python provides the re module, which contains all of its regular expression functionality.

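Because Python string literals also treat the backslash as an escape character, take care when writing patterns:

s = "ABC\\-001"

The corresponding regular expression is: 'ABC\-001'

With Python's r prefix (a raw string), you do not need to double the backslashes.

You can write s = r'ABC\-001'

The corresponding regular expression is: 'ABC\-001'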
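
To check the escaping rule above, here is a minimal sketch. The variable names plain and raw are illustrative and not from the original. It compares the two spellings and confirms that the resulting pattern matches the literal text ABC-001:

```python
import re

# Two spellings of the same pattern text (names are illustrative).
plain = "ABC\\-001"   # ordinary string: the backslash must be doubled
raw = r"ABC\-001"     # raw string: the backslash is written once

# Both literals hold the same 8 characters: A B C \ - 0 0 1
same_text = (plain == raw)
length = len(raw)

# In a regex, \- matches a literal hyphen, so the pattern matches 'ABC-001'.
matched = re.match(raw, "ABC-001") is not None

print(same_text, length, matched)  # True 8 True
```

Raw strings are usually the safer default for patterns, since sequences such as \d or \b would otherwise need doubling too.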

match()  Tries to match the pattern at the start of the string. On success it returns a Match object; otherwise it returns None.

import re

test = "string entered by the user"
if re.match(r'regular expression', test):
    print("OK")
else:
    print("failed")

Result: failed
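
Note that re.match only anchors the pattern at the start of the string, so input with trailing junk can still pass. For validating a whole string, re.fullmatch (available since Python 3.4) may be a better fit. A minimal sketch, assuming a hypothetical helper is_phone and an assumed number format of three digits, a hyphen, then three to eight digits:

```python
import re

# Assumed format for illustration: 3 digits, a hyphen, then 3 to 8 digits.
PHONE = r"\d{3}-\d{3,8}"

def is_phone(text):
    """Return True only if the whole string fits the pattern (hypothetical helper)."""
    return re.fullmatch(PHONE, text) is not None

# re.match accepts a valid prefix followed by junk...
prefix_only = re.match(PHONE, "010-12345abc") is not None

# ...while re.fullmatch requires the entire string to match.
print(prefix_only)               # True
print(is_phone("010-12345abc"))  # False
print(is_phone("010-12345"))     # True
```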

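# Regex functions
import re

print("---re.match only matches at the start of the string; if the start does not fit the pattern, the match fails and the function returns None")
print(re.match('www', 'wwwcom').group())  # pattern found at the start
print(re.match('www', 'comwww'))  # pattern is not at the start

print("---re.search scans the whole string and returns the first successful match; later matches are not returned")
print(re.search('baidu', 'www.baidu.com').group())
print(re.search('ai', 'www.baidu.com').group())

print("---re.findall scans the string from left to right and returns the matches in order; with no match it returns an empty list")
# findall returns a list of matches; precompiling with re.compile can help when a pattern is reused many times
#p = re.compile(r'\d+')
#print(p.findall('one1two2three3four4'))
print(re.findall(r'\d+', 'one1two2three3four4'))
print(re.findall('four', 'one1two2three3four4'))

Result:

---re.match only matches at the start of the string; if the start does not fit the pattern, the match fails and the function returns None
www
None
---re.search scans the whole string and returns the first successful match; later matches are not returned
baidu
ai
---re.findall scans the string from left to right and returns the matches in order; with no match it returns an empty list
['1', '2', '3', '4']
['four']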
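
The commented-out re.compile lines in the listing above hint at reusing a pattern. A minimal sketch of that idea, using the same sample string. Whether precompiling gives a measurable speedup depends on the workload, since the re module also caches recently used patterns. The sketch also guards against None before calling group():

```python
import re

# Compile once, then reuse the pattern object for several operations.
digits = re.compile(r"\d+")
text = "one1two2three3four4"

all_numbers = digits.findall(text)   # every run of digits, left to right

first = digits.search(text)          # first match anywhere in the string
first_number = first.group() if first else None

at_start = digits.match(text)        # None: the text starts with a letter

print(all_numbers)   # ['1', '2', '3', '4']
print(first_number)  # 1
print(at_start)      # None
```

Calling .group() directly on the result of match or search raises AttributeError when there is no match, which is why the check on first is there.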

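Groups:

Beyond simply testing whether a string matches, regular expressions can also extract substrings. Parentheses () mark the groups to extract. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string: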

import re

m = re.match(r"^(\d{3})-(\d{3,8})$", '010-12345')
print(m)
print(m.group(0))
print(m.group(1))
print(m.group(2))

Result (the exact text of the first line varies with the Python version):

<_sre.SRE_Match object at 0x00000000026360B8>
010-12345
010
12345

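If the pattern defines groups, you can extract the substrings with the group() method of the Match object.

Note that group(0) is always the entire matched text (here the whole string, since the pattern is anchored with ^ and $), while group(1), group(2), ... are the 1st, 2nd, ... groups.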
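
As a complement to numbered groups, the sketch below shows groups() and named groups. The group names area and local are illustrative choices, not part of the original example:

```python
import re

# Same phone pattern, with illustrative names attached to the two groups.
m = re.match(r"^(?P<area>\d{3})-(?P<local>\d{3,8})$", "010-12345")

if m:
    print(m.groups())        # ('010', '12345')
    print(m.group("area"))   # 010
    print(m.group(2))        # 12345 (named groups keep their numbers too)
    print(m.groupdict())     # {'area': '010', 'local': '12345'}

# A string that does not fit the pattern yields None, so check before using it.
no_match = re.match(r"^(\d{3})-(\d{3,8})$", "010 12345")
print(no_match)              # None
```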

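import re

print("---sub replaces the matches in a string")
# first argument: the pattern; second: the replacement string; third: the string to scan
print(re.sub('g..t', 'abc', 'gaat gbbt gcct'))

print("---split returns the list of pieces after splitting")
# + is a regex metacharacter, so it must be escaped to match a literal plus sign
print(re.split(r'\+', '123+456*789'))

Result:

---sub replaces the matches in a string
abc abc abc
---split returns the list of pieces after splitting
['123', '456*789']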
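
Two small variations on the calls above, as a sketch. A character class lets split cut on several delimiters at once, and the count argument limits how many replacements sub makes:

```python
import re

# Inside a character class, + and * are literal, so no escaping is needed.
parts = re.split(r"[+*]", "123+456*789")

# count=1 replaces only the first match and leaves the rest untouched.
once = re.sub(r"g..t", "abc", "gaat gbbt gcct", count=1)

print(parts)  # ['123', '456', '789']
print(once)   # abc gbbt gcct
```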

Exercise 1:

Suppose you have this URL: http://xqtesting.sxl.cn/archive/6688431.html
Extract its extension, that is, the .html part.

import re

# escape the dot so it matches a literal '.', and anchor the pattern at the end of the string
print(re.findall(r'\.html$', 'http://xqtesting.sxl.cn/archive/6688431.html'))

Result:

['.html']
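
A slightly more general take on this exercise, as a sketch. It assumes the extension is a dot followed by letters or digits at the very end of the URL, and it also shows why an unescaped dot is too permissive:

```python
import re

url = "http://xqtesting.sxl.cn/archive/6688431.html"

# Assumption: the extension is the final '.' plus letters/digits at the end.
m = re.search(r"\.[A-Za-z0-9]+$", url)
extension = m.group() if m else ""

# An unescaped dot matches any character, so it also accepts 'xhtml'.
loose = re.findall(".html", "xhtml")
strict = re.findall(r"\.html", "xhtml")

print(extension)  # .html
print(loose)      # ['xhtml']
print(strict)     # []
```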

Exercise 2:

When matching HTML tags in Python, what is the difference between <.*> and <.*?>? Try both on
<div><span>test</span></div>

import re

print(re.findall('<.*>','<div><span>test</span></div>'))
print(re.findall('<.*?>','<div><span>test</span></div>'))

Result:

['<div><span>test</span></div>']
['<div>', '<span>', '</span>', '</div>']
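
The difference in the output comes from greedy versus non-greedy matching: .* takes as many characters as possible, while .*? takes as few as possible. The sketch below restates this with re.search and adds two alternatives that are not in the original exercise: a negated character class, and a group that captures only the names of opening tags.

```python
import re

html = "<div><span>test</span></div>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.search(r"<.*?>", html).group()

# Alternative: "anything except '>'" gives the same tags as the lazy form.
tags = re.findall(r"<[^>]*>", html)

# With one group in the pattern, findall returns just the captured text.
open_names = re.findall(r"<(\w+)>", html)

print(greedy)      # <div><span>test</span></div>
print(lazy)        # <div>
print(tags)        # ['<div>', '<span>', '</span>', '</div>']
print(open_names)  # ['div', 'span']
```

Regular expressions are fine for small, well-formed snippets like this one; for real HTML documents a dedicated parser is generally the more robust choice.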