Web Scraping Notes, Part 2

Python's re module

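The re module is typically used in three steps:

First, call compile() to turn the regular expression string into a Pattern object.
Then call one of the Pattern object's methods to search the text. A successful match returns a Match object; if nothing matches, match() and search() return None.
Finally, use the attributes and methods of the Match object to read the result and carry on with whatever processing is needed.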
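The three steps above can be sketched as follows. The sample text and variable names are made up for illustration.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r'\d+')

# Step 2: run a matching method. search() returns a Match object,
# or None when the pattern is not found anywhere in the text.
match = pattern.search('order 66 shipped')

# Step 3: read the details from the Match object.
if match is not None:
    matched_text = match.group()  # the substring that matched
    start, end = match.span()     # (start, end) indices of the match
    print(matched_text, start, end)
```

Checking for None before calling group() avoids an AttributeError when the text contains no match.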

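Note: regular expressions use the backslash (\) to escape special characters, and ordinary Python string literals use it as an escape character too. Writing the pattern as a raw string, with an r prefix, passes the backslashes through to the regex engine unchanged.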
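A small sketch of why the r prefix matters. The variable names are illustrative only.

```python
import re

# Without the r prefix, each backslash meant for the regex must be doubled.
plain = '\\d+'
raw = r'\d+'
same_pattern = (plain == raw)  # both spell the two characters \ and d, then +

# '\b' in an ordinary string is one backspace character; in a raw string
# it stays as backslash + b, which the regex engine reads as a word boundary.
backspace_len = len('\b')
raw_len = len(r'\b')

digits = re.findall(raw, 'a1b22')
print(same_pattern, backspace_len, raw_len, digits)
```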

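Some commonly used methods of the Pattern object:

match method: tries to match only at the starting position (index 0 by default, or the given pos) and returns at most one match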

import re

text = 'one12twothree34four'
pattern = re.compile(r'\d+')  # one or more digits

# Match only within text[3:10], anchored at index 3
match_ret = pattern.match(text, 3, 10)
print(match_ret.group(), match_ret.span())

12 (3, 5)

search method: scans the string for the first position where the pattern matches and returns that single match

import re

text = 'one12twothree34four'
pattern = re.compile(r'\d+')

search_ret = pattern.search(text)
print(search_ret.group(), search_ret.span())

12 (3, 5)
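The two examples above print the same result only because match was told to start at index 3. A brief sketch of the difference, reusing the same sample string (the variable names are illustrative):

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

at_start = pattern.match(text)      # None: text[0] is 'o', not a digit
first_hit = pattern.search(text)    # scans forward and finds '12'
at_offset = pattern.match(text, 3)  # anchored at index 3, where '12' begins

print(at_start)
print(first_hit.group(), first_hit.span())
print(at_offset.group())
```

Because match returns None when the pattern does not fit at the starting position, the result should be checked before calling group().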

findall method: finds every non-overlapping match and returns them as a list

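import re

text = 'one12twothree34four'
pattern = re.compile(r'\d+')

# The optional pos and endpos arguments limit the search range
findall_ret = pattern.findall(text, 0, len(text))

# Digits, a literal dot, then optional digits: matches decimals only
pattern = re.compile(r'\d+\.\d*')
result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")
print(findall_ret, "/", result)

['12', '34'] / ['123.141593', '3.15']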

finditer method: finds every match and returns an iterator of Match objects

import re

pattern = re.compile(r'\d+')
result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)  # search text[0:10] only
print(type(result_iter1))
print('result1...')
for m1 in result_iter1:  # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))
print('result2...')
for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))

<class 'callable_iterator'>
result1...
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
result2...
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)

split method: splits the string wherever the pattern matches and returns a list

import re

p = re.compile(r'[\s,;]+')  # one or more whitespace characters, commas, or semicolons
print(p.split('a,b;; c   d'))

['a', 'b', 'c', 'd']

sub method: replaces matches

Syntax: sub(repl, string[, count])

import re

p = re.compile(r'(\w+) (\w+)')  # \w matches a word character: letters, digits, and underscore
s = 'hello 123, hello 456'

print(p.findall(s))
print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))  # backreferences: swap the two groups


def func(m):
    # m is the Match object for each occurrence; keep only the second group
    return 'hi ' + m.group(2)


print(p.sub(func, s))
print(p.sub(func, s, count=1))  # count caps the number of replacements; by default all matches are replaced

[('hello', '123'), ('hello', '456')]
hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456

Matching Chinese characters in text:

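Note: Python 2 needs the ur prefix here, where r makes the string raw and u makes it a unicode string. In Python 3 every str is already Unicode and the ur prefix is a syntax error, so r alone is enough.

import re

title = '你好，hello，世界'
pattern = re.compile(r'[\u4e00-\u9fa5]+')  # the re module interprets the \uXXXX escapes itself
ret = pattern.findall(title)
print(ret)
for i in ret:
    print(i)

['你好', '世界']
你好
世界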

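Note: greedy and non-greedy matching

Greedy: match as much as possible, as long as the whole expression still succeeds (for example *).
Non-greedy: match as little as possible, as long as the whole expression still succeeds. It is written by appending ? to a quantifier (for example *?).
Quantifiers in Python are greedy by default.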
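A minimal sketch of the difference, using a made-up HTML fragment:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* grabs as much as it can, so the match runs to the last </b>.
greedy = re.findall(r'<b>.*</b>', html)

# Non-greedy: .*? grabs as little as it can, so each match stops at the first </b>.
non_greedy = re.findall(r'<b>.*?</b>', html)

print(greedy)      # ['<b>one</b><b>two</b>']
print(non_greedy)  # ['<b>one</b>', '<b>two</b>']
```

When pulling individual tags or fields out of scraped pages, the non-greedy form is usually the one that is wanted.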