Regular Expressions 1

[] character sets

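Compilation flags:

Compilation flags let you modify some aspects of how a regular expression works. In the re module each flag is available under two names: a full name such as IGNORECASE, and a short one-letter form such as I.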

Flag            Meaning
DOTALL, S       Makes . match any character, including a newline
IGNORECASE, I   Makes matching case-insensitive
LOCALE, L       Performs locale-aware matching
MULTILINE, M    Multi-line matching; affects ^ and $
VERBOSE, X      Enables verbose REs, which can be laid out more clearly and readably
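
To make the table a bit more concrete, here is a minimal sketch of three of the flags in action. The sample strings and variable names are made up for illustration; only the re functions and flags themselves come from the standard library.

```python
import re

text = "Hello\nworld"

# IGNORECASE: the lowercase pattern also matches 'Hello'
ignorecase_hits = re.findall(r'hello', text, flags=re.I)

# MULTILINE: ^ matches at the start of every line, not only at the start of the string
first_line_only = re.findall(r'^\w+', text)
every_line = re.findall(r'^\w+', text, flags=re.M)

# VERBOSE: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
verbose_hits = number.findall("pi is 3.14")

print(ignorecase_hits, first_line_only, every_line, verbose_hits)
```

Both the long and the short names refer to the same flag object, so re.I and re.IGNORECASE are interchangeable.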

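For example:

import re
# Caution: the third positional argument of re.split is maxsplit, not flags.
# re.I|re.M equals 2|8 = 10, so this call stops after 10 splits.
print(re.split(r'[a-fA-F0-9p-q]', 'asdfghjkl;zxcvbnmqwertyuio1234567890q', re.I|re.M))

['', 's', '', 'ghjkl;zx', 'v', 'nm', 'w', 'rtyuio', '', '', '4567890q']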

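On the "importance" of re.M

print(re.split(r'[a-fA-F0-9]', 'aKadYcBw3a', re.I))
>>
['', 'K', 'dYcBw3a']

And what happens once re.M is added?

print(re.split(r'[a-fA-F0-9]', 'aKadYcBw3a', re.I|re.MULTILINE))
>>
['', 'K', '', 'Y', '', 'w', '', '']

At first glance it looks as if 'dYcBw3a' cannot be split without re.M. The real cause is the argument order: the signature is re.split(pattern, string, maxsplit=0, flags=0), so the third positional argument is maxsplit, not flags. re.I has the integer value 2, so the first call stops after 2 splits; re.I|re.MULTILINE is 10, which is more splits than this string needs. Neither flag is actually applied in either call. To pass flags to re.split, use the keyword form flags=re.I.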
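
The difference between the two calls above can be reproduced without any flag at all. The following is a small sketch (variable names are my own) that passes the numbers explicitly as maxsplit and then passes the flag by keyword.

```python
import re

s = 'aKadYcBw3a'

# Flags are integer-valued: re.I is 2 and re.I | re.M is 10
value_i = int(re.I)
value_im = int(re.I | re.M)

# maxsplit=2 reproduces the result that seemed to "need" re.M
limited = re.split(r'[a-fA-F0-9]', s, maxsplit=2)

# Passing the flag by keyword really applies it, with no limit on splits
with_flag = re.split(r'[a-f0-9]', s, flags=re.I)

print(value_i, value_im)
print(limited)
print(with_flag)
```

With flags=re.I the lowercase class [a-f0-9] also matches the uppercase 'B', which is what the flag was meant to do in the first place.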

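What does [][] mean?

# No flags are passed here, so matching is case-sensitive ('cB' is not a match)
print(re.split(r'[a-f][a-f]', 'aKadYcBw3aadawgfdsgeadwadadsgfh'))
>>
['aK', 'YcBw3', '', 'wg', 'sg', 'dw', '', 'sgfh']

In other words, [1][2] has to satisfy both condition 1 and condition 2, and the two matched characters must be directly adjacent.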
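
A smaller sketch of the same idea, using two different classes so that it is easier to see which condition each character has to meet. The sample strings are invented for illustration.

```python
import re

# [a-f][0-9] needs a letter a-f immediately followed by a digit
pairs = re.findall(r'[a-f][0-9]', 'a1 b x2 c34 d e5')

# The same pattern used as a separator
pieces = re.split(r'[a-f][0-9]', 'xxa1yyb2zz')

print(pairs)
print(pieces)
```

'b' and 'd' are skipped because a space follows them, and 'x2' is skipped because 'x' is outside a-f.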

Digits: \d (same as [\d]); non-digits: \D (same as [^\d])

print(re.findall(r'\d', 'finqwen324 main st.iewasd', re.I|re.M))
['3', '2', '4']

print(re.findall(r'\d*', 'finqwen324 main st.iewasd', re.I|re.M))
['', '', '', '', '', '', '', '324', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
print(re.findall(r'\d?', 'finqwen324 main st.iewasd'))
['', '', '', '', '', '', '', '3', '2', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
print(re.findall(r'\d{1,5}', 'finqwen324 main st.iewasd'))
['324']
print(re.findall(r'\d{1,2}', 'finqwen324 main st.iewasd'))
['32', '4']  # greedy: takes the larger count first (2 digits), then whatever is left (1 digit)
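
The "larger count first" behaviour is greedy matching. As a small sketch, appending ? to the quantifier makes it lazy, so it takes the smallest allowed count instead.

```python
import re

s = 'finqwen324 main st.iewasd'

# Greedy: take up to 2 digits whenever possible
greedy = re.findall(r'\d{1,2}', s)

# Lazy: the trailing ? makes {1,2} stop at the minimum, 1 digit
lazy = re.findall(r'\d{1,2}?', s)

print(greedy)
print(lazy)
```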

Whitespace: \s; non-whitespace: \S (same pattern as above)
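
A quick sketch of \s and its negation \S, on a made-up string that mixes spaces, a tab and a newline.

```python
import re

s = 'a  b\tc\nd'

# \s+ treats any run of spaces, tabs or newlines as one separator
fields = re.split(r'\s+', s)

# \S+ collects the runs of non-whitespace characters directly
tokens = re.findall(r'\S+', s)

print(fields)
print(tokens)
```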

*=0 or more

+=1 or more

?=0 or 1 of...

{5}=exact number of...

{1,60}=range on number of... (note: in {m,n} either bound may be omitted; without m it matches 0 to n times, without n it matches m or more times)
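
The bounded forms can be checked with re.fullmatch, which only succeeds when the whole string fits the pattern. This is a minimal sketch on invented test words.

```python
import re

words = 'a aa aaa aaaa'.split()

# {2}: exactly two repetitions
exactly_two = [w for w in words if re.fullmatch(r'a{2}', w)]

# {3,}: n omitted, so three or more
three_or_more = [w for w in words if re.fullmatch(r'a{3,}', w)]

# {,2}: m omitted, so zero to two
up_to_two = [w for w in words if re.fullmatch(r'a{,2}', w)]

print(exactly_two, three_or_more, up_to_two)
```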

print(re.findall(r'\d{1,3}\s\w+', 'finqwen324 main st.iewasd'))
['324 main']
print(re.findall(r'\d{1,3}\s\w+\s\w+.', 'finqwen324 main st.iewasd'))
['324 main st.']

. matches any character except the newline \n; in DOTALL mode it matches \n as well
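
A short sketch of the difference, on a made-up two-line string.

```python
import re

text = 'line one\nline two'

# By default . stops at the newline, so .+ yields one match per line
per_line = re.findall(r'.+', text)

# With DOTALL, . also matches \n and .+ swallows the whole string
whole = re.findall(r'.+', text, flags=re.DOTALL)

print(per_line)
print(whole)
```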

Escape characters
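
As a sketch of what escaping means in practice: a backslash in front of a metacharacter such as . or + makes it match literally. The sample string is invented; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

s = 'st. stx 1+1=2'

# Unescaped . is a wildcard, so 'stx' matches too
wildcard = re.findall(r'st.', s)

# \. matches only a literal period
literal_dot = re.findall(r'st\.', s)

# re.escape builds a pattern that matches the given text literally
literal_sum = re.findall(re.escape('1+1=2'), s)

print(wildcard, literal_dot, literal_sum)
```

Writing patterns as raw strings (r'...') keeps Python itself from interpreting the backslashes before the re module sees them.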

test:

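import re
import urllib.request

sites = 'google bing baidu cnn bbc'.split()
# .*? is non-greedy, so the match ends at the first </title>.
# re.M is dropped: it only affects ^ and $, which this pattern does not use.
pat = re.compile(r'<title>.*?</title>', re.I)
for site in sites:
    print('Searching: ' + site)
    try:
        u = urllib.request.urlopen('http://' + site + '.com')
        # Decode the bytes so the pattern runs on text, not on a bytes repr
        text = u.read().decode('utf-8', errors='replace')
    except OSError:
        # Network failure (URLError is an OSError): skip this site
        continue
    title = pat.findall(text)
    if title:
        print(title[0])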

Result:

Searching: google
<title>Google</title>
Searching: bing
<title>Bing</title>
Searching: baidu
Searching: cnn
<title>CNN - Breaking News, U.S., World, Weather, Entertainment & Video News</title>
Searching: bbc
<title>BBC - Homepage</title>

To be continued...