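Python Basics: Regular Expressions

Regular Expressions

Regular expressions in Python are provided by the re module.

import re

Character matching (ordinary characters and metacharacters):

Ordinary characters: most letters and characters simply match themselves.

Metacharacters:

.  ^  $  *  +  ?  {  }  [  ]  |  (  )  \

"." matches any single character (except a newline, by default)

res = re.findall('ale.x', 'kwaleexsandra')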
print(res)

>>>
['aleex']
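By default "." stops at line breaks. Here is a minimal sketch of the difference the re.DOTALL flag makes; the sample text is made up for illustration:

```python
import re

text = "ale\nx alebx"

# By default "." does not match a newline, so "ale\nx" is skipped.
default_matches = re.findall(r"ale.x", text)

# With re.DOTALL, "." matches the newline as well.
dotall_matches = re.findall(r"ale.x", text, flags=re.DOTALL)

print(default_matches)  # ['alebx']
print(dotall_matches)   # ['ale\nx', 'alebx']
```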
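
"^" matches the start of the string (with re.MULTILINE, also the start of each line)

res = re.findall('^alex', 'alexsandra')
print(res)

>>>
['alex']

"$" matches the end of the string (with re.MULTILINE, also the end of each line)

res = re.findall('alex$', 'sandraalex')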
print(res)

>>>
['alex']
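Whether "^" and "$" work per line or per string depends on the re.MULTILINE flag. A small sketch, using a made-up three-line string:

```python
import re

log = "alex one\nbob two\nalex three"

# Without re.MULTILINE, "^" anchors only at the start of the whole string.
first_only = re.findall(r"^alex", log)

# With re.MULTILINE, "^" also anchors right after each newline.
every_line = re.findall(r"^alex", log, flags=re.MULTILINE)

# "$" behaves the same way for line ends.
line_ends = re.findall(r"\w+$", log, flags=re.MULTILINE)

print(first_only)  # ['alex']
print(every_line)  # ['alex', 'alex']
print(line_ends)   # ['one', 'two', 'three']
```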

"*" matches the preceding character any number of times, including zero; same as {0,}

res = re.findall('alex*', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex*', 'wwwale')
print(res)

>>>
['ale']

res = re.findall('alex*', 'wwwalexxxx')
print(res)

>>>
['alexxxx']

"+" matches the preceding character at least once and up to any number of times; same as {1,}

res = re.findall('alex+', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex+', 'wwwalexxxx')
print(res)

>>>
['alexxxx']

res = re.findall('alex+', 'wwwale')
print(res)

>>>
[]

"?" matches the preceding character zero or one time; same as {0,1}

res = re.findall('alex?', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex?', 'wwwalexxxx')
print(res)

>>>
['alex']

res = re.findall('alex?', 'wwwale')
print(res)

>>>
['ale']

"{min,max}" matches the preceding character at least min times and at most max times

res = re.findall('alex{3}', 'wwwalexxxx') # "ale" followed by exactly 3 "x"
print(res)

>>>
['alexxx']


res = re.findall('alex{3,5}', 'wwwalexxxx') # "ale" followed by at least 3 and at most 5 "x"
print(res)

>>>
['alexxxx']
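All of the quantifiers above are greedy: they take as many repetitions as they can. Appending "?" to a quantifier makes it non-greedy. A short sketch on the same sample string:

```python
import re

text = "wwwalexxxx"

# Greedy: take as many "x" as allowed (up to 3).
greedy = re.findall(r"alex{1,3}", text)

# Non-greedy: stop as soon as the minimum (1) is satisfied.
lazy = re.findall(r"alex{1,3}?", text)

# Non-greedy "*" is happy with zero repetitions.
lazy_star = re.findall(r"alex*?", text)

print(greedy)     # ['alexxx']
print(lazy)       # ['alex']
print(lazy_star)  # ['ale']
```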

"[ ]" matches exactly one of the characters listed inside the brackets

res = re.findall('a[bc]d', 'wwwabd')
print(res)

>>>
['abd']

res = re.findall('a[bc]d', 'wwwacd')
print(res)

>>>
['acd']

res = re.findall('a[bc]d', 'wwwabcd')
print(res)

>>>
[]

"-" denotes a range covering every character between its two endpoints; it only works inside "[ ]"

res = re.findall('a-z', 'wwwabd') # outside "[ ]" this is the literal text "a-z"
print(res)

>>>
[]

res = re.findall('[a-z]', 'wwwabd')
print(res)

>>>
['w', 'w', 'w', 'a', 'b', 'd']

res = re.findall('1-9', '127.0.0.1')
print(res)

>>>
[]

res = re.findall('[1-9]', '127.0.0.1')
print(res)

>>>
['1', '2', '7', '1']

"^" as the first character inside "[ ]" negates the set

res = re.findall('[^0-9]', "portpostid9987-alex")
print(res)

>>>
['p', 'o', 'r', 't', 'p', 'o', 's', 't', 'i', 'd', '-', 'a', 'l', 'e', 'x']


res = re.findall('[^a-z]', "portpostid9987-alex")
print(res)

>>>
['9', '9', '8', '7', '-']
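
"\"
A backslash followed by a metacharacter removes its special meaning;
a backslash followed by certain ordinary characters gives them a special meaning;
a backslash followed by a number refers to the text matched by the group with that number.

\d matches any decimal digit; equivalent to [0-9]
res = re.findall(r'\d', "id9987-alex")
print(res)

>>>
['9', '9', '8', '7']

[\d] "\" keeps its special meaning inside a character class
res = re.findall(r'[\d]', "id9987-alex")
print(res)

>>>
['9', '9', '8', '7']

\D matches any non-digit character; equivalent to [^0-9]
res = re.findall(r'\D', "id9987-alex")
print(res)

>>>
['i', 'd', '-', 'a', 'l', 'e', 'x']

\w matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]
res = re.findall(r'\w', "id9987-alex")
print(res)

>>>
['i', 'd', '9', '9', '8', '7', 'a', 'l', 'e', 'x']

\W matches any character that is not alphanumeric or an underscore; equivalent to [^a-zA-Z0-9_]
res = re.findall(r'\W', "id9987-alex")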
print(res)

>>>
['-']
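The whitespace classes \s and \S follow the same upper/lower-case convention. A quick sketch, with a made-up sample string:

```python
import re

text = "id 9987\talex\n"

# \s matches a single whitespace character (space, tab, newline, ...).
whitespace = re.findall(r"\s", text)

# \S+ matches runs of non-whitespace characters.
tokens = re.findall(r"\S+", text)

print(whitespace)  # [' ', '\t', '\n']
print(tokens)      # ['id', '9987', 'alex']
```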

"( )" matches the regular expression inside the parentheses and saves the matched text as a group

res = re.search("(ab)*", "aba").group()
print(res)

>>>
ab

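\b matches a word boundary

ret = re.findall(r"\babc\b", "sdasdssd abc asdssdasds")
print(ret)

>>>
['abc']

>>> re.findall("\babc\b", "asdas abc") # without "r", Python turns "\b" into a backspace character
[]
>>> re.findall(r"\babc\b", "asdas abc") # "r" passes "\b" through to the regex engine with its special meaning
['abc']
>>> re.findall(r"\bI", "IMISS IOU") # match "I" at the start of a word
['I', 'I']
>>> re.findall(r"I\b", "IMISS IOU") # match "I" at the end of a word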
[]

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""
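A few of the extension forms listed in the docstring are easier to grasp with a concrete case. The sketch below tries a named group with a backreference, a lookahead, and a lookbehind; the sample strings are only for illustration:

```python
import re

# (?P<word>...) names a group; (?P=word) requires the same text again.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")
print(doubled.group("word"))   # is

# (?=...) is a lookahead: ".com" must follow, but is not part of the match.
before_com = re.findall(r"\w+(?=\.com)", "www.baidu.com")
print(before_com)              # ['baidu']

# (?<=...) is a lookbehind: "-" must come right before the match.
after_dash = re.findall(r"(?<=-)\w+", "portpostid9987-alex")
print(after_dash)              # ['alex']
```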

Main regular expression functions

search()

Scans the whole string and returns the first match for the pattern.

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

origin = "hello alex bac alex leg alex tms alex 19"

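# "a(\w+)" matches the first word that starts with "a" and captures the characters after "a" as group 1
# ".*" greedily matches everything after that word, backtracking just enough to leave the final digit
# "(?P<name>\d)$" captures the last digit of the string as the named group "name"
r2 = re.search(r"a(\w+).*(?P<name>\d)$", origin)

print(r2.group())
>>>alex bac alex leg alex tms alex 19

print(r2.groups()) # the groups captured by the pattern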
>>>('lex', '9')

print(r2.groupdict())
>>>{'name': '9'}

origin = "hello alex bad alrx lge alex acd 19w"

n = re.search(r"(a)(\w+)", origin)

print(n.group())
>>>
alex

print(n.groups())
>>>
('a', 'lex')
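Note that search() returns None when nothing matches, so calling .group() directly on the result can raise AttributeError. A minimal sketch of the usual guard; first_a_word is a hypothetical helper name, not part of re:

```python
import re


def first_a_word(text):
    """Return the first word starting with "a", or None if there is none."""
    m = re.search(r"\ba\w*", text)
    # search() gives back None on failure, so check before using the match.
    return m.group() if m else None


print(first_a_word("hello alex bad alrx"))  # alex
print(first_a_word("hello world"))          # None
```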

match()

Matches only at the beginning of the string.

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

# flags
A = ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments

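Matching modes:

With groups: when the pattern contains "()", group() still returns the whole match, and the text matched by each pair of parentheses is additionally stored in groups(). (?P<name>...) gives a group a key, so that group is also stored as a key/value pair in groupdict().

Without groups: everything matched is available only through group().

origin = "hello alex bac alex leg alex tms alex"
r = re.match(r"h\w+", origin)
r2 = re.match(r"(h)(\w+)", origin)
print(r.group()) # the whole match
print(r2.group())

>>>
hello
hello

print(r.groups()) # the groups captured by the pattern
print(r2.groups())
>>>
()
('h', 'ello')


# (?P<xx>X): the name in angle brackets is the key, the matched text is the value
r3 = re.match(r"(?P<n1>h)(?P<n2>\w+)", origin)
print(r.groupdict()) # the named groups as a dict
print(r3.groupdict())
>>>
{}
{'n1': 'h', 'n2': 'ello'}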
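Groups can also be read one at a time, by index or by name. The sketch below shows this, plus the difference between match() and search() when the pattern does not occur at the very start:

```python
import re

m = re.match(r"(?P<n1>h)(?P<n2>\w+)", "hello alex")

whole = m.group(0)        # the whole match
first = m.group(1)        # group by index
second = m.group("n2")    # group by name
where = m.span("n2")      # (start, end) of a group

# match() only tries position 0; search() scans forward.
at_start = re.match(r"alex", "hello alex")
anywhere = re.search(r"alex", "hello alex")

print(whole, first, second, where)  # hello h ello (1, 5)
print(at_start)                     # None
print(anywhere.span())              # (6, 10)
```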

findall()

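Scans the string from left to right; after each successful match the scan resumes right after the last matched character, and all matches are returned together in a list.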

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

When the pattern contains groups, findall returns the group contents instead of the whole match.

(1) If the pattern has one group, each list element is the text matched by that group.

(2) If the pattern has several groups, each list element is a tuple holding the text of every group.

origin = "hello alex bad alrx lge alex acd 19w"


r = re.findall(r"a\w+", origin) # every "a" followed by one or more word characters, returned as a list
print(r)

>>>
['alex', 'ad', 'alrx', 'alex', 'acd']

r1 = re.findall(r"a(\w+)", origin) # only the characters that follow each "a"
print(r1)

>>>
['lex', 'd', 'lrx', 'lex', 'cd']

r2 = re.findall(r"(a)(\w+)", origin) # two groups: the letter "a" and what follows it, returned as tuples
print(r2)

>>>
[('a', 'lex'), ('a', 'd'), ('a', 'lrx'), ('a', 'lex'), ('a', 'cd')]

r3 = re.findall(r"(a)(\w+)(x)", origin)
print(r3)

>>>
[('a', 'le', 'x'), ('a', 'lr', 'x'), ('a', 'le', 'x')]

r4 = re.findall(r"(a)(\w+(e))(x)", origin)
print(r4)

>>>
[('a', 'le', 'e', 'x'), ('a', 'le', 'e', 'x')]

ret = re.findall(r"a(\d+)", "a23b")
print(ret, type(ret))

>>>
['23'] <class 'list'>

ret = re.search(r"a(\d+)", "a23b").group()
print(ret, type(ret))

>>>
a23 <class 'str'>

ret = re.match(r"a(\d+)", "a23b").group()
print(ret, type(ret))

>>>
a23 <class 'str'>

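With a greedy pattern that can match the empty string, such as "()*", findall also returns an extra empty match.

>>> re.findall(r"www\.(baidu|laonanhai)\.com", "asdqwerasdf www.baidu.com") # with a group, findall returns only the group contents
['baidu']
>>> re.findall(r"www\.(?:baidu|laonanhai)\.com", "asdqwerasdf www.baidu.com") # "?:" makes the group non-capturing, so the whole match is returned
['www.baidu.com']
>>>

r = re.findall(r"\d+\w\d+", "a2b3c4d5") # matches do not overlap: after "2b3" the scan resumes at "c"
print(r)

>>>
['2b3', '4d5']

a = "alex"
n = re.findall(r"(\w)(\w)(\w)(\w)", a)
print(n)

>>>
[('a', 'l', 'e', 'x')]

n2 = re.findall(r"(\w)*", a) # "*" is greedy: the group keeps only its last repetition ("x"), then an empty match is found at the end of the string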
print(n2)

>>>
['x', '']

n3 = re.findall("", "abcd")
print(n3)

>>>
['', '', '', '', '']

finditer()

Returns an iterator that yields a match object for each match.

origin = "hello alex bad alrx lge alex acd 19w"

r = re.finditer(r"(a)(\w+(e))(?P<N1>x)", origin)
print(r)

>>>
<callable_iterator object at 0x00275A30>

for i in r:
    print(i)
    >>>
    <_sre.SRE_Match object; span=(6, 10), match='alex'>
    <_sre.SRE_Match object; span=(24, 28), match='alex'>

    print(i.group())
    >>>
    alex
    alex

    print(i.groups())
    >>>
    ('a', 'le', 'e', 'x')
    ('a', 'le', 'e', 'x')

    print(i.groupdict())
    >>>
    {'N1': 'x'}
    {'N1': 'x'}
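Because each item is a full match object, finditer is handy when the positions matter as well as the text. A compact sketch that collects everything in one pass over the same sample string:

```python
import re

origin = "hello alex bad alrx lge alex acd 19w"

# Each match object exposes start(), end(), group() and groupdict().
rows = [
    (m.start(), m.end(), m.group(), m.groupdict())
    for m in re.finditer(r"(a)(\w+(e))(?P<N1>x)", origin)
]

for row in rows:
    print(row)
# (6, 10, 'alex', {'N1': 'x'})
# (24, 28, 'alex', {'N1': 'x'})
```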

sub()

def sub(pattern, repl, string, count=0, flags=0):

    return _compile(pattern, flags).sub(repl, string, count)

old_str = "123askdjf654lkasdfasdfwer456"
new_str = re.sub(r"\d+", "TMD", old_str, count=2) # replace the first two runs of digits, in order, with "TMD"
print(new_str)

>>>
TMDaskdjfTMDlkasdfasdfwer456
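The replacement does not have to be a fixed string. As a sketch, here are two other forms that re.sub accepts: a template with a backreference, and a function that receives the match object:

```python
import re

old_str = "123askdjf654lkasdfasdfwer456"

# r"\1" in the replacement inserts the text captured by group 1.
bracketed = re.sub(r"(\d+)", r"[\1]", old_str)

# A callable is invoked once per match and returns the replacement text.
doubled_first = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), old_str, count=1)

print(bracketed)      # [123]askdjf[654]lkasdfasdfwer[456]
print(doubled_first)  # 246askdjf654lkasdfasdfwer456
```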

subn()

def subn(pattern, repl, string, count=0, flags=0):

    return _compile(pattern, flags).subn(repl, string, count)

old_str = "123askdjf654lkasdf789asdfwer456"
new_str, count = re.subn(r"\d+", "TMD", old_str) # returns the new string and the number of replacements made
print(new_str, count)

>>>
TMDaskdjfTMDlkasdfTMDasdfwerTMD 4

split()

origin = "hello alex bcd abcd lge acd 19"

n = re.split(r"a\w+", origin) # split on every run that starts with "a"; the matched text itself is dropped
print(n)
>>>
['hello ', ' bcd ', ' lge ', ' 19']

n1 = re.split(r"(a\w+)", origin, maxsplit=1) # with a capturing group, the matched text is kept as a list element between the two pieces
print(n1)
>>>
['hello ', 'alex', ' bcd abcd lge acd 19']

n2 = re.split(r"a(\w+)", origin, maxsplit=1) # only the captured part of the separator is kept
print(n2)
>>>
['hello ', 'lex', ' bcd abcd lge acd 19']

origin2 = "1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)"

n1 = re.split(r"(\([^()]+\))", origin2, maxsplit=1) # the group includes the parentheses
print(n1)
>>>
['1-2*((60-30+', '(-40.0/5)', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)']

n2 = re.split(r"\(([^()]+)\)", origin2, maxsplit=1) # the group excludes the parentheses
print(n2)
>>>
['1-2*((60-30+', '-40.0/5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)']

# Using split in a calculator

origin2 = "1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)"

def f1(content): # placeholder for the real evaluation function
    return "1"


while True:
    print(origin2)
    result = re.split(r"\(([^()]+)\)", origin2, maxsplit=1)
    if len(result) == 3:
        before = result[0]
        content = result[1]
        after = result[2]
        r = f1(content)
        new_str = before + str(r) + after
        origin2 = new_str
    else:
        f_result = f1(origin2)
        print(f_result)
        break

>>>
1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)
1-2*((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)
1-2*((60-30+1*1)-(-4*3)/(16-3*2)
1-2*(1-(-4*3)/(16-3*2)
1-2*(1-1/(16-3*2)
1-2*(1-1/1
1
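The loop above only shows the control flow, since f1 always returns "1". As a rough sketch of what a real evaluator could look like, the code below replaces the stub with a small function for parenthesis-free expressions. The names eval_flat and evaluate are hypothetical, the tokenizer is an assumption of this sketch, and it handles only +, -, *, / on well-formed input with balanced parentheses:

```python
import re

# A number (optionally signed when it does not directly follow a digit) or an operator.
TOKEN = re.compile(r"(?<![\d.])-?\d+(?:\.\d+)?|[-+*/]")
INNER = re.compile(r"\(([^()]+)\)")  # innermost parentheses


def eval_flat(expr):
    """Evaluate an expression without parentheses, e.g. "2*-3+4"."""
    tokens = TOKEN.findall(expr)
    values = [float(tokens[0])]
    pending = []  # "+" / "-" operators left for the second pass
    # First pass: fold "*" and "/" into the previous value.
    for op, num in zip(tokens[1::2], tokens[2::2]):
        if op == "*":
            values[-1] *= float(num)
        elif op == "/":
            values[-1] /= float(num)
        else:
            pending.append(op)
            values.append(float(num))
    # Second pass: apply "+" and "-" from left to right.
    total = values[0]
    for op, value in zip(pending, values[1:]):
        total = total + value if op == "+" else total - value
    return total


def evaluate(expr):
    """Repeatedly replace the innermost parentheses with their value."""
    m = INNER.search(expr)
    while m:
        expr = expr[:m.start()] + str(eval_flat(m.group(1))) + expr[m.end():]
        m = INNER.search(expr)
    return eval_flat(expr)


print(evaluate("2*(3+4)-(10/(2+3))"))  # 12.0
```

Very large or very small intermediate results would be printed by str() in exponent notation, which this tokenizer does not understand, so treat it as a starting point rather than a finished calculator.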

compile

Compile a pattern into a RegexObject.
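A short sketch of how a compiled pattern is typically used: compile once, then call the same methods (findall, sub, search, ...) on the resulting object. The variable names are just for illustration:

```python
import re

# Compile once, reuse many times.
number = re.compile(r"\d+")
found = number.findall("a1b22c333")
masked = number.sub("#", "a1b22")

# Flags are passed at compile time.
name = re.compile(r"alex", re.IGNORECASE)
names = name.findall("Alex ALEX alex")

print(found)   # ['1', '22', '333']
print(masked)  # a#b#
print(names)   # ['Alex', 'ALEX', 'alex']
```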
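
Using the "r" prefix

The "r" prefix marks a raw string: Python does not process backslash escapes in the literal, so they reach the regex engine unchanged.

>>> re.match('\bblow', "blow")
# no match: Python turns "\b" into a backspace character

>>> re.match('\\bblow', 'blow') # "\\" escapes the backslash in Python, so the regex engine receives "\b", the word-boundary marker
<_sre.SRE_Match object; span=(0, 4), match='blow'>

>>> re.match(r'\bblow','blow') # a raw string passes "\b" through untouched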
<_sre.SRE_Match object; span=(0, 4), match='blow'>
>>>
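One way to see what the "r" prefix changes is to compare the strings themselves before they ever reach the re module. A small sketch:

```python
import re

plain = "\bblow"      # "\b" becomes one backspace character
raw = r"\bblow"       # backslash and "b" stay as two characters
escaped = "\\bblow"   # same text as the raw string

print(len(plain), len(raw))          # 5 6
print(escaped == raw)                # True
print(re.match(plain, "blow"))       # None
print(re.match(raw, "blow").group()) # blow
```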

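Matching the contents of parentheses

# Match the innermost parentheses first
# "\(  \)"       literal parentheses around the content to match
# "\([^()]"      [^()] matches one character that is not a parenthesis, such as a digit or an operator
# "\([^()]*"     [^()]* matches any number of such characters
# "\([^()]*\)"

source = "2*3+4(2*(3+4.5*5.5)-5)"
# res = re.search(r"(\d+[+\-*])")
res = re.search(r"\([^()]*\)", source).group()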
print(res)

>>>
(3+4.5*5.5)

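Matching floating-point numbers

# Match a floating-point number
# "\d+"        one or more digits
# "\d+\.?"     "\.?" matches an optional decimal point; the backslash removes the special meaning of "."
# "\d+\.?\d*"  "\d*" matches the digits after the decimal point; since the decimal point is optional there may be none, hence "*"

source_f = "abc3.5555abc"
r = re.search(r"(\d+\.?\d*)", source_f).group()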
print(r)

>>>
3.5555

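Matching multiplication, division, and power operations

# Match multiplication, division, and power operations
# "[*/]"          matches "*" or "/"
# "[*/]|\*\*"     matches "*", "/", or the power operator "**"
# "([*/]|\*\*)"   groups the operator that appears in the expression
# "\d+\.?\d+"  "([*/]|\*\*)"  "\d+\.?\d+"


res = "(3+4.5*5.5)"

res1 = re.search(r'\d+\.?\d+([*/]|\*\*)\d+\.?\d+', res)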
print(res1, type(res1))

>>><_sre.SRE_Match object; span=(3, 10), match='4.5*5.5'> <class '_sre.SRE_Match'>

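Matching an IP address

# Match an IP address
# Each field is in the range 0-255: (1) match 0-199, (2) match 200-249, (3) match 250-255
# "[01]?\d?\d"  an optional leading 0 or 1, followed by one or two digits (0-9 or 00-99)
# "2[0-4]\d"    a leading 2, then a digit in 0-4, then any digit
# "25[0-5]"     a leading "25", then a digit in 0-5

w = re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", '192.168.1.1').group()
print(w, type(w))

>>>
192.168.1.1 <class 'str'>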
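search() will happily find an address-like piece inside a longer string, so for validation it is safer to require that the whole string matches. The sketch below is one possible variant, not the pattern above: is_ipv4 is a hypothetical helper, the alternatives are ordered from most to least specific, and leading zeros such as "001" are deliberately rejected:

```python
import re

# One field: 250-255, 200-249, 100-199, or 0-99 without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# re.ASCII keeps \d limited to 0-9.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}", re.ASCII)


def is_ipv4(text):
    """Return True only if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.1.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False
print(is_ipv4("1.2.3"))            # False
```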

Patterns for the calculator

l1_expression = re.compile(r'(-?\d+)(\.\d+)?[-+](-?\d+)(\.\d+)?')             # addition and subtraction

l2_expression = re.compile(r'(-?\d+)(\.\d+)?[/*](-?\d+)(\.\d+)?')             # multiplication and division

l3_expression = re.compile(r'(-?\d+)(\.\d+)?\*-(-?\d+)(\.\d+)?')              # multiplication by a negative number

l4_expression = re.compile(r'(-?\d+)(\.\d+)?/-(-?\d+)(\.\d+)?')               # division by a negative number

l5_expression = re.compile(r'\([^()]*\)')                                     # innermost parentheses
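To show how one of these patterns might drive a calculation step, here is a sketch that finds the first multiplication or division with l2_expression and splices the result back into the string. apply_first_mul_div is a hypothetical helper, and the sign handling is an assumption of this sketch:

```python
import re

l2_expression = re.compile(r'(-?\d+)(\.\d+)?[/*](-?\d+)(\.\d+)?')


def apply_first_mul_div(expr):
    """Evaluate the first "a*b" or "a/b" in expr and substitute the result."""
    m = l2_expression.search(expr)
    if m is None:
        return expr  # nothing to do
    # Groups 1-2 form the left operand, groups 3-4 the right operand.
    left_text = m.group(1) + (m.group(2) or "")
    right_text = m.group(3) + (m.group(4) or "")
    operator = m.group()[len(left_text)]  # the character right after the left operand
    left, right = float(left_text), float(right_text)
    value = left * right if operator == "*" else left / right
    text = str(value)
    # If the match swallowed a "-" that was acting as a subtraction sign,
    # keep the terms separated with an explicit "+".
    if m.group(1).startswith("-") and value >= 0 and m.start() > 0:
        text = "+" + text
    return expr[:m.start()] + text + expr[m.end():]


print(apply_first_mul_div("3+4.5*5.5-1"))  # 3+24.75-1
print(apply_first_mul_div("3-4*-2"))       # 3+8.0
```

Calling the function repeatedly until the string stops changing would reduce all multiplications and divisions; the additions and subtractions could then be handled the same way with l1_expression.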