I ran into a small bug and am writing it down here.
1. The regular expression, and how it handles English text:
>>> import re
>>> b = 'adfasdfasf<1safadsaf>23wfsa<13131>'
>>> pat = re.compile('<.*?>')
>>> pat.findall(b)
['<1safadsaf>', '<13131>']
2. With a Chinese message, the same pattern seems to match nothing
>>> msg = "<Fault warning -- \xb4\xed\xce\xf3!\n\xb2\xfa\xc6\xb7\xb1\xe0\xba\xc5\xb1\xd8\xd0\xe8\xce\xa8\xd2\xbb,\xd0\xc2\xb1\xe0\xba\xc53123\xb6\xd4\xd3\xa6\xb5\xc4\xb2\xfa\xc6\xb7\xd2\xd1\xbe\xad\xb4\xe6\xd4\xda!\n\xc8\xe7\xb9\xfb\xc4\xfa\xca\xd4\xcd\xbc\xcd\xa8\xb9\xfd\xb8\xb4\xd6\xc6\xc0\xb4\xc9\xfd\xbc\xb6\xb2\xfa\xc6\xb7\xd4\xf2\xcb\xb5\xc3\xf7\xb4\xcb\xb2\xfa\xc6\xb7\xd2\xd1\xbe\xad\xb4\xe6\xd4\xda\xc9\xfd\xbc\xb6\xb0\xe6\xa3\xac\xc7\xeb\xc1\xf4\xd2\xe2\xa1\xa3: ''>"
>>> pat.findall(msg)
[]
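On closer inspection, the cause seems to be the '\n' characters inside the message, not the Chinese text.
Still puzzled, I tried a few more cases: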
>>> msg = '< >asdasf<asdfaf>'
>>> pat.findall(msg)
['< >', '<asdfaf>']
>>> msg = '<\n>adf<afd>'
>>> pat.findall(msg)
['<afd>']
>>> msg = '<\s>adaf<asdfa>'
>>> pat.findall(msg)
['<\\s>', '<asdfa>']
>>> msg = '<\n>asdfasf<asfa>'
>>> pat.findall(msg)
['<asfa>']
So the dot really cannot match the special character '\n'!
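The behavior can be isolated in a few lines. This is only a minimal sketch; the sample string '<a\nb><c>' and the variable names are made up for illustration.

```python
import re

# With default flags, the dot matches an ordinary space but not a newline.
dot = re.compile('.')
matches_space = dot.match(' ') is not None
matches_newline = dot.match('\n') is not None

# As a result, '<.*?>' skips any tag that has a newline between '<' and '>'.
tag = re.compile('<.*?>')
found = tag.findall('<a\nb><c>')

print(matches_space, matches_newline, found)
```

Only the tag without a newline is reported, which is the same symptom as in the session above.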
I found the explanation here.
. | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[.\n]'. |
3. The awkward case of [.\n]
>>> pat = re.compile('<[.\n]*?>')
>>> pat.findall(msg)
['<\n>']
>>> msg
'<\n>asdfasf<asfa>'
>>> msg = '<\nasdfs>adaf<adaf>'
>>> pat.findall(msg)
[]
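A likely reason for the empty result: in Python's re module, a dot inside a character class is a literal '.', not a wildcard, so '[.\n]' accepts only dots and newlines. The sketch below shows this and tries '[\s\S]' as a class that accepts every character. The sample string dots_only is an assumption added for illustration.

```python
import re

msg = '<\nasdfs>adaf<adaf>'
dots_only = '<..\n.>'

# Inside [...] the dot is literal, so this class accepts only '.' and '\n'.
literal_dot = re.compile('<[.\n]*?>')
hits_on_dots = literal_dot.findall(dots_only)
hits_on_msg = literal_dot.findall(msg)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
any_char = re.compile(r'<[\s\S]*?>')
all_tags = any_char.findall(msg)

print(hits_on_dots, hits_on_msg, all_tags)
```

The first pattern matches the dots-and-newline tag but nothing in msg, while the '[\s\S]' version finds both tags.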
After some googling I found the answer here: add the DOTALL option, as follows:
>>> pat = re.compile('<.*?>', re.DOTALL)
>>> pat.findall(msg)
['<\nasdfs>', '<adaf>']
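For comparison, here are a few spellings that should give the same result on this input: the re.S alias, the inline (?s) flag, and a negated character class that needs no flag. This is a sketch rather than a recommendation, and the variable names are illustrative.

```python
import re

msg = '<\nasdfs>adaf<adaf>'

# re.DOTALL makes the dot match newlines as well; re.S is its short alias.
with_flag = re.compile('<.*?>', re.DOTALL).findall(msg)
with_alias = re.compile('<.*?>', re.S).findall(msg)

# The same flag can be set inside the pattern itself.
with_inline = re.compile('(?s)<.*?>').findall(msg)

# A negated class matches any character except '>', newlines included.
with_negated = re.compile('<[^>]*>').findall(msg)

print(with_flag == with_alias == with_inline == with_negated)
```

The negated class avoids the flag entirely, but it is only equivalent when a '>' cannot legitimately appear inside the tag.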