Python challenge 3

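URL of the third challenge: http://www.pythonchallenge.com/pc/def/ocr.html
Hint 1: recognize the characters. maybe they are in the book, but MAYBE they are in the page source.
Hint 2: a comment in the page source says: find rare characters in the mess below; it is followed by a large block of characters.
Clearly the task is to find the characters that occur least often in that block, ignoring whitespace. Characters with the same count are kept in the order in which they appear.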

import re
import urllib

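# Use urllib to open the page
response = urllib.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html")
source = response.read()
response.close()

# Print the whole HTML source
print source

# Collect the contents of every HTML comment in the page
# (assuming the usual <!-- ... --> comment delimiters)
data = re.findall(r'<!--(.+?)-->', source, re.S)

# Pick out the letters from the second comment, which holds the mess
charList = re.findall(r'([a-zA-Z])', data[1], re.S)
print charList
print ''.join(charList)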

The final result is:

['e', 'q', 'u', 'a', 'l', 'i', 't', 'y']
equality
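
The listing above works because the only letters in the mess happen to be the rare characters. A more literal reading of the hint is to count every character and keep the least frequent ones. The following is a minimal sketch of that idea, written for Python 3 (unlike the Python 2 listings in this post); SAMPLE_MESS and SAMPLE_SOURCE are made-up stand-ins for the real page, not its actual content.

```python
import re
from collections import Counter

# Hypothetical miniature of the page: two HTML comments, the second holds the mess.
SAMPLE_MESS = "%%$$@@e##^^q&&**u((a))l__i++t{{y}}\n%$@#^&*()_+{}"
SAMPLE_SOURCE = (
    "<html><!-- find rare characters in the mess below: -->\n"
    "<!--\n" + SAMPLE_MESS + "\n-->\n</html>"
)

# Grab the body of every comment; re.S lets '.' run across line breaks.
comments = re.findall(r'<!--(.*?)-->', SAMPLE_SOURCE, re.S)
mess = comments[1]

# Count every non-whitespace character.
counts = Counter(ch for ch in mess if not ch.isspace())

# Keep the characters with the smallest count, in order of first appearance
# (Counter preserves insertion order on Python 3.7+).
fewest = min(counts.values())
rare = ''.join(ch for ch in counts if counts[ch] == fewest)

print(rare)
```

On the real page the same steps would be applied to the downloaded source instead of SAMPLE_SOURCE.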

####################################################################################################################################

Python's urllib library fetches the data at a given URL so that it can then be analyzed.

import urllib
google = urllib.urlopen('http://www.google.com')
print 'http header:\n', google.info()
print 'http status:', google.getcode()
print 'url:', google.geturl()

# result

http header:
Date: Tue, 21 Oct 2014 19:30:35 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=521bc5021bb6e976:FF=0:TM=1413919835:LM=1413919835:S=7cbCQWnhLCPJFOiw; expires=Thu, 20-Oct-2016 19:30:35 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=mzfYCxoBC3d9VaQC6-cXKIcbxt4eekorvE6lon1ZHQhLeVxasD2oeRKEG2In90zRAqNPQ1xLfzR_ha1ife0JqdJankdexWaFjZiQN2mLGjavWCfMBYETbFfIst08iNtR; expires=Wed, 22-Apr-2015 19:30:35 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic,p=0.01

http status: 200
url: http://www.google.com

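We can fetch a page with urlopen and then get all of its content with the read method.

info returns the HTTP headers as an httplib.HTTPMessage object, representing the header information sent back by the remote server.

getcode returns the HTTP status. For an HTTP request, 200 means success and 404 means the URL was not found.

geturl returns the URL the data actually came from.

There are also getenv, which reads an environment variable, and putenv, which sets one, and so on; note that these two belong to the os module rather than to urllib.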

print help(urllib.urlopen)
#result
Help on function urlopen in module urllib:

urlopen(url, data=None, proxies=None)
    Create a file-like object for the specified URL to read from.

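From the above we can see that urlopen creates a file-like object for reading from the specified URL.

The url parameter is the location of the remote data, usually an http or ftp URL.

The data parameter is the data to submit to the URL; when it is supplied, the request is sent as a POST instead of a GET.

The proxies parameter configures proxies.

urlopen returns a file-like object.

It has read(), readline(), readlines(), fileno(), close() and other methods, just like a file object.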
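
In Python 3 the same function lives in urllib.request. The following is a minimal sketch of the file-like interface, assuming Python 3.4 or later, where urlopen also accepts data: URLs; that choice is only there so the example runs without network access, and the URL itself is made up for illustration.

```python
from urllib.request import urlopen

# A data: URL carries its own content, so nothing is fetched over the network.
# (Illustrative URL, not part of the original post.)
url = "data:text/plain,hello%20world"

response = urlopen(url)
body = response.read()         # bytes, as with a file opened in binary mode
final_url = response.geturl()  # the URL the data came from
headers = response.info()      # header object, here synthesized from the URL
response.close()

print(body)
print(final_url)
print(headers)
```

With an http URL the calls are the same; only the headers and the status then come from the remote server.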

####################################################################################################################################

The re regular expression module in Python

re.match: matching a pattern against a string

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"

The output of the code above is:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

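As this shows, group() returns the whole match, while group(n) returns the n-th submatch; the pattern above has two capture groups.

The main signature is re.match(pattern, string, flags)

pattern is the regular expression to match with.

string is the string to be matched against.

flags is optional; several flags can be combined with |.

re.I or re.IGNORECASE: matching does not distinguish upper and lower case.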
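
Besides group(), a match object offers a few other accessors that are handy when inspecting submatches. This is a small sketch in Python 3 syntax that reuses the pattern and sentence from the listing above; groups() and span() are standard match-object methods.

```python
import re

line = "Cats are smarter than dogs"
match_obj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

whole = match_obj.group()         # the entire matched text
both = match_obj.groups()         # all capture groups as a tuple
second_span = match_obj.span(2)   # (start, end) indices of group 2

print(whole)
print(both)
print(second_span)
```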

（Performs case-insensitive matching.）

re.S or re.DOTALL: dot-matches-all mode; changes the behavior of '.' so that it also matches '\n'.

（Makes a period (dot) match any character, including a newline.）

re.M or re.MULTILINE: multi-line mode; changes the behavior of '^' and '$'.

（Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).）

re.L or re.LOCALE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the current locale.

（Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).）

re.U or re.UNICODE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the character properties defined by Unicode.

（Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.）

re.X or re.VERBOSE: verbose mode. In this mode the regular expression can span several lines, whitespace is ignored, and comments can be added.

（Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.）
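
The effect of the most common flags is easiest to see side by side. The following sketch, in Python 3 syntax, runs the same patterns with and without a flag; the sample text is made up for the purpose.

```python
import re

text = "Cats\nare smarter"

# re.S: '.' normally stops at a newline; with DOTALL it matches it too.
no_dotall = re.findall(r'Cats.are', text)
with_dotall = re.findall(r'Cats.are', text, re.S)

# re.M: '^' normally matches only at the start of the string;
# with MULTILINE it also matches at the start of each line.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.M)

# re.I: ignore case.
ignore_case = re.findall(r'cats', text, re.I)

# re.X: whitespace and '#' comments inside the pattern are ignored.
two_words = re.compile(r"""
    (\w+)   # first word
    \s+     # any whitespace, including the newline
    (\w+)   # second word
""", re.X)
verbose_groups = two_words.match(text).groups()

print(no_dotall, with_dotall)
print(starts_default, starts_multiline)
print(ignore_case)
print(verbose_groups)
```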

re.search vs. re.match

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print "match --> matchObj.group() : ", matchObj.group()
else:
   print "No match!!"

searchObj = re.search( r'dogs', line, re.M|re.I)
if searchObj:
   print "search --> searchObj.group() : ", searchObj.group()
else:
   print "Nothing found!!"
   
# result

No match!!
search --> searchObj.group() :  dogs

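As we can see, match only checks at the beginning of the string; if the pattern does not match there, there is no match at all.

search, on the other hand, scans the whole string from start to end.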

re.sub

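The full signature is as follows:

re.sub(pattern, repl, string, count=0)

Replaces every match in string with repl. All matches are replaced unless count is given, in which case at most count of them are replaced.

It then returns the modified string.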

import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print "Phone Num : ", num

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print "Phone Num : ", num

# result

Phone Num :  2004-959-559 
Phone Num :  2004959559
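
The count argument is not exercised by the listing above, so here is a short sketch of it in Python 3 syntax. re.subn, shown alongside, is the standard variant that also reports how many replacements were made.

```python
import re

phone = "2004-959-559"

# Default count=0 replaces every match.
all_removed = re.sub(r'-', '', phone)

# count=1 stops after the first replacement.
first_only = re.sub(r'-', '', phone, count=1)

# subn returns (new_string, number_of_replacements).
result, replaced = re.subn(r'-', '', phone)

print(all_removed)
print(first_only)
print(result, replaced)
```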

re.split(pattern, string, maxsplit=0)

re.split splits a string by a pattern. maxsplit is the maximum number of splits: maxsplit=1 splits once. The default, 0, means no limit.

import re

print re.split(r'\W+', 'Words, words, words.')
print re.split(r'(\W+)', 'Words, words, words.')
print re.split(r'\W+', 'Words, words, words.', 1)

# result

['Words', 'words', 'words', '']
['Words', ', ', 'words', ', ', 'words', '.', '']
['Words', 'words, words.']

If the pattern matches at the start or the end of the string, the returned list begins or ends with an empty string.

import re

print re.split(r'(\W+)', '...words, words...')

# result

['', '...', 'words', ', ', 'words', '...', '']

If the pattern does not match anywhere, a list containing the whole string is returned.

import re

print re.split('a', '...words, words...')

# result

['...words, words...']

####

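str.split('\s') and re.split('\s', str) both split a string and return a list, but they differ.

1. str.split('\s') splits on the literal two-character text '\s'.

2. re.split('\s', str) splits on whitespace, because '\s' means whitespace in a regular expression.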
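
The difference between the two is easy to check directly. This is a minimal sketch in Python 3 syntax, where the backslash is written as '\\s' in ordinary strings to avoid an invalid-escape warning; the sample strings are made up.

```python
import re

text = "a b\tc"

# str.split looks for the literal characters backslash + 's', which are absent here.
literal = text.split('\\s')

# re.split treats \s as "any whitespace character".
by_regex = re.split(r'\s', text)

# The literal form only splits when the text really contains backslash + 's'.
literal_hit = r'x\sy'.split('\\s')

# For plain whitespace splitting, str.split() with no argument also works.
no_argument = text.split()

print(literal)
print(by_regex)
print(literal_hit)
print(no_argument)
```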

re.findall(pattern, string, flags=0)

Finds all substrings matched by the pattern and returns them as a list, in left-to-right order. If there is no match, an empty list is returned.

import re

print re.findall('a', 'bcdef')
print re.findall(r'\d+', '12a34b56c789e')

# result

[]
['12', '34', '56', '789']
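
One detail worth knowing, since the challenge solution at the top relies on it: when the pattern contains capture groups, findall returns the group contents rather than the whole match. A small sketch in Python 3 syntax, with a made-up input string:

```python
import re

text = "a=1, b=22"

# No groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each item is that group's text.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(whole)
print(one_group)
print(two_groups)
```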

re.compile(pattern, flags=0)

Compiles a regular expression and returns a RegexObject, whose match or search method can then be called.

prog = re.compile(pattern)

result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

The first form lets the compiled regular expression be reused.
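
As a sketch of what that reuse looks like in practice (Python 3 syntax; the pattern and the input lines are only examples), one compiled object can be applied to many strings and exposes the same match, search and findall methods as the module-level functions:

```python
import re

# Compile once, then reuse the same object for every input.
number = re.compile(r'\d+')

lines = ["12a34", "no digits", "b56c789"]
found = [number.findall(line) for line in lines]

# The compiled object has the same match/search semantics as re.match/re.search.
first_number = number.search("b56c789").group()
at_start = number.match("b56c789")  # None: the string does not start with a digit

print(found)
print(first_number)
print(at_start)
```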