Python Web Access

1. Fetching web pages

Reading a page:

import sys
import urllib2

req = urllib2.Request('http://www.python.org')
page = urllib2.urlopen(req)
# Write each line of the response to standard output
for line in page:
    sys.stdout.write(line)

If the URL passed to Request does not include a scheme (such as http://), an error is raised.
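The missing-scheme error can be reproduced without a network connection. The sketch below assumes Python 3, where urllib2's Request lives in urllib.request and the check happens as soon as the Request is constructed. The helper name has_scheme is hypothetical.

```python
from urllib.request import Request


def has_scheme(url):
    """Return True if Request accepts the URL, False if it has no scheme."""
    try:
        Request(url)
    except ValueError:
        # Raised for URLs such as 'www.python.org' that carry no scheme.
        return False
    return True


print(has_scheme('http://www.python.org'))
print(has_scheme('www.python.org'))
```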

The info() method returns the page's headers:

import urllib2

req = urllib2.Request('http://www.python.org/')
page = urllib2.urlopen(req)
info = page.info()
print info

Output:

>>>
Date: Fri, 12 Jun 2009 13:07:11 GMT
Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
Last-Modified: Fri, 12 Jun 2009 10:01:57 GMT
ETag: "105800d-43bd-46c23cc794f40"
Accept-Ranges: bytes
Content-Length: 17341
Connection: close
Content-Type: text/html
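Individual headers can also be looked up by name instead of printing the whole block. The sketch below assumes Python 3, where the object returned by info() is a subclass of email.message.Message. To stay offline it parses a few of the header lines shown above rather than fetching the page, and the variable names are illustrative.

```python
from email import message_from_string

# A few header lines copied from the output above.
raw_headers = (
    "Content-Length: 17341\n"
    "Connection: close\n"
    "Content-Type: text/html\n"
    "\n"
)

headers = message_from_string(raw_headers)

# Header lookup is case-insensitive.
content_type = headers['Content-Type']
content_length = int(headers.get('content-length', 0))

print(content_type, content_length)
```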

2. Authentication

dump_info_auth.py shows how to use urllib2 to open a page that requires authentication.
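dump_info_auth.py itself is not reproduced here, so the following is only a minimal sketch of one common approach, HTTP Basic authentication, and not necessarily what that script does. It assumes Python 3 (urllib.request in place of urllib2). The URL, user name, password and the helper name build_auth_opener are all placeholders.

```python
import urllib.request


def build_auth_opener(top_url, username, password):
    """Build an opener that answers HTTP Basic challenges under top_url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # A realm of None means: use these credentials for any realm under top_url.
    password_mgr.add_password(None, top_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


# Placeholder URL and credentials; nothing is fetched until opener.open() runs.
opener = build_auth_opener('http://www.example.com/private/', 'user', 'secret')
# page = opener.open('http://www.example.com/private/')
# print(page.info())
```

Calling opener.open() on a protected URL would then retry with the stored credentials when the server answers 401.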

3. Submitting form data

With GET, the query string can be written into the URL by hand, or generated with urllib's urlencode function:

import urllib2, urllib

url = 'http://www.wunderground.com/cgi-bin/findweather/getForecast'
url = url + '?' + urllib.urlencode([('query','shenyang')])
#print url
req = urllib2.Request(url)
page = urllib2.urlopen(req)
info = page.info()
print info
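To see what urlencode actually produces, here is a short offline sketch. It assumes Python 3, where the function lives in urllib.parse. The second query value is a made-up example to show escaping.

```python
from urllib.parse import urlencode

base = 'http://www.wunderground.com/cgi-bin/findweather/getForecast'

# A list of (name, value) pairs becomes name=value pairs joined by '&'.
query = urlencode([('query', 'shenyang')])
url = base + '?' + query
print(url)

# Spaces and other reserved characters are escaped automatically.
spaced = urlencode({'query': 'new york'})
print(spaced)
```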

With POST, unlike GET, the data is not appended to the URL as a query string. Instead, the encoded data is passed as an argument to urlopen().

import urllib2, urllib

url = 'http://www.wunderground.com/cgi-bin/findweather/getForecast'
data = urllib.urlencode([('query','shenyang')])
req = urllib2.Request(url)
# Passing data as the second argument makes this a POST request
page = urllib2.urlopen(req,data)
info = page.info()
print info
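The difference between the two requests can be checked without sending anything. This sketch assumes Python 3, where the request body must be bytes, and uses Request.get_method() to show which HTTP method would be used. The variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.wunderground.com/cgi-bin/findweather/getForecast'

# In Python 3 the POST body must be bytes, so the encoded string is converted.
data = urlencode([('query', 'shenyang')]).encode('ascii')

get_req = Request(url)               # no data: sent as GET
post_req = Request(url, data=data)   # with data: sent as POST

print(get_req.get_method())
print(post_req.get_method())
```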

4. Handling errors

error_all.py catches exceptions raised while connecting, and checks whether the length of the document matches its Content-Length header.
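error_all.py is not shown here, so the following is a rough sketch of the two checks described, not the script itself. It assumes Python 3 (urllib.request and urllib.error in place of urllib2). The function names and the 10-second timeout are assumptions.

```python
import urllib.error
import urllib.request


def length_matches(declared, body):
    """Compare a Content-Length header value with the bytes actually read."""
    if declared is None:
        return True  # the server sent no Content-Length, nothing to check
    return int(declared) == len(body)


def fetch_checked(url, timeout=10):
    """Return the body, or None on a connection error or length mismatch."""
    try:
        page = urllib.request.urlopen(url, timeout=timeout)
        body = page.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        print('HTTP error:', err.code)
        return None
    except urllib.error.URLError as err:
        # The connection itself failed (DNS failure, refused connection, ...).
        print('Connection error:', err.reason)
        return None
    if not length_matches(page.info().get('Content-Length'), body):
        print('Document length does not match Content-Length')
        return None
    return body
```

HTTPError is caught first because it is a subclass of URLError. Errors raised while reading the body, such as a socket timeout, are not covered by this sketch.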

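Chapter 7: Parsing HTML and XHTML

This chapter uses the HTMLParser module that ships with Python. The following program extracts a document's title and counts its tags.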

# -*- coding: cp936 -*-
from HTMLParser import HTMLParser
import urllib2

# Parse the title of a web page and count its start tags
class TitleParser(HTMLParser):
    def __init__(self):
        # Text of the title
        self.title = ''
        self.readingtitle = 0
        self.count = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        self.count += 1
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        if self.readingtitle:
            # Append, since the title text may arrive in several chunks
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.readingtitle = 0
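For readers on Python 3, here is a self-contained sketch of the same parser together with one way to drive it. It assumes the html.parser module (the Python 3 name for HTMLParser) and feeds a made-up HTML string instead of a downloaded page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the <title> text and count start tags."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.readingtitle = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1
        if tag == 'title':
            self.readingtitle = True

    def handle_data(self, data):
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.readingtitle = False


# Made-up sample document; a real page would come from urlopen().read().
sample = '<html><head><title>Hello</title></head><body><p>text</p></body></html>'

parser = TitleParser()
parser.feed(sample)
parser.close()

print(parser.title, parser.count)
```

With this sample, the parser sees five start tags (html, head, title, body, p) and the title "Hello".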