An Analysis of the urllib2 Module in Python

Name

urllib2 - An extensible library for opening URLs using a variety of protocols

 

1. Description

The simplest way to use this module is to call the urlopen function, which accepts either a string containing a URL or a Request object. It opens the URL and returns the result as a file-like object.
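To make "file-like object" concrete, here is a minimal sketch that opens a local file: URL, so it needs no network access. The temporary file and its contents are made up for illustration, and the try/except import is an assumption added so the same lines also run on Python 3, where urllib2 was folded into urllib.request.

```python
import os
import tempfile

try:
    import urllib2                       # Python 2, as used in this article
    from urllib import pathname2url
except ImportError:
    import urllib.request as urllib2     # assumption: Python 3 equivalent
    from urllib.request import pathname2url

# Create a small local file so the example does not depend on any server.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello from urlopen\n')
os.close(fd)

# urlopen accepts a plain URL string; 'file:' URLs are handled by FileHandler.
file_url = 'file:' + pathname2url(path)
response = urllib2.urlopen(file_url)
try:
    body = response.read()               # behaves like reading an open file
    opened_url = response.geturl()       # the URL that was actually opened
finally:
    response.close()
    os.remove(path)

print(body)
print(opened_url)
```

The same read(), geturl() and close() calls apply to the response returned for an http URL.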

2. Classes

    exceptions.IOError(exceptions.EnvironmentError)

        URLError

            HTTPError(URLError, urllib.addinfourl)

    AbstractBasicAuthHandler

        HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)

        ProxyBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)

    AbstractDigestAuthHandler

    BaseHandler

        AbstractHTTPHandler

            HTTPHandler

            HTTPSHandler

        FTPHandler

            CacheFTPHandler

        FileHandler

        HTTPCookieProcessor

        HTTPDefaultErrorHandler

        HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)

        HTTPErrorProcessor

        HTTPRedirectHandler

        ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)

        ProxyHandler

        UnknownHandler

    HTTPPasswordMgr

        HTTPPasswordMgrWithDefaultRealm

    OpenerDirector

    Request
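One practical consequence of the hierarchy above is that HTTPError is a subclass of URLError, so HTTPError has to be tested (or caught) first. The sketch below builds both exceptions by hand instead of making a real request. The helper name describe_failure and the /missing path are hypothetical, and the import fallback is an assumption for running on Python 3.

```python
try:
    from urllib2 import URLError, HTTPError        # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError   # assumption: Python 3 equivalent


def describe_failure(exc):
    """Summarise an exception raised by urlopen in one line."""
    # HTTPError must be checked first because it is also a URLError.
    if isinstance(exc, HTTPError):
        return 'HTTP error %d' % exc.code
    if isinstance(exc, URLError):
        return 'URL error: %s' % (exc.reason,)
    return 'other error: %r' % (exc,)


# Hand-built exceptions standing in for real failures (no network needed).
not_found = HTTPError('http://www.server.com/missing', 404, 'Not Found', {}, None)
unreachable = URLError('connection refused')

print(describe_failure(not_found))
print(describe_failure(unreachable))
```

In real code the same ordering applies to except clauses: put `except HTTPError` before `except URLError`.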

3. Two ways to fetch a web page

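Mode 1

  # Import the module
  import urllib2
  # Build the request (arguments after url are shown with their default values)
  url = 'http://www.baidu.com'
  request = urllib2.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False)
  # Open the Request object; this returns the server's response object
  response = urllib2.urlopen(request)
  # Print the page source
  print(response.read())

Building a Request object first and then passing it to urlopen to obtain the server's response keeps the logic clear and explicit.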
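A Request object can be inspected before anything is sent, which is a handy way to check what urlopen will do with it. This is only a sketch: the URL and the key=value payload are placeholders, and the import fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 equivalent

url = 'http://www.baidu.com'

# Without data the request is a GET; with data it becomes a POST.
get_request = urllib2.Request(url)
post_request = urllib2.Request(url, data=b'key=value')   # placeholder payload

# Headers can also be attached after construction.
get_request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(get_request.get_full_url())            # http://www.baidu.com
print(get_request.get_method())              # GET
print(post_request.get_method())             # POST
# Header names are stored capitalized, hence 'User-agent' here.
print(get_request.get_header('User-agent'))
```

Nothing is transmitted until one of these objects is passed to urlopen.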

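Mode 2

  # Import the module
  import urllib2
  # Open the URL directly; this returns the server's response object.
  # Full signature: urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
  #                         cafile=None, capath=None, cadefault=False, context=None)
  url = 'http://www.baidu.com'
  response = urllib2.urlopen(url)
  # Print the page source
  print(response.read())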

4. Setting Headers

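Many servers and proxy servers inspect HTTP headers in order to control traffic, balance load, and block access from abnormal clients. We therefore need to know how to set HTTP headers so that such requests go through.
The code is as follows: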
  import urllib
  import urllib2
  url = 'http://www.server.com/login'
  # Pretend to be a browser
  user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
  values = {'username' : 'cqc', 'password' : 'XXXX' }
  headers = { 'User-Agent' : user_agent }
  # Encode the form fields; passing data makes this a POST request
  data = urllib.urlencode(values)
  request = urllib2.Request(url, data, headers)
  response = urllib2.urlopen(request)
  page = response.read()


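Here we build a headers dictionary and pass it in when constructing the Request, so the headers are sent along with the request. If the server recognizes the request as coming from a browser, it will respond.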
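It is worth seeing what urlencode actually produces, because it is what makes arbitrary form values safe to send. The sketch below uses a made-up password containing special characters and a list of pairs so the field order is fixed. The import fallback and the explicit encode step are assumptions so that the lines also run on Python 3, where the request body must be bytes.

```python
try:
    import urllib2                          # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2        # assumption: Python 3 equivalent
    from urllib.parse import urlencode

# A list of pairs keeps the field order stable; the password is hypothetical.
values = [('username', 'cqc'), ('password', 'p@ss word&1')]
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

encoded = urlencode(values)
print(encoded)                              # username=cqc&password=p%40ss+word%261

# encode() is a no-op in spirit on Python 2 and yields bytes on Python 3.
data = encoded.encode('ascii')
request = urllib2.Request('http://www.server.com/login', data, headers)
print(request.get_method())                 # POST, because data was supplied
```

The request is only constructed here, not sent, so the login URL from the article is never contacted.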

Common User-Agent strings

1.Android

Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

2.Firefox

Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

3.Google Chrome

Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
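One way to put such a list to use is to pick a User-Agent at random for each request. This is a rough sketch rather than a recommended practice: the helper name build_request is hypothetical, only three of the strings above are included, and the import fallback is an assumption for Python 3.

```python
import random

try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 equivalent

# Three of the User-Agent strings listed above.
USER_AGENTS = [
    'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
]


def build_request(url):
    """Return a Request for url carrying a randomly chosen User-Agent."""
    return urllib2.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})


request = build_request('http://www.baidu.com')
print(request.get_header('User-agent'))
```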

5. Setting a proxy server

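Sending requests through a proxy server keeps the target server from rate-limiting or blocking your own IP. You can pick proxy IPs from a proxy-list page yourself and switch to a different one at regular intervals. Proxy list URL: http://www.xicidaili.com/
The code is as follows: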
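  import urllib2
  enable_proxy = True
  # Route http requests through the given proxy
  proxy_handler = urllib2.ProxyHandler({"http":"61.135.217.7:80"})
  # An empty dict disables proxying
  null_proxy_handler = urllib2.ProxyHandler({})
  if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
  else:
    opener = urllib2.build_opener(null_proxy_handler)
  # Install the opener globally so that later urllib2.urlopen calls use it
  urllib2.install_opener(opener)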

6. Setting a timeout

The third parameter of urlopen is timeout, which sets how many seconds to wait before giving up. It limits the impact of sites that respond too slowly.
  import urllib2
  response = urllib2.urlopen('http://www.baidu.com', timeout=10)
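Setting a timeout only helps if the resulting exception is handled. Depending on where the delay occurs, it may surface as socket.timeout directly or wrapped in a URLError, so the sketch below checks both. The helper name fetch_with_timeout and its opener parameter are hypothetical (the parameter exists only so the function can be exercised without a network), and the import fallback is an assumption for Python 3.

```python
import socket

try:
    import urllib2                          # Python 2
    from urllib2 import URLError
except ImportError:
    import urllib.request as urllib2        # assumption: Python 3 equivalent
    from urllib.error import URLError


def fetch_with_timeout(url, timeout=10, opener=urllib2.urlopen):
    """Return the body of url, or None if the request timed out."""
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except socket.timeout:
        # Raised directly when reading the response takes too long.
        return None
    except URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(getattr(exc, 'reason', None), socket.timeout):
            return None
        raise
```

A call such as fetch_with_timeout('http://www.baidu.com', timeout=10) then returns either the page body or None, and any error that is not a timeout still propagates.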

Original article: https://www.cnblogs.com/windyrainy/p/10592594.html