http://code.google.com/p/pyv8/ (PyV8, intended for use in crawlers)

http://wwwsearch.sourceforge.net/mechanize/

http://code.google.com/p/multi-mechanize/

http://xiudaima.appspot.com/code/detail/14001

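"""
urllib2 is a very handy HTTP client library, but using it to write a crawler runs into two problems:
first, when making consecutive requests to the same site, cookies have to be added to the HTTP request headers by hand;
second, the Referer header has to be handled by hand. The code below handles both automatically:
each request carries the cookies and the Referer from the previous response.
It borrows some code from the ClientCookie library.
"""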

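import urllib2
from cookielib import CookieJar


class HTTPRefererProcessor(urllib2.BaseHandler):
    """Send the URL of the previous response as the Referer of the next request."""

    def __init__(self):
        self.referer = None

    def http_request(self, request):
        # Add a Referer only if one is known and the caller has not set it.
        if self.referer is not None and not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        # Remember the URL of this response for the next request.
        self.referer = response.geturl()
        return response

    https_request = http_request
    https_response = http_response


def main():
    # Placeholder URLs (not given in the original post); replace with real pages.
    url1 = "http://example.com/"
    url2 = "http://example.com/next"

    cj = CookieJar()
    opener = urllib2.build_opener(
        urllib2.HTTPCookieProcessor(cj),
        HTTPRefererProcessor(),
    )
    urllib2.install_opener(opener)
    urllib2.urlopen(url1)  # open the first URL
    urllib2.urlopen(url2)  # open the second URL; sent with cookies and Referer


if "__main__" == __name__:
    main()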
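
The Referer handler above can be checked without touching the network. The following is only a sketch, and it makes two assumptions that are not in the original post: it is ported to Python 3, where urllib2 became urllib.request, and it uses a hypothetical FakeResponse class plus example.com URLs in place of a real server.

```python
import urllib.request


class HTTPRefererProcessor(urllib.request.BaseHandler):
    """Python 3 port of the handler above (assumption: same logic, new module name)."""

    def __init__(self):
        self.referer = None

    def http_request(self, request):
        # Add a Referer only if one is known and the caller has not set it.
        if self.referer is not None and not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        # Remember where this response came from for the next request.
        self.referer = response.geturl()
        return response

    https_request = http_request
    https_response = http_response


class FakeResponse:
    """Hypothetical stand-in for a real response; only geturl() is needed here."""

    def __init__(self, url):
        self._url = url

    def geturl(self):
        return self._url


handler = HTTPRefererProcessor()

# First request: nothing has been fetched yet, so no Referer is added.
first = urllib.request.Request("http://example.com/a")
handler.http_request(first)
first_referer = first.get_header("Referer")

# Simulate receiving the response to the first request.
handler.http_response(first, FakeResponse("http://example.com/a"))

# Second request: the handler fills in the previous URL as Referer.
second = urllib.request.Request("http://example.com/b")
handler.http_request(second)
second_referer = second.get_header("Referer")

print(first_referer, second_referer)
```

Calling the handler methods directly mirrors what the opener does for each request and response, which makes the state carried between the two requests easy to see.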


CookieJar.extract_cookies(response, request)

Extract cookies from HTTP response and store them in the CookieJar, where allowed by policy.

The CookieJar will look for allowable Set-Cookie and Set-Cookie2 headers in the response argument, and store cookies as appropriate (subject to the CookiePolicy.set_ok() method’s approval).

The response object (usually the result of a call to urllib2.urlopen(), or similar) should support an info() method, which returns an object with a getallmatchingheaders() method (usually a mimetools.Message instance).

The request object (usually a urllib2.Request instance) must support the methods get_full_url(), get_host(), unverifiable(), and get_origin_req_host(), as documented by urllib2. The request is used to set default values for cookie-attributes as well as for checking that the cookie is allowed to be set.
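
As an illustration of the description above, here is a minimal sketch of calling extract_cookies() by hand. It assumes Python 3, where cookielib became http.cookiejar and the headers object is queried with get_all() rather than getallmatchingheaders(). FakeResponse, the cookie value, and the example.com URLs are hypothetical and exist only for this demonstration.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical response: extract_cookies() only needs an info() method."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            # Assigning to a Message key appends a header instead of replacing it.
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


request = urllib.request.Request("http://example.com/login")
response = FakeResponse(["sessionid=abc123; Path=/"])

# Default policy: the cookie is accepted and stored in the jar.
jar = http.cookiejar.CookieJar()
jar.extract_cookies(response, request)
names = [cookie.name for cookie in jar]

# The stored cookie is attached to a later request for the same host.
follow_up = urllib.request.Request("http://example.com/profile")
jar.add_cookie_header(follow_up)
cookie_header = follow_up.get_header("Cookie")

# A policy that blocks the domain refuses the same cookie.
strict_policy = http.cookiejar.DefaultCookiePolicy(blocked_domains=["example.com"])
strict_jar = http.cookiejar.CookieJar(policy=strict_policy)
strict_jar.extract_cookies(response, request)
blocked_count = len(strict_jar)

print(names, cookie_header, blocked_count)
```

In normal use HTTPCookieProcessor makes these calls itself; invoking them directly is mainly useful for seeing how the policy decides what gets stored.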

http://fly5.com.cn/p/p-like/python_https.html

http://www.cnblogs.com/xiaoxia/archive/2010/08/04/1792461.html?login=1

Original post: https://www.cnblogs.com/lexus/p/1851677.html