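HTTP is a stateless protocol: after a successful login, for example, the login state is lost as soon as you visit another page on the same site. Session information therefore has to be stored somewhere, using either a cookie or a session.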
cookie
All session information is stored on the client.
session
Session information is stored on the server, but the session ID that the server sends to the client is still kept in a cookie on the client.
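To make the difference concrete, here is a minimal sketch of the session approach, assuming a hypothetical in-memory `SESSIONS` store and a cookie named `sessionid` (neither comes from a real site). The server keeps the data and the client only carries the ID back in its cookie.

```python
import uuid
from http.cookies import SimpleCookie

# Hypothetical in-memory session store kept on the server side
SESSIONS = {}


def login(username):
    """Create a server-side session and return the cookie string for the client."""
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"username": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id  # only the ID travels to the client
    return cookie["sessionid"].OutputString()


def current_user(cookie_header):
    """Look up the session from the Cookie header sent back by the client."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if "sessionid" not in cookie:
        return None
    session = SESSIONS.get(cookie["sessionid"].value)
    return session["username"] if session else None


set_cookie = login("alice")
print(set_cookie)                # "sessionid=" followed by a random hex ID
print(current_user(set_cookie))  # the user stored on the server for that ID
print(current_user(""))          # no cookie sent, so no session is found
```

With the pure cookie approach, the username itself (and any other state) would be written into the cookie instead of an opaque ID.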
Cookies in practice
Python 3.x
http.cookiejar
Python 2.x
cookielib
Example:
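If one script has to run under both versions, a common pattern is to try the Python 3 module first and fall back to the Python 2 one. The alias `cookiejar` below is just a name chosen for this sketch.

```python
try:
    import http.cookiejar as cookiejar  # Python 3.x
except ImportError:
    import cookielib as cookiejar       # Python 2.x

# Both modules expose the same CookieJar class
jar = cookiejar.CookieJar()
print(len(jar))  # a new jar holds no cookies yet
```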
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse

user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3107.4 Safari/537.36"

url = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LUPvX"
# URL-encode the form fields, then encode the result as UTF-8 bytes
postdata = urllib.parse.urlencode({"username": "weisuen", "password": "aA123456"}).encode('utf-8')
# Build the Request object; supplying data makes it a POST
req = urllib.request.Request(url, postdata)
req.add_header("User-Agent", user_agent)
# Log in and fetch the response page
data = urllib.request.urlopen(req).read()

# Fetch the home page with a plain GET; no form data is needed here
url2 = "http://bbs.chinaunix.net/"
req2 = urllib.request.Request(url2)
req2.add_header("User-Agent", user_agent)
data2 = urllib.request.urlopen(req2).read()

# Write both responses to files
with open('1.html', 'wb') as one, open('2.html', 'wb') as two:
    one.write(data)
    two.write(data2)
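Open the two files: 1.html shows that the login succeeded, while 2.html shows a logged-out page. This happens because no cookie handling was set up, so the cookie returned by the login response was never sent with the second request.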
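Adding cookie handling
Approach:
A. Import the cookie handling module http.cookiejar
B. Create a CookieJar object with http.cookiejar.CookieJar()
C. Create a cookie processor with HTTPCookieProcessor and pass it to build_opener to construct an opener object
D. Install that opener as the global default opener
Modify the code above as follows: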
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import http.cookiejar

url = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LUPvX"
postdata = urllib.parse.urlencode({"username": "weisuen", "password": "aA123456"}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3107.4 Safari/537.36")

# Create the CookieJar object that will hold the cookies
cookie_jar = http.cookiejar.CookieJar()
# Build an opener whose cookie processor reads from and writes to the jar
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
# Install it as the global default opener, so urlopen() uses it too
urllib.request.install_opener(opener)

# Log in; the cookies from the response are stored in cookie_jar
data = opener.open(req).read()

# This request goes through the installed opener and sends the stored cookies
url2 = "http://bbs.chinaunix.net/"
data2 = urllib.request.urlopen(url2).read()

with open('3.html', 'wb') as one, open('4.html', 'wb') as two:
    one.write(data)
    two.write(data2)
Now both 3.html and 4.html show the logged-in state.
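The same effect can be reproduced without touching the real forum. The sketch below starts a throwaway local server as a stand-in: `DemoHandler`, the `/login` path and the `sessionid=demo123` cookie are all invented for this illustration. It then compares an opener without a cookie processor against one with a CookieJar.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the forum: /login sets a cookie, other paths check it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            body = b"login ok"
            self.send_header("Set-Cookie", "sessionid=demo123; Path=/")
        else:
            sent = self.headers.get("Cookie") or ""
            body = b"logged in" if "sessionid=demo123" in sent else b"not logged in"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
no_proxy = urllib.request.ProxyHandler({})  # ignore any proxy set in the environment

# Without a cookie processor, the cookie from /login is discarded
plain = urllib.request.build_opener(no_proxy)
plain.open(base + "/login").read()
without_jar = plain.open(base + "/").read()

# With a CookieJar, the cookie is stored and sent back automatically
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(no_proxy, urllib.request.HTTPCookieProcessor(jar))
opener.open(base + "/login").read()
with_jar = opener.open(base + "/").read()

print(without_jar)
print(with_jar)
print([(c.name, c.value) for c in jar])  # inspect what the jar captured

server.shutdown()
server.server_close()
```

The first print should show the logged-out body and the second the logged-in one, mirroring the difference between 2.html and 4.html. Iterating over the jar is also a quick way to check which cookies a real login actually returned.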