Python: A Beginner's Self-Study Notes on Web Scraping

Study material source: 金角大王 on 博客园 (cnblogs)

－－－－－－－－－－－－－－－－－－－－－－－－－－－2018.7.22－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

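Basic Usage of the urllib Library

What is urllib?

urllib is Python's built-in HTTP request library.
It includes the following basic modules:
urllib.request      request module
urllib.error        exception handling module
urllib.parse        URL parsing module
urllib.robotparser  robots.txt parsing module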
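
To make the four modules a little more concrete, here is a minimal sketch that touches each one without sending any request. The robots.txt rules and the "demo-bot" agent name are made up for illustration; they do not come from the original notes.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split a URL into its components.
parts = urllib.parse.urlparse("http://httpbin.org/get?word=shang")
print(parts.netloc, parts.path, parts.query)

# urllib.request: describe a request (constructing it sends nothing).
req = urllib.request.Request("http://httpbin.org/get")
print(req.get_method(), req.host)

# urllib.error: HTTPError is a subclass of URLError.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))

# urllib.robotparser: evaluate robots.txt rules supplied as lines of text.
# The rules and the "demo-bot" agent name are hypothetical.
robots = urllib.robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
blocked = robots.can_fetch("demo-bot", "http://example.com/private/page")
allowed = robots.can_fetch("demo-bot", "http://example.com/index.html")
print(blocked, allowed)
```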

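Using urllib.request

urlopen

urllib.request.urlopen

Full signature: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
(cafile, capath and cadefault have been deprecated since Python 3.6 in favor of context.)

Commonly used form: urllib.request.urlopen(url, data, timeout)

The url parameter

A simple example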

import urllib.request

# Send a GET request and print the decoded response body
response = urllib.request.urlopen('https://bj.ke.com/')
print(response.read().decode('utf-8'))

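Result: the source of the 贝壳 (bj.ke.com) home page is printed, including its embedded JavaScript.

response.read() returns the page content as bytes; decode('utf-8') decodes those bytes into a string using UTF-8.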
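
Hard-coding 'utf-8' works when the page really is UTF-8. As a hedged alternative, the sketch below reads the charset from the response's Content-Type header and only falls back to UTF-8 when none is declared. The helper name read_text and the data: URL are assumptions for this illustration; a data: URL carries its own payload, so nothing is fetched from the network.

```python
import urllib.request


def read_text(url, default_charset="utf-8"):
    """Fetch url and decode the body with the charset the response declares."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# Hypothetical demo URL: the percent-encoded payload is the UTF-8 bytes of "你好".
demo_url = "data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD"
text = read_text(demo_url)
print(type(text).__name__, text)
```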

The data parameter

Site: http://httpbin.org

This site can be used to simulate all kinds of requests.

Example:

import urllib.parse
import urllib.request

# URL-encode the form fields, then convert the resulting string to bytes
data = bytes(urllib.parse.urlencode({'word': 'shang'}), encoding='utf8')
print("data:", data)
# Passing data makes urlopen send a POST request
response1 = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response1.read().decode('utf-8'))

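Notes:

1. urllib.parse is the URL parsing module.

2. bytes(urllib.parse.urlencode(...)) converts the POST data into bytes that can be passed as the data argument of urllib.request.urlopen; that completes one POST request.

3. When the data argument is supplied, the request is sent as POST; without it, the request is sent as GET.

4. The difference between GET and POST: with GET the request data is carried in the URL, while with POST it is carried in the request body. The POST data corresponds to the data parameter.

GET: asks the specified resource to be "shown". GET should only be used to read data and should not be used for operations that produce side effects, for example in a web application. One reason is that GET URLs may be visited freely by web spiders and the like.

POST: submits data to the specified resource and asks the server to process it (for example, submitting a form or uploading a file). The data is included in the request body. The request may create a new resource, modify an existing one, or both.

Result: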

data: b'word=shang'
{"args":{},"data":"","files":{},"form":{"word":"shang"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.7"},"json":null,"origin":"183.240.196.58","url":"http://httpbin.org/post"}
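
Notes 3 and 4 above can be checked without sending anything, by building urllib.request.Request objects and inspecting them. This is only a sketch: it reuses the httpbin.org URLs from the example, but the variable names are made up here and no request is actually sent.

```python
import urllib.parse
import urllib.request

params = {"word": "shang"}
query = urllib.parse.urlencode(params)

# GET: the parameters travel in the URL's query string, and data stays None.
get_req = urllib.request.Request("http://httpbin.org/get?" + query)

# POST: the same parameters, encoded to bytes, travel in the request body.
post_req = urllib.request.Request(
    "http://httpbin.org/post", data=query.encode("utf-8")
)

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Request.get_method() decides between GET and POST purely from whether data is None, which is the behavior urlopen relies on in the example above.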

The timeout parameter

The time limit for the request, in seconds. If no response is obtained within the timeout, an exception is raised.

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

Note: unlike POST, a GET request does not need the data parameter.

Result:

b'{"args":{},"headers":{"Accept-Encoding":"identity","Connection":"close","Host":"httpbin.org","User-Agent":"Python-urllib/3.7"},"origin":"183.240.196.58","url":"http://httpbin.org/get"}\n'

If timeout is set to 0.1, the request will typically time out and a timeout error is reported.

To catch the timeout exception, change the code to:

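import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout while connecting is wrapped in URLError; the cause is in e.reason
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
except socket.timeout:
    # A timeout while reading the response can be raised directly
    print('TIME OUT')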

Result: TIME OUT
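
The same check can be wrapped in small helpers so that calling code does not repeat the try/except. This is a sketch under assumptions: the names is_timeout and fetch are invented here, and returning None on a timeout is just one possible convention.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc looks like a timeout raised by urlopen."""
    if isinstance(exc, socket.timeout):
        return True  # raised directly, e.g. while reading the response
    if isinstance(exc, urllib.error.URLError):
        # wrapped, e.g. while connecting: the cause is stored in exc.reason
        return isinstance(exc.reason, socket.timeout)
    return False


def fetch(url, timeout):
    """Return the body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # any other failure is passed on to the caller
```

With these helpers, fetch('http://httpbin.org/get', timeout=0.1) would return None in the timeout case shown above, while other errors (such as an HTTP 404) still propagate.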

Response

  1) type(response)    the response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Result: <class 'http.client.HTTPResponse'>

  2) response.status    the response status code

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)

Result: 200 (the request succeeded)

  3) response.getheaders()    gets all response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.getheaders())

Result:

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48812'), ('Accept-Ranges', 'bytes'), ('Date', 'Sun, 22 Jul 2018 15:41:38 GMT'), ('Via', '1.1 varnish'), ('Age', '2128'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2136-IAD, cache-hkg17930-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '4, 729'), ('X-Timer', 'S1532274099.759367,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

  4) response.getheader("server")    gets the value of the Server header

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.getheader("server"))

Result: nginx

  5) response.read()    returns the content of the response body

Omitted; see the example under "The url parameter"...
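
The five items above can also be tried offline. The sketch below feeds a canned HTTP response into http.client.HTTPResponse, the same class that urlopen returned in item 1. The FakeSocket class and the raw response bytes are hypothetical stand-ins for a real connection, written only for this illustration.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket: hands HTTPResponse canned bytes."""

    def __init__(self, payload):
        self._file = io.BytesIO(payload)

    def makefile(self, *args, **kwargs):
        return self._file


# A made-up response: status line, three headers, blank line, 5-byte body.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)

response = http.client.HTTPResponse(FakeSocket(raw))
response.begin()  # parse the status line and headers

status = response.status
server = response.getheader("server")  # header lookup is case-insensitive
headers = response.getheaders()        # list of (name, value) tuples
body = response.read().decode("utf-8")
print(type(response), status, server, headers, body)
```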

－－－－－－－－－－－－－－－－－－－－－－－－－－2018.7.23－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

request　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　

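Setting Headers

To keep automated crawlers from overloading them, many sites only serve requests that carry certain header information; the most common one is the User-Agent header.

Example: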