The Python Path -- Web Crawlers -- Crawler Basics

The most commonly used crawler module: requests

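Python's standard library offers modules such as urllib, urllib2, and httplib for making HTTP requests (in Python 3 these live in urllib.request and http.client), but they are awkward to use. In practice, most people send HTTP requests with the third-party requests module. The two request types used most often with requests are GET and POST.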

Installing requests

# Run in cmd (the command prompt)
pip install requests

Alternatively, download the requests package and install it manually. The first method is generally recommended.

1. GET requests

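import requests   # import the requests module

# Fetch the Baidu home page with a GET request (the URL must include the scheme)
ret = requests.get("http://www.baidu.com")
ret.encoding = "utf-8"  # encoding used to decode the response body
print(ret.text)   # print the returned page text, which is HTML

# ret.content returns the page content as bytes

# Alternatively, let requests detect the encoding from the content;
# set this before reading ret.text to avoid garbled characters
ret.encoding = ret.apparent_encoding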
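
To see why the encoding line matters: ret.content holds the raw bytes, and ret.text is those bytes decoded with ret.encoding. The sketch below imitates that with a plain bytes object instead of a live response, so it runs without network access. The sample string is an assumption chosen for illustration, and ISO-8859-1 stands in for the fallback that requests applies when a text response declares no charset.

```python
# Bytes as a server might send them (what ret.content would hold).
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset does not raise an error;
# it silently produces garbled text.
garbled = raw.decode("ISO-8859-1")

# Decoding with the charset the page really uses restores the text.
correct = raw.decode("utf-8")

print(type(raw).__name__)
print(garbled == correct)
print(correct)
```

Setting ret.encoding before reading ret.text plays the role of choosing the right argument to decode() here.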

A GET request in requests accepts these arguments: url, plus the optional params, headers, and cookies.

How to use the params argument

import requests

requests.get(
    url="http://www.oldboyedu.com",
    params={"nid": 1, "name": "xx"},
)
# The URL actually requested is http://www.oldboyedu.com/?nid=1&name=xx
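
One way to check how params ends up in the URL without sending anything is to build the request and prepare it. This is a minimal sketch using requests.Request and its prepare() method; the variable names are illustrative only.

```python
import requests

# Build the request object but do not send it.
request = requests.Request(
    "GET",
    "http://www.oldboyedu.com",
    params={"nid": 1, "name": "xx"},
)

# prepare() encodes the params into the query string.
prepared = request.prepare()

print(prepared.url)
```

The printed URL carries the query string nid=1&name=xx, which is what the server would receive.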

2. POST requests

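import json      # needed for json.dumps below
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

# Serialize the payload to a JSON string and send it as the request body
ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)      # response body as text
print(ret.cookies)   # cookies set by the server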
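
The body that a POST sends depends on how the payload is passed. The sketch below compares three variants by preparing the requests without sending them: data= with a JSON string (as in the example above), the json= shortcut, and data= with a plain dict, which is form-encoded. The variable names manual, auto, and form are illustrative.

```python
import json

import requests

url = "https://api.github.com/some/endpoint"
payload = {"some": "data"}

# 1. Serialize by hand and set the Content-Type header yourself.
manual = requests.Request(
    "POST", url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
).prepare()

# 2. Let requests serialize the dict and set the header.
auto = requests.Request("POST", url, json=payload).prepare()

# 3. Pass the dict as data: it is sent as an HTML form, not as JSON.
form = requests.Request("POST", url, data=payload).prepare()

for name, prepared in (("manual", manual), ("auto", auto), ("form", form)):
    print(name, prepared.headers["Content-Type"], prepared.body)
```

Which variant to use depends on what the server expects: a JSON API usually wants the first or second form, while a traditional login form usually wants the third.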

More detailed parameters are described in the official requests documentation.

Getting past simple anti-crawler measures: imitating a browser login

Send request headers along with the request:

Commonly used request headers include headers={"User-Agent": "...", "Host": "...", "Referer": "...", plus any custom headers the site expects}
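
The reason the User-Agent header matters is that requests identifies itself as a script unless told otherwise. The sketch below prints that default value and then prepares a request (without sending it) carrying browser-like headers. The Referer value is an assumption for illustration.

```python
import requests

# What requests sends as User-Agent when no header is supplied.
default_ua = requests.utils.default_user_agent()

# Browser-like headers; the Referer value is only an example.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    ),
    "Referer": "https://dig.chouti.com/",
}

prepared = requests.Request(
    "GET",
    "https://dig.chouti.com/all/hot/recent/1",
    headers=browser_headers,
).prepare()

print(default_ua)
print(prepared.headers["User-Agent"])
```

A site that filters on User-Agent can easily reject the default value, which is why crawlers usually copy the string from a real browser.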

Getting cookies (note at which point the site returns them):

cookie_dict = res.cookies.get_dict()
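
As a rough illustration of what get_dict() returns and how the result is sent back, the sketch below builds a cookie jar by hand instead of reading one from a live response. The cookie name and value are made up, and cookiejar_from_dict is used only to simulate what res.cookies would hold.

```python
import requests
from requests.cookies import cookiejar_from_dict

# Stand-in for res.cookies; the cookie name and value are invented.
jar = cookiejar_from_dict({"gpsd": "abc123"})

# get_dict() flattens the jar into a plain name -> value dict.
cookie_dict = jar.get_dict()

# Passing the dict as cookies= puts it in the Cookie request header.
prepared = requests.Request(
    "GET",
    "https://dig.chouti.com/all/hot/recent/1",
    cookies=cookie_dict,
).prepare()

print(cookie_dict)
print(prepared.headers.get("Cookie"))
```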

Authentication: the authentication token returned by the server

Logging in to Chouti automatically

import requests

url = "https://dig.chouti.com/all/hot/recent/1"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}

# Visit a normal page first so the site issues its cookies
response = requests.get(
    url=url,
    headers=header
)
cookie1_dict = response.cookies.get_dict()

# Send a POST request to log in, carrying the cookies from the first visit
data = {
    "phone": "8613121758648",
    "password": "woshiniba",
    "oneMonth": 1
}
response1 = requests.post(
    url="https://dig.chouti.com/login",
    data=data,
    headers=header,
    cookies=cookie1_dict
)

print(response1.text)  # print the page returned after logging in
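
The script above copies cookies from one response into the next request by hand. requests.Session can do that bookkeeping automatically, since a session keeps cookies and default headers across requests. The following is a sketch only: build_session and login are hypothetical helper names, and whether the Chouti endpoints still behave as described is an assumption carried over from the example above.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)


def build_session():
    """Create a session that sends a browser-like User-Agent every time."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def login(session, phone, password):
    """Hypothetical helper: visit a page to collect cookies, then log in."""
    # Cookies set by this response are stored on the session.
    session.get("https://dig.chouti.com/all/hot/recent/1", timeout=10)
    # The stored cookies are sent automatically with the login request.
    return session.post(
        "https://dig.chouti.com/login",
        data={"phone": phone, "password": password, "oneMonth": 1},
        timeout=10,
    )


session = build_session()
print(session.headers["User-Agent"])
```

Calling login(session, phone, password) with real credentials would then perform the same two-step flow, and session.cookies.get_dict() would show the cookies collected along the way.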
