【3】Data Filtering 2

#1. Overview
- requests bills itself as the only "non-GMO" HTTP library for Python, safe for human consumption. It follows the core principles of PEP 20:
  - Beautiful is better than ugly
  - Explicit is better than implicit
  - Simple is better than complex
  - Complex is better than complicated
  - Readability counts
- Package index: https://pypi.org/project/requests/
- Official documentation: http://www.python-requests.org/en/master/
- Official Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html
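#2. Download and installation
- (1) Installing from the command line
  - Windows: open a command prompt and run the package manager directly: `pip install requests` or `easy_install requests`
  - Unix/Linux: open a shell and run the package manager: `pip install requests`
- (2) Installing from an offline package
  - Download the offline package `requests-2.20.0-py2.py3-none-any.whl` (see attachment), then run `pip install requests-2.20.0-py2.py3-none-any.whl`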
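#3. Getting started
```
# Import the dependency
import requests

# Send a request and fetch data from the server
response = requests.get("http://www.sina.com.cn")

# Print the data
print(response.text)
```
- requests.get: sends a GET request to the server and returns the server's response
- response.text: the text data contained in the response object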
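
The getting-started program assumes the request always succeeds. A minimal sketch of a more defensive variant follows; the helper name `fetch_text` and the 5-second default timeout are assumptions made for illustration, not part of the original notes.

```python
import requests


def fetch_text(url, timeout=5.0):
    """Return the response body as text, or None if the request fails.

    fetch_text is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts and HTTP errors
        print(f"request failed: {exc}")
        return None
    return response.text


# Example call (needs network access):
#   html = fetch_text("http://www.sina.com.cn")
```

Catching `requests.RequestException` keeps the sketch short because it is the base class of the library's own exceptions; a real crawler may want to treat timeouts and HTTP errors differently.
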
#4. Request object: request methods
- The HTTP/1.1 specification defines eight request methods: GET, POST, PUT, DELETE, OPTIONS, CONNECT, HEAD and TRACE. PATCH was added later by RFC 5789.
- RESTful web projects, popular today, mainly use four methods: GET, POST, PUT and DELETE. Conventional web projects mostly use two: GET and POST.
- requests provides the following functions:
  - requests.request(method, url, **kwargs): the low-level function that sends a request
  - requests.get(url, params=None, **kwargs): sends a GET request
  - requests.post(url, data=None, json=None, **kwargs): sends a POST request
  - requests.put(url, data=None, **kwargs): sends a PUT request
  - requests.delete(url, **kwargs): sends a DELETE request
  - requests.patch(url, data=None, **kwargs): sends a PATCH request
  - requests.options(url, **kwargs): sends an OPTIONS request
  - requests.head(url, **kwargs): sends a HEAD request
- Source code: requests/api.py
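
The helper functions listed above all go through `requests.request(method, url, **kwargs)`. The sketch below builds one request per method without sending it, so the result can be inspected offline. The `prepare` helper and the httpbin.org URL are assumptions chosen for illustration.

```python
import requests

# Placeholder endpoint; nothing is sent in this sketch
URL = "http://httpbin.org/anything"

METHODS = ("GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD")


def prepare(method, url, **kwargs):
    """Build a request without sending it (hypothetical helper)."""
    return requests.Request(method, url, **kwargs).prepare()


prepared = [prepare(method, URL) for method in METHODS]
for item in prepared:
    print(item.method, item.url)

# Sending for real (needs network access): these two calls are equivalent
#   requests.get(URL)
#   requests.request("GET", URL)
```
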
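#5. Request object: passing GET parameters
- requests.get(url, params=None, **kwargs): sends a GET request
  - @param url: address of the server the GET request is sent to
  - @param params: parameters sent along with the GET request
  - @param kwargs: other optional arguments; see the source code of requests.request() for details
```
import requests

target_url = 'http://www.baidu.com/s'
data = {'wd': '魔道祖师'}

response = requests.get(target_url, params=data)
print(response.text)
```
- Parameters for a GET request are supplied as a dict and passed directly to the params argument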
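
To see what the params dict turns into, the request can be prepared without being sent and its final URL inspected. This is a sketch under assumptions: `build_search_url` is a hypothetical helper, and the `page` parameter is invented for illustration and is not a documented parameter of the search site.

```python
import requests
from urllib.parse import quote


def build_search_url(keyword, page=None):
    """Return the URL requests would use for a search (hypothetical helper)."""
    # 'page' is an invented parameter; a value of None is left out of the URL
    params = {"wd": keyword, "page": page}
    prepared = requests.Request(
        "GET", "http://www.baidu.com/s", params=params
    ).prepare()
    return prepared.url


print(build_search_url("python"))
print(build_search_url("python", page=2))
# Non-ASCII values are percent-encoded as UTF-8
print(build_search_url("魔道祖师"))
print(quote("魔道祖师"))
```
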
#6. Request object: passing POST parameters
- requests.post(url, data=None, json=None, **kwargs): sends a POST request
  - @param url: URL of the server the POST request is sent to
  - @param data: ordinary form parameters carried in the POST request
  - @param json: dict-like data sent in the POST request as JSON
```
# Import the dependency
import requests

# Define the target URL
# url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

# Parameters carried in the POST request
data = {
    "i": "hello",
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "1541660576025",
    "sign": "4425d0e75778b94cf440841d47cc64fb",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false",
}

# Send the request and print the response data returned by the server
response = requests.post(url, data=data)
print(response.text)
```
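
The difference between the data and json arguments shows up in the request body and the Content-Type header. The sketch below prepares both variants without sending them; the httpbin.org URL and the shortened payload are placeholders.

```python
import json

import requests

# Placeholder endpoint and a shortened payload; nothing is sent
URL = "http://httpbin.org/post"
payload = {"i": "hello", "doctype": "json"}

# data=: the dict is form-encoded
form_request = requests.Request("POST", URL, data=payload).prepare()
# json=: the dict is serialized to a JSON document
json_request = requests.Request("POST", URL, json=payload).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json_request.body)

# The JSON body decodes back to the original dict
decoded = json.loads(json_request.body)
```

A server that expects form fields will usually not understand a JSON body, so which argument to use depends on what the target endpoint accepts.
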
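#7. Request object: custom request headers
- Every request in the requests module is ultimately handled by requests.request(method, url, **kwargs). Put the header data in a dict and pass it to the headers argument of the request function.
```
# Import the dependencies
import requests
# Third-party package: pip install fake-useragent
from fake_useragent import UserAgent

ua = UserAgent()

# Define the request URL, headers and query parameters
url = 'http://www.baidu.com/s'
headers = {'User-Agent': ua.random}
param = {'wd': 'PYTHON 爬虫'}

# Send the request and print the response data
response = requests.get(url, params=param, headers=headers)
print(response.text)
```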
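
Without a custom header, requests identifies itself with its own User-Agent, which is one reason crawlers override it. The sketch below compares the default and the customized headers without sending anything. It uses `requests.Session.prepare_request` to merge in the library defaults, and the User-Agent string `my-crawler/0.1` is made up for illustration.

```python
import requests

URL = "http://www.baidu.com/s"
session = requests.Session()

# Headers requests would send when nothing is customized
default_request = session.prepare_request(requests.Request("GET", URL))

# Same request with a custom User-Agent (the value is a made-up example)
custom_request = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": "my-crawler/0.1"})
)

print(default_request.headers["User-Agent"])
print(custom_request.headers["User-Agent"])
# Headers that were not overridden keep their default values
print(dict(custom_request.headers))
```
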
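#8. Request object: cookies
- In web development, cookies are often used to keep state on the client side, so handling cookies is one of the most important operations in everyday crawling.
- You can define a dict and pass it to the cookies argument of a request function to add cookie data to the request.
- requests also provides requests.cookies.RequestsCookieJar(), which can add cookie data to a request as well and is better suited to scenarios that span several domains.
```
import requests
from requests.cookies import RequestsCookieJar

url = 'http://httpbin.org/cookies'

# 1. First way: define a dict directly
# cookie_data = {'name': 'jerry'}

# 2. Second way: use a cookie jar object
cookie_data = RequestsCookieJar()
cookie_data.set('name', 'tom')

response = requests.get(url, cookies=cookie_data)
response.encoding = 'utf-8'
print(response.text)
```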
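
What makes the cookie jar suitable for several domains is that each cookie can carry its own domain and path, and only matching cookies are attached to a request. A minimal sketch, prepared offline without sending; the cookie names and values and the example.com domain are made up for illustration.

```python
import requests
from requests.cookies import RequestsCookieJar

URL = "http://httpbin.org/cookies"

jar = RequestsCookieJar()
# Matches the request's domain and path
jar.set("name", "tom", domain="httpbin.org", path="/cookies")
# Same domain, different path: not sent to /cookies
jar.set("debug", "1", domain="httpbin.org", path="/elsewhere")
# Different domain: not sent to httpbin.org
jar.set("token", "abc", domain="example.com", path="/")

scoped = requests.Request("GET", URL, cookies=jar).prepare()
print(scoped.headers["Cookie"])

# A plain dict carries no domain or path information
plain = requests.Request("GET", URL, cookies={"name": "jerry"}).prepare()
print(plain.headers["Cookie"])
```
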
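#9. Response object
- Data collected by a crawler falls mainly into two types: text data and binary data. The response object in requests wraps the returned data accordingly:
  - response.encoding: the encoding used to decode the response data; it can be assigned directly, e.g. response.encoding = 'utf-8'
  - response.text: the text data contained in the response
  - response.content: the binary data contained in the response
  - response.json(): the JSON data in the response; the body must parse correctly, otherwise a ValueError is raised
  - response.raw: in special cases, direct access to the underlying raw data stream; the request must set stream=True to allow stream processing
  - response.headers: the response headers
  - response.status_code: the response status code
  - response.cookies: the cookie data contained in the response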
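
The attributes above can be tried out offline on a hand-built response. This is only a sketch: `make_response` and `safe_json` are hypothetical helpers, and setting the private `_content` attribute is a shortcut for illustration, not something crawler code would normally do.

```python
import requests


def make_response(body, status_code=200, encoding="utf-8"):
    """Build a Response by hand for offline experiments (hypothetical helper)."""
    response = requests.Response()
    response._content = body  # private attribute, set here only for the demo
    response.status_code = status_code
    response.encoding = encoding
    return response


def safe_json(response):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        return None


ok = make_response(b'{"name": "tom"}')
print(ok.status_code, ok.text, ok.json()["name"])

# content is raw bytes; text is those bytes decoded with response.encoding
gbk_page = make_response("中文".encode("gbk"), encoding="gbk")
print(gbk_page.content, gbk_page.text)

broken = make_response(b"<html>not json</html>")
print(safe_json(broken))
```
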
