Web Scraping Basics

1. Overview of Web Crawlers

What is a crawler:

The process of writing a program that simulates a browser and then lets it crawl/scrape data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.

Types of crawlers:

General-purpose crawler: crawls the data of an entire page; this is the fetching system (the crawler program).
Focused crawler: extracts a specific part of the data in a page; it is always built on top of a general-purpose crawl.
Incremental crawler: monitors a website for updates so that only the newly published data is crawled.

Risk analysis

Use crawlers responsibly.
Where the risks come from:
The crawler interferes with the normal operation of the site being visited;
The crawler scrapes specific types of data or information protected by law.
How to avoid the risks:
Strictly follow the robots protocol set by the website;
While working around anti-crawling measures, optimize your code so that it does not disrupt the normal operation of the site;
When using or distributing the scraped information, review the content; if it contains personal information, private data, or someone else's trade secrets, stop immediately and delete it.

Anti-crawling mechanisms

Anti-anti-crawling strategies
robots.txt protocol: a plain-text protocol that states which data may and may not be crawled.
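
A quick way to respect this is to read the site's robots.txt before crawling. A minimal sketch (the domain is just an example; swap in whichever site you plan to crawl):

import requests

# Fetch and print a site's robots.txt to see which paths are allowed or disallowed.
robots_url = 'https://www.sogou.com/robots.txt'
print(requests.get(robots_url).text)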

Commonly used request headers (see the sketch below)

User-Agent: identifies the client sending the request
Connection: close
Content-Type
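
A hedged sketch of how these headers are attached to a request (the header values here are only examples):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'Connection': 'close',  # ask the server to close the connection after responding
    'Content-Type': 'application/x-www-form-urlencoded',  # only meaningful when a request body is sent
}
response = requests.get('https://www.sogou.com', headers=headers)
print(response.status_code)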

How do you tell whether a page contains dynamically loaded data?

Local search vs. global search in the packet-capture tool.

What is the first thing to do before crawling an unfamiliar website?
Determine whether the data you want to crawl is dynamically loaded!!!
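
A minimal sketch of the idea behind a global search: request the page with requests and check whether a piece of data you can see in the browser actually appears in the raw response. If it does not, the data is loaded dynamically and you have to find the extra request in the packet-capture tool. The url and keyword below are placeholders:

import requests

url = 'https://www.example.com/some-page'      # placeholder url
keyword = 'text you can see in the browser'    # placeholder keyword
page_text = requests.get(url).text
if keyword in page_text:
    print('Found in the raw page source: the data is NOT dynamically loaded.')
else:
    print('Not in the raw page source: the data is dynamically loaded (look for an ajax/js request).')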

2. Basic Usage of the requests Module

The requests module
Concept: a module built on network requests; its job is to simulate a browser sending requests.
Coding workflow:
Specify the url
Send the request
Get the response data (the crawled data)
Persist the data
import requests
url = 'https://www.sogou.com'
# the return value is a response object
response = requests.get(url=url)
# text returns the response data as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)

Build a simple web page collector based on Sogou

Fix the garbled-character (encoding) problem

Deal with User-Agent (UA) detection

import requests

wd = input('Enter a search keyword: ')
url = 'https://www.sogou.com/web'
# stores the dynamic request parameters
params = {
    'query': wd
}
#params wraps the query parameters of the request url
#headers implements UA spoofing to get past the UA-detection anti-crawling check
headers = {
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params,headers=headers)
#manually set the response encoding to fix garbled Chinese characters
response.encoding = 'utf-8'

data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "downloaded successfully")

1. Crawl detailed movie data from Douban

Analysis

When you scroll to the bottom of the page, an ajax request is fired and a batch of movie data is returned.
Dynamically loaded data: data obtained through an additional, separate request.
ajax can produce dynamically loaded data
js can produce dynamically loaded data
import requests
limit = input("How many entries from the top of the ranking list: ")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
#json() returns the deserialized object
data_list = response.json()

with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(f"{name} {score}\n")
print("Done")

2. Crawl KFC store location information

import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青岛",
    "pageIndex": "1",
    "pageSize": "10"
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# json() returns the deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        address = i['addressDetail']
        f.write(name + "," + address + "\n")
print("Done")

3. Crawl data from the drug administration

import requests

url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妆品.txt', "w", encoding="utf-8") as f:
    for i in range(1, 5):
        params = {
            "on": "true",
            "page": str(i),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }

        response = requests.post(url=url, params=params, headers=headers)
        data_dic = (response.json())

        # use a separate loop variable so the outer page counter i is not shadowed
        for item in data_dic["list"]:
            _id = item['ID']
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": _id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = (response2.json())
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']

            f.write(title + ":" + name + "\n")

3. Data Parsing

Parsing: extracting data according to specified rules.

Purpose: to implement a focused crawler.

Coding workflow of a focused crawler:

Specify the url
Send the request
Get the response data
Parse the data
Persist the data

Data parsing techniques:

Regular expressions
bs4
xpath
pyquery (extension)

What is the general principle behind data parsing?

Data parsing works on the page source (a collection of html tags).

What is the core purpose of html?

To display data.

How does html display data?

The data html displays is always placed inside html tags or in their attributes.

General principle (a tiny sketch follows below):

1. Locate the tag
2. Take its text or take an attribute
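
As a tiny illustration of these two steps, here is a hedged regex sketch run against an inline html string (the html and the pattern are made up for illustration; the sections below do the same thing with regex, bs4 and xpath on real pages):

import re

html = '<div class="song"><a href="http://www.example.com">example link</a></div>'
# 1. locate the <a> tag, 2. capture its href attribute and its text
href = re.findall(r'<a href="(.*?)">', html)[0]
text = re.findall(r'<a href=".*?">(.*?)</a>', html)[0]
print(href, text)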

1. Regex Parsing

1. Crawl image data from Qiushibaike

Crawl a single image

import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# content returns the response data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
        f.write(img_data)
print("成功")



Crawl a single page

<div class="thumb">

<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image url to get the image bytes
    response = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(response)
print("成功")


Crawl multiple pages

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1,5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"正在爬取第{i}页的图片")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image url to get the image bytes
        response = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(response)
print("成功")

2. bs4 Parsing

Environment setup

pip install bs4 

How bs4 parsing works

Instantiate a BeautifulSoup object (call it soup) and load the page source to be parsed into it,
then call the BeautifulSoup object's attributes and methods to locate tags and extract data.

How do you instantiate a BeautifulSoup object?

BeautifulSoup(fp, 'lxml'): used to parse data from a locally stored html document
BeautifulSoup(page_text, 'lxml'): used to parse page source fetched from the internet

Locating tags

soup.tagName: locates a tagName tag; only the first match is returned

Locating by attribute

soup.find('div', class_='s'): returns the div tag whose class is "s"
find_all: same usage as find, but the return value is a list

Locating with CSS selectors

select('selector'): the return value is a list
	tag, class, id, hierarchy (> means one level, a space means multiple levels)

Extracting data

Taking text

tag.string: only the text placed directly inside the tag
tag.text: all the text inside the tag

Taking an attribute

soup.find("a", id='tt')['href']

1. Crawl the text of the novel Romance of the Three Kingdoms (三国演义)

http://www.shicimingju.com/book/sanguoyanyi.html

Crawl the chapter titles + chapter contents

1. Parse the chapter titles and each chapter's detail-page url from the home page

from bs4 import BeautifulSoup
import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        mulu = a.text
        print(mulu)
        # request the chapter detail-page url and parse the chapter content out of it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        neirong = new_soup.find('div', class_='chapter_content').text
        f.write(mulu + ":" + neirong + "\n")

3. XPath Parsing

Environment setup

pip install lxml

How xpath parsing works

Instantiate an etree object and load the page source data into it,
then call the object's xpath method with different kinds of xpath expressions to locate tags and extract data.

Instantiating the etree object

tree = etree.parse(fileName)
tree = etree.HTML(page_text)
The xpath method always returns a list.

Locating tags

tree.xpath("")
A / at the very left of an xpath expression means the tag must be located starting from the root node.
A // at the very left of the expression means the tag can be located starting from any position.
A // that is not at the very left means any number of nested levels.
A / that is not at the very left means exactly one level.

Locating by attribute: //div[@class='ddd']

Locating by index: //div[@class='ddd']/li[3]  # indexing starts at 1
Locating by index: //div[@class='ddd']//li[2] # indexing starts at 1

Extracting data

Taking text:
tree.xpath("//p[1]/text()"): only the text placed directly inside the tag
tree.xpath("//div[@class='ddd']/li[2]//text()"): all the text under the tag
Taking an attribute:
tree.xpath('//a[@id="feng"]/@href')
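
A small sketch of these expressions run against an inline html string (the structure is made up for illustration):

from lxml import etree

html = '''
<html>
  <body>
    <div class="ddd">
      <ul>
        <li>first</li>
        <li>second <span>item</span></li>
      </ul>
    </div>
    <a id="feng" href="http://www.example.com">link</a>
  </body>
</html>
'''
tree = etree.HTML(html)
print(tree.xpath('//div[@class="ddd"]//li'))             # // inside the expression: any number of levels
print(tree.xpath('//div[@class="ddd"]//li[1]/text()'))   # direct text; indexing starts at 1
print(tree.xpath('//div[@class="ddd"]//li[2]//text()'))  # all the text under the tag
print(tree.xpath('//a[@id="feng"]/@href'))               # take an attribute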

1. Crawl job listings from Boss Zhipin

from lxml import etree
import requests
import time


url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    #extract the relevant data from the partial page source represented by li
    #when an xpath expression is used inside a loop, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # request the detail-page url and parse out the job description
    detail_page_text = requests.get(detail_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    job_desc = tree.xpath('//div[@class="text"]/text()')
    #join the list into a string
    job_desc = ''.join(job_desc)
    print(job_title,company,job_desc)
    time.sleep(5)

2. Crawl Qiushibaike

Crawl the author and the post text. Note that an author can be anonymous or registered.

from lxml import etree
import requests


url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)

for div in div_list:
    # the author block differs between anonymous users and registered users
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)


3. Crawl images from a website

from lxml import etree
import requests
import os
dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"

    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        #fix garbled Chinese in the image name
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        response = requests.get(img_src).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(response)
    print(f"第{i}页成功")

4. IP Proxies

Proxy servers

A proxy server forwards your requests, which lets you change the ip address the requests appear to come from.

Proxy anonymity levels

Transparent: the server knows you are using a proxy and also knows your real ip
Anonymous: the server knows you are using a proxy but does not know your real ip
High anonymity (elite): the server does not know you are using a proxy, let alone your real ip

Proxy types (a usage sketch follows below)

http: this type of proxy can only forward http requests

https: can only forward https requests
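
A hedged sketch of how a proxy is passed to requests once you have an ip and a port (the address below is a placeholder, not a working proxy; a full proxy-pool example follows further down):

import requests

# Placeholder proxy address for illustration only; substitute a real ip:port.
proxies = {
    'http': 'http://12.34.56.78:9999',    # used for http:// urls
    'https': 'https://12.34.56.78:9999',  # used for https:// urls
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
# The request is forwarded through the proxy, so the target site sees the proxy's ip instead of yours.
page_text = requests.get('https://www.baidu.com/s?wd=ip', headers=headers, proxies=proxies).text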

Websites offering free proxy ips

快代理 (Kuaidaili)
西祠代理 (Xici)
goubanjia
代理精灵 (recommended): http://http.zhiliandaili.cn/

What do you do when your ip gets banned while crawling?

Use a proxy
Build a proxy pool
Use a dial-up server

import requests
import random
from lxml import etree

# the proxy pool, stored as a list
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# crawl the free proxy ips listed on Kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers,proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # note: browser dev tools insert tbody automatically; if tbody is not in the raw page source, leave it out of the xpath
    tr_list = tree.xpath('//*[@id="list"]/table/tbody/tr')
    for tr in tr_list:
        ip = tr.xpath("./td/text()")[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip":ip,
            "port":port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"第{i}页")
print(len(free_proxies))

5. Handling Cookies

Video parsing APIs

https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-

Video parsing sites

牛巴巴     http://mv.688ing.com/
爱片网     https://ap2345.com/vip/
全民解析   http://www.qmaile.com/

Back to the main topic

Why do cookies need to be handled?

They keep track of the client's state.

How do you handle cookie-based anti-crawling, where requests must carry a cookie?

#manual handling
Capture the cookie in the packet-capture tool and put it into headers


#automatic handling
Use the session mechanism
Use case: cookies that change dynamically
session object: its usage is almost identical to the requests module; if a cookie is produced during a request sent through the session, the cookie is automatically stored in the session

import requests

s = requests.Session()
main_url = "https://xueqiu.com"  # 先对url发请求获取cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json?size=8&_type=10&type=10'

page_text = s.get(url, headers=headers).json()
print(page_text)

6. Captcha Recognition

Use an online captcha-solving platform

1. Register and log in (complete identity verification in the user center)

2. After logging in

Create a software entry: Software ID -> generates a software id

Download the sample code: Development docs -> Python -> Download

Demo of the platform's sample code

import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])

Recognize the captcha on the Gushiwen (古诗文网) site

zbb.py

import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def www(path,type):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, type)['pic_str']

main.py (note: do not name this file requests.py, or it will shadow the requests module it imports)

import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url,headers=headers).content
with open('./111.jpg','wb') as f:
    f.write(img_data)
img_text = www('./111.jpg',1004)
print(img_text)

7. Simulated Login

Why do crawlers need to implement a simulated login?

Some data is only shown after logging in.

Gushiwen (古诗文网)

Anti-crawling mechanisms involved

1. Captcha
2. Dynamic request parameters: the request parameters change with every request
	Capturing them dynamically: usually the dynamic request parameters are hidden in the source of the front-end page
3. The cookie is issued by the captcha-image request
 A real pain

import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# get the cookie
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)

# get the captcha
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)

# dynamically capture the dynamic request parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# the url requested when the login button is clicked, captured with the packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # 变化的
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "542154983@qq.com",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登录"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)

8. Asynchronous Crawling with a Thread Pool

Use a thread pool to asynchronously crawl the first ten pages of Qiushibaike

import requests
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
#collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')

#the function that sends a request
def get_request(url):
    # must take exactly one parameter (pool.map passes in one url at a time)
    return requests.get(url, headers=headers).text
#instantiate a thread pool with 10 threads
pool = Pool(10)
response_text_list = pool.map(get_request,urls)
print(response_text_list)

9. Single Thread + Multi-task Asynchronous Coroutines

1. Introduction

Coroutine: an object

#A coroutine can be thought of as a special function. If a function definition is decorated with the async keyword, calling it does not execute the statements in its body immediately; instead it returns a coroutine object.

Task object (task)

#A task object is a further wrapper around a coroutine object; it can expose the running state of the coroutine.
#Task objects ultimately need to be registered with the event loop object.

Binding a callback

#A callback function is bound to a task object; it is executed only after the task's special function has finished running.

Event loop object

#An object that loops indefinitely. You can think of it as a container that holds multiple task objects (i.e. a set of code blocks waiting to run).

Where the asynchrony shows up

#Once the event loop starts, it executes the task objects in order,
    #but when a task hits a blocking operation the event loop does not wait; it moves straight on to the next task object

await: the suspend operation; it yields the cpu

Single task

from time import sleep
import asyncio


# the callback function:
# its default argument is the task object
def callback(task):
    print('i am callback!!1')
    print(task.result())  # result() returns the return value of the special function wrapped by the task


async def get_request(url):
    print('requesting:', url)
    sleep(2)
    print('request finished:', url)
    return 'hello bobo'


# create a coroutine object
c = get_request('www.1.com')
# wrap it in a task object
task = asyncio.ensure_future(c)

# bind the callback function to the task object
task.add_done_callback(callback)

# create an event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(task)  # register the task with the event loop and start the loop


2. Multi-task Asynchronous Coroutines

import asyncio
from time import sleep
import time
start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]
#code from modules that do not support async must not appear in the code block to be executed
#any blocking operation inside this function must be decorated with the await keyword
async def get_request(url):
    print('requesting:', url)
    # sleep(2)
    await asyncio.sleep(2)
    print('request finished:', url)
    return 'hello bobo'

tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time()-start) 

Notes:

1. Store the task objects in a list, then register the list with the event loop; during registration the list must be wrapped with the wait method.
2. Inside the special function that a task wraps, code from modules that do not support async must not appear, otherwise the whole asynchronous effect is broken; in addition, every blocking operation inside that function must be decorated with the await keyword.
3. Code using the requests module must not appear inside the special function, because requests does not support async.

3. aiohttp

A network-request module that supports asynchronous operations

- Environment setup: pip install aiohttp
import asyncio
import requests
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]


# requests alone cannot achieve the asynchronous effect because it does not support async, so aiohttp is used here
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read():byte
            page_text = await response.text()
            return page_text

    # detail: put async before every with, and await before every blocking step


def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)


if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))

    print(time.time() - start)


10. selenium

Concept

A module for browser automation.

Environment setup:

Install the selenium module

What is the connection between selenium and crawling?

A convenient way to get data that is dynamically loaded in a page
     with the requests module: what you see is not necessarily what you get
     with selenium: what you see is what you get
It can also implement simulated login

Basic operations:

Download address for the Chrome driver:
http://chromedriver.storage.googleapis.com/index.html

Mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672

Action chains

A sequence of actions

Headless browser

A browser with no visible interface
PhantomJS (see the headless Chrome sketch below)
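
Besides PhantomJS, Chrome itself can run without a visible window (headless mode). A minimal sketch, with the driver path as a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')     # no visible browser window
chrome_options.add_argument('--disable-gpu')
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=chrome_options)
bro.get('https://www.baidu.com')
print(bro.page_source[:200])  # the page source is still available even though nothing is shown
bro.quit()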

1. Basic operations example: JD.com

from selenium import webdriver
from time import sleep
#1. instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
#2. simulate a user sending a request
url = 'https://www.jd.com'
bro.get(url)
#3. locate the tag
search_input = bro.find_element_by_id('key')
#4. interact with the located tag
search_input.send_keys('华为')
#5. a sequence of actions
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
#6. execute js code
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
#7. close the browser
bro.quit()

2. Crawl the drug administration site with selenium

from selenium import webdriver
from lxml import etree
from time import sleep

page_text_list = []
# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# must wait for the page to finish loading
sleep(2)
# page_source is the source code of the page currently open in the browser

page_text = bro.page_source
page_text_list.append(page_text)
#the scroll must match the window: the next-page button has to be visible in the window before it can be clicked
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
#open the next two pages
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)

    page_text = bro.page_source
    page_text_list.append(page_text)

for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()

3. Action Chains

from lxml import etree
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

# instantiate a browser object
page_text_list = []
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# if the tag to locate lives inside a sub-page of an iframe, you must call switch_to before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')

# 1. instantiate an action-chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)

for i in range(5):
    #perform() makes the action chain execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
#release the mouse
action.release()

sleep(3)

bro.quit()

4. Dealing with selenium detection

Many sites, such as Taobao, block selenium-driven crawling.

In a normal browser, typing window.navigator.webdriver in the console returns undefined.

In a browser opened by selenium code it returns true.

from selenium import webdriver
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

#instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe',options=option)
bro.get('https://www.taobao.com/')


5. Simulated login to 12306

from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used for cropping the image (pillow package)
from zbb import www
from time import sleep

bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)

username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# capture (and later crop) the captcha image
bro.save_screenshot('main.png')
# locate the tag that holds the captcha image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # coordinates of the captcha image's top-left corner within the whole page
size = code_img_ele.size  # width and height of the captcha image
# the rectangle to crop (coordinates of the top-left and bottom-right corners)
rangle = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))

i = Image.open('main.png')
frame = i.crop(rangle)
frame.save('code.png')

# use the captcha-solving platform to recognize the image
result = www('./code.png', 9004)
# x1,y1|x2,y2|x3,y3  ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # [[x1,y1],[x2,y2],[x3,y3]]; each element is the coordinates of one point, with (0,0) at the top-left corner of the captcha image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)

btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()


action.release()
sleep(3)
bro.quit()


Original article: https://www.cnblogs.com/zdqc/p/13408310.html