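Data parsing
Three ways to do it are covered here:
- re (regular expressions)
- bs4
- xpath
Overview: data parsing means extracting a specific part of the data from a larger body of data, such as a page's source.
Purpose: data parsing is what makes a focused crawler possible.
General principle of data parsing
In a web page, the data of interest is stored either in the text of HTML tags or in their attributes. Parsing therefore comes down to two steps: locate the tag, then extract its text or an attribute value.
Let's start with a simple case, two ways to download an image: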
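As a rough illustration of those two steps (locate a tag, then take its text or an attribute), here is a minimal sketch that uses only Python's standard-library html.parser. The HTML snippet and the class name LinkCollector are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a downloaded page.
SAMPLE_HTML = (
    '<div><a href="/pic/1.png">first picture</a>'
    '<a href="/pic/2.png">second picture</a></div>'
)


class LinkCollector(HTMLParser):
    """Collects the href attribute and the text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []   # attribute values
        self.texts = []   # tag text
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # step 1: locate the tag
            self._in_a = True
            # step 2a: take an attribute value
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:  # step 2b: take the text
            self.texts.append(data)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
print(collector.texts)
```

The libraries discussed in the rest of these notes (re, bs4, xpath) do the same job with far less code.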
import requests
import urllib.request

url = "http://pic.duanziwang.com/Uploads/Images/81/56b4746386883.png"

# Method 1: requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
pic_data = requests.get(url=url, headers=headers).content  # .content returns the response body as bytes
with open('./1.png', 'wb') as f:
    f.write(pic_data)

# Method 2: urllib
urllib.request.urlretrieve(url=url, filename='./2.png')
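Both methods can download the image, but method 1 lets you spoof the User-Agent (UA) through the headers argument, while urlretrieve in method 2 has no such argument.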
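urlretrieve itself takes no headers argument, but urllib can still send a custom User-Agent if the request is built by hand. The sketch below shows one way to do it with urllib.request.Request and urlopen. The helper names build_request and download are hypothetical, and the shortened UA string is only a placeholder.

```python
import urllib.request

url = "http://pic.duanziwang.com/Uploads/Images/81/56b4746386883.png"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA string


def build_request(url, headers):
    """Return a urllib Request that carries the given headers."""
    return urllib.request.Request(url, headers=headers)


def download(url, filename, headers):
    """Fetch url with custom headers and save the bytes to filename."""
    with urllib.request.urlopen(build_request(url, headers)) as resp:
        data = resp.read()
    with open(filename, "wb") as f:
        f.write(data)


req = build_request(url, headers)
print(req.get_header("User-agent"))  # urllib stores header names capitalized
# download(url, "./3.png", headers)  # uncomment to actually fetch the image
```

In practice requests is usually the more convenient choice, which is why the later examples use it.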
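What is the difference between the page source shown in the Response tab and the Elements tab of the browser's developer tools?
Elements: shows the page as it is after loading has finished, that is, the combined data of all the requests the page made (including dynamically loaded data).
Response: shows only the data returned by that one request (not including dynamically loaded data).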
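A practical consequence is that a plain request from a script only ever sees what the Response tab shows. One quick way to tell whether a piece of data is dynamically loaded is to search for it in the raw response text. The helper name and the sample strings below are hypothetical.

```python
def in_raw_response(page_text, keyword):
    """True if keyword is present in the source returned by a single request."""
    return keyword in page_text


# Hypothetical response body: the list is filled in later by JavaScript.
raw = '<ul id="list"></ul><script src="/load_items.js"></script>'

# The container itself is part of the response.
print(in_raw_response(raw, 'id="list"'))
# Its items are not, so they must come from a later, dynamic request.
print(in_raw_response(raw, "item 1"))
```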
Next, let's scrape images, using a regular expression to parse out the image paths:
import os
import re
import urllib.request
import requests

url = "http://duanziwang.com/pic/5439/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
page_text = requests.get(url=url, headers=headers).text
dirname = "imglist"
if not os.path.exists(dirname):
    os.mkdir(dirname)
# Capture the src of the <img> inside each <dl>...</dl> block
ex = r'<dl.*?<img src="(.*?)".*?</dl>'
# re.S lets . match newlines, since the tags span several lines
img_list = re.findall(ex, page_text, re.S)
for i in img_list:
    imgName = i.split("/")[-1]
    image_path = dirname + "/" + imgName
    # Download the image itself (i), not the page URL
    urllib.request.urlretrieve(url=i, filename=image_path)
    print(imgName, "downloaded successfully")
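When the tags to match span several lines, re.findall needs the re.S flag so that . also matches newlines. The following offline sketch shows the difference on a made-up fragment. The markup and the example.com addresses are assumptions, and the real page layout may differ.

```python
import re

# Hypothetical page fragment.
page_text = """
<dl class="xhlist">
  <dd><img src="http://example.com/img/a.png" alt="a"></dd>
</dl>
<dl class="xhlist">
  <dd><img src="http://example.com/img/b.png" alt="b"></dd>
</dl>
"""

ex = r'<dl.*?<img src="(.*?)".*?</dl>'

without_flag = re.findall(ex, page_text)      # . stops at newlines: nothing matches
with_flag = re.findall(ex, page_text, re.S)   # . also matches newlines

print(without_flag)
print(with_flag)
print([src.split("/")[-1] for src in with_flag])  # file names
```

The non-greedy .*? matters too: a greedy .* would run from the first <dl to the last </dl> and yield a single match.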
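bs4
How it works:
Instantiate a BeautifulSoup object and load the page source to be parsed into it.
Use the object's methods and attributes to locate tags and extract their text.
Ways to instantiate a BeautifulSoup object:
- BeautifulSoup(fp, "lxml"): loads the contents of a local file (fp is an open file object) into the object for parsing.
- BeautifulSoup(page_text, "lxml"): loads data requested from the internet into the object for parsing.
bs4 parsing operations
Tag location: the return value is always the located tag or tags
- soup.tagName: locates the first tagName tag; returns a single tag
- Attribute location: soup.find('tagName', attrName='value'); returns a single tag (for the class attribute, write class_)
- soup.find_all('tagName', attrName='value'): returns a list of all matching tags
- Selector location: soup.select('selector'); also returns a list
- Hierarchy selector: > means exactly one level (a direct child)
- Space: means any number of levels (a descendant)
- Getting text: string returns only the tag's own direct text; text returns all the text inside the tag, including that of nested tags.
- Getting an attribute: tag['attrName']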
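The operations listed above can be tried offline on a small string. This is a minimal sketch on made-up markup. It passes "html.parser" (the parser from the standard library) so that it runs without lxml, and the calls behave the same way with "lxml".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div id="content">
  <p class="intro">Hello <b>world</b></p>
  <ul class="links">
    <li><a href="/a">A</a></li>
    <li><a href="/b">B</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                               # first <p> tag
intro = soup.find("p", class_="intro")         # single tag, located by attribute
items = soup.find_all("li")                    # list of tags
child_links = soup.select(".links > li > a")   # > : one level at a time
desc_links = soup.select("#content a")         # space: any number of levels

print(intro.string)   # None, because <p> contains a nested tag as well as text
print(intro.text)     # all text inside <p>, nested tags included
print(child_links[0].string, child_links[0]["href"])
```

Note that string only gives a value when the tag holds nothing but text; for tags with nested children, text is the safer choice.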
Example:
from bs4 import BeautifulSoup
import requests
url = "http://duanziwang.com/pic/5440/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text,"lxml")
# print(soup)
# print(soup.find("div",class_="nr"))
# print(soup.find("div",id="content"))
# print(soup.find_all("dl" ,class_="xhlist"))
# print(soup.select("#content"))
# print(soup.select(".xhlist>dd>img"))
# print(soup.select(".xhlist dd"))
# print(soup.title)
# lis = soup.select(".xhlist>dd")
# print(lis[6].text)
# div_text = soup.find("div",class_="content").text
# print(div_text)
# ss = soup.select(".a_type>a")[0]
# print(ss["href"])
----------------------------------------------------------------------------------------------------------------------------------------------------
Downloading images in a loop
from bs4 import BeautifulSoup
import requests
import os
import time
import urllib.request

url = "http://duanziwang.com/pic/5440/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, "lxml")
ss = soup.select(".xhlist>dd>img")
dirname = "imglist"
if not os.path.exists(dirname):
    os.mkdir(dirname)
# First attempt, using urlretrieve. Without UA spoofing the files are not
# recognized as images, or the downloaded images are corrupted.
# for i in ss:
#     print(i)
#     imgName = i["src"].split("/")[-1]
#     image_path = dirname + "/" + imgName
#     urllib.request.urlretrieve(url=i["src"], filename=image_path)
#     print(imgName, "downloaded successfully")
#     time.sleep(3)
for i in ss:
    print(i)
    imgName = i["src"].split("/")[-1]
    image_path = dirname + "/" + imgName
    pic_data = requests.get(url=i["src"], headers=headers).content  # .content returns the response body as bytes
    with open(f'./{image_path}', 'wb') as f:
        f.write(pic_data)
    print(imgName, "downloaded successfully")
    time.sleep(3)
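The loop uses i["src"] directly as the request URL, which assumes the page gives absolute image addresses. If a page used relative src values (an assumption, not something checked for this site), urllib.parse.urljoin could resolve them against the page URL first. The helper name resolve_image and the sample src values are hypothetical.

```python
import os
from urllib.parse import urljoin, urlparse

page_url = "http://duanziwang.com/pic/5440/"


def resolve_image(page_url, src, dirname="imglist"):
    """Return (absolute_url, local_path) for an <img> src value."""
    absolute = urljoin(page_url, src)
    # Take the file name from the URL path so a query string does not end up in it.
    name = os.path.basename(urlparse(absolute).path)
    return absolute, os.path.join(dirname, name)


# Hypothetical src values: one relative, one absolute with a query string.
print(resolve_image(page_url, "/Uploads/Images/1.png"))
print(resolve_image(page_url, "http://pic.duanziwang.com/Uploads/2.png?v=3"))
```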
Downloading a novel:
from bs4 import BeautifulSoup
import requests
import time

url = "https://www.shicimingju.com/book/sanguoyanyi.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, "lxml")
ss = soup.select(".book-mulu>ul>li>a")
fp = open("./sanguo.txt", "a", encoding="utf-8")
for i in ss:
    title = i.string  # chapter title
    zhangjie = 'https://www.shicimingju.com' + i["href"]  # chapter URL
    content_txt = requests.get(url=zhangjie, headers=headers).text
    count = BeautifulSoup(content_txt, "lxml")
    div_tag = count.find("div", class_="chapter_content")
    content = div_tag.text
    fp.write(title + ":" + content + "\n")
    print(title, "downloaded successfully")
    time.sleep(3)
fp.close()
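xpath parsing
How it works: instantiate an etree object and load the data to be parsed into it, then call the object's xpath method with different xpath expressions to locate tags and extract data.
Ways to instantiate an etree object:
etree.parse("filepath"): loads the data of a local file into the object.
etree.HTML(page_text): loads data requested from the internet into the object.
All HTML tags follow a tree structure, which makes traversal and lookup (location) efficient. For the location expressions used here, the xpath method always returns a list.
Tag location:
- Leftmost /: the expression must start locating from the root tag.
- Non-leftmost /: means one level
- Leftmost //: locates tags starting from any position
- Non-leftmost //: means multiple levels
- //tagName: locates all tagName tags
- Attribute location: //tagName[@attrName="value"]
- Index location: //tagName[index]; the index starts at 1
Getting text:
- /text(): the tag's direct text; the returned list usually has a single element
- //text(): all the text inside the tag; the returned list has multiple elements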
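These rules can be checked offline on a small string. The sketch below runs each kind of expression against made-up markup; the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical markup for illustration.
html = """
<html><body>
  <div id="main">
    <p>intro <b>bold</b> tail</p>
    <ul>
      <li class="item"><a href="/a">A</a></li>
      <li class="item"><a href="/b">B</a></li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(html)

all_li = tree.xpath('//li')                              # every <li>
by_attr = tree.xpath('//div[@id="main"]')                # attribute location
second = tree.xpath('//li[2]/a/text()')                  # index starts at 1
from_root = tree.xpath('/html/body/div/ul/li/a/@href')   # leftmost /: from the root
direct = tree.xpath('//p/text()')                        # direct text of <p> only
everything = tree.xpath('//p//text()')                   # text of nested tags too

# Local parsing: ./ is relative to the tag the xpath method is called on.
for li in all_li:
    print(li.xpath('./a/text()')[0], li.xpath('./a/@href')[0])
```

Here direct has two elements ('intro ' and ' tail') because the nested <b> splits the direct text of <p> in two, so /text() returns a single element only when the tag's text is not interrupted by child tags.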
Here is an example:
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
url = "https://www.huya.com/g/lol"
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="J_liveCardList"]/ul/li')
for li in li_list:
    # Local parsing: parse specific content under one located tag.
    # In a local xpath expression, the leftmost ./ stands for the tag
    # the xpath method is called on (here, li).
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    detail_url = li.xpath('./a[1]/@href')[0]
    print(title, hot, detail_url)
Fixing garbled text when scraping with xpath
from lxml import etree
import requests
import os
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
url = "https://pic.netbian.com/4kmeinv/index_%d.html"
dirname = "4ktupian"
os.makedirs(dirname, exist_ok=True)
for page in range(1, 11):
    # The first page has no index_N suffix
    if page == 1:
        new_url = "https://pic.netbian.com/4kmeinv/"
    else:
        new_url = url % page
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    list_li = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in list_li:
        img_name = li.xpath('./a/img/@alt')[0] + ".jpg"
        # The name comes back garbled: re-encode it as iso-8859-1, then decode it as gbk
        img_name = img_name.encode('iso-8859-1').decode("gbk")
        img_src = "https://pic.netbian.com" + li.xpath('./a/img/@src')[0]
        pic_data = requests.get(url=img_src, headers=headers).content  # .content returns the response body as bytes
        img_path = dirname + "/" + img_name
        with open(f'./{img_path}', 'wb') as f:
            f.write(pic_data)
        print(img_name, "downloaded successfully")
        time.sleep(3)
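The encode('iso-8859-1').decode("gbk") step works because the garbling is reversible: the page's bytes are presumably GBK, but the text was decoded as ISO-8859-1, which maps every byte to exactly one character. The sketch below reproduces the problem and the fix offline, with a made-up string in place of a real image title.

```python
original = "高清壁纸"  # hypothetical image title

# Simulate the problem: GBK bytes wrongly decoded as ISO-8859-1.
garbled = original.encode("gbk").decode("iso-8859-1")

# The fix: undo the wrong decoding to get the raw bytes back,
# then decode them with the right codec.
fixed = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(fixed)
```

An alternative that may be simpler, assuming the whole page is GBK, is to set the encoding attribute of the requests response to "gbk" before reading its text, so that every extracted string is decoded correctly at once.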
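Using the pipe operator (|) in xpath expressions
Purpose: make an xpath expression more general, so that a single expression can match nodes that sit at several different paths.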
from lxml import etree
import requests
import time
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
url = "https://www.aqistudy.cn/historydata/"
page_text=requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
all_cities = tree.xpath('/html/body/div[3]/div/div[1]/div[2]/div[2]/ul/div[2]/li/a/text() | /html/body/div[3]/div/div[1]/div[2]/div[2]/ul/li/a/text()')
print(all_cities)
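The same idea can be tried offline. In this sketch the city names live in two differently structured blocks, and one expression joined with | collects both. The markup and the id values are made up and are not the layout of the real page.

```python
from lxml import etree

# Hypothetical markup: two blocks that need different paths.
html = """
<html><body>
  <div id="hot"><ul><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul></div>
  <div id="all"><ul><div><li><a>Anshan</a></li><li><a>Baoding</a></li></div></ul></div>
</body></html>
"""

tree = etree.HTML(html)

# Each side of | is a complete xpath expression; the result is their union.
expr = '//div[@id="hot"]/ul/li/a/text() | //div[@id="all"]/ul/div/li/a/text()'
all_cities = tree.xpath(expr)
hot_only = tree.xpath('//div[@id="hot"]/ul/li/a/text()')

print(hot_only)
print(all_cities)
```

If one of the two paths matches nothing on a given page, the other still returns its results, which is what makes the combined expression more general.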