Scraping the hot jokes from Qiushibaike, saving the data locally, and using xpath

The overall approach is basically the same as in my previous crawler posts:

  • Build the url_list; the hot section of Qiushibaike defaults to 13 pages, so this part is simple
  • Loop over the URLs, send a request for each, and get the responses
  • Extract the data; here I use xpath via lxml, a third-party Python module (a minimal sketch of the pattern follows this list)
  • Save the data locally
  • Fields scraped: joke content, author gender, author age, author avatar URL, and the number of "funny" votes
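As a quick illustration of the lxml workflow used throughout ("group first, then query inside each group with a relative xpath"), here is a minimal, self-contained sketch; the HTML snippet and class names are made up for demonstration only:

from lxml import etree

html = etree.HTML('<div id="content-left">'
                  '<div><span class="c">hello</span></div>'
                  '<div><span class="c">world</span></div>'
                  '</div>')
# Grouping: one absolute xpath returns the per-post <div> elements
div_list = html.xpath('//div[@id="content-left"]/div')
for div in div_list:
    # Relative xpath ('.//') searches only inside the current group
    print(div.xpath('.//span[@class="c"]/text()'))  # ['hello'], then ['world']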

Data processing:

  • Strip all newlines from the joke content
  • Extracting the gender takes a little more work (see the sketch after this list)
  • Complete the avatar image URL
  • Check whether each field exists; substitute None when it does not
  • To dig deeper, open https://www.qiushibaike.com/text/ and inspect the traffic yourself
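The gender is not written out as text on the page; it is encoded in the CSS class of a small icon div (something like 'articleGender manIcon' or 'articleGender womenIcon', judging from the replace('Icon', '') in the code below), and the avatar src is protocol-relative. Here is a small sketch of the three cleanup tricks, with hypothetical raw values standing in for real xpath results:

# Hypothetical raw values, shaped like what the xpath calls return
gender_classes = ['articleGender womenIcon']                 # @class of the gender div
img_srcs = ['//pic.qiushibaike.com/system/avtnew/demo.jpg']  # protocol-relative src (demo path)

# Gender: take the last class token and drop the 'Icon' suffix -> 'women'
gender = gender_classes[0].split(' ')[-1].replace('Icon', '') if gender_classes else None

# Avatar: prepend 'https:' to complete the protocol-relative URL
img = 'https:' + img_srcs[0] if img_srcs else None

# Missing field: an empty xpath result falls back to None
ages = []
age = ages[0] if ages else None

print(gender, img, age)  # women https://pic.qiushibaike.com/system/avtnew/demo.jpg None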

Program code:

import requests
import json
from lxml import etree


class QiubaSpider(object):
    """Scrape the posts under Qiushibaike's hot text section."""

    def __init__(self):
        self.url_temp = 'https://www.qiushibaike.com/text/page/{}/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }

    def get_url_list(self):  # build the url_list
        return [self.url_temp.format(i) for i in range(1, 14)]

    def pass_url(self, url):  # send a request and return the decoded body
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, html_str):  # extract the data
        html = etree.HTML(html_str)
        div_list = html.xpath('//div[@id="content-left"]/div')  # group: one div per post
        content_list = []
        for div in div_list:
            item = {}
            # everything below cleans up the raw xpath results
            item['content'] = div.xpath('.//div[@class="content"]/span/text()')
            item['content'] = [i.replace('\n', '') for i in item['content']]
            # gender is encoded in a CSS class such as '... manIcon' / '... womenIcon'
            item['author_gender'] = div.xpath('.//div[contains(@class, "articleGend")]/@class')
            item['author_gender'] = item['author_gender'][0].split(' ')[-1].replace('Icon', '') if len(
                item['author_gender']) > 0 else None
            item['author_age'] = div.xpath('.//div[contains(@class, "articleGend")]/text()')
            item['author_age'] = item['author_age'][0] if len(item['author_age']) > 0 else None
            item['author_img'] = div.xpath('.//div[@class="author clearfix"]//img/@src')
            # the src is protocol-relative (starts with //), so 'https:' completes it
            item['author_img'] = 'https:' + item['author_img'][0] if len(item['author_img']) > 0 else None
            item['stats_vote'] = div.xpath('.//span[@class="stats-vote"]/i/text()')
            item['stats_vote'] = item['stats_vote'][0] if len(item['stats_vote']) > 0 else None
            content_list.append(item)
        return content_list

    def save_content_list(self, content_list):
        with open('qiuba.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content_list, ensure_ascii=False, indent=4))
            f.write('\n')  # newline between pages

    def run(self):  # main logic
        # 1. build the url_list; the hot section has 13 pages in total
        url_list = self.get_url_list()
        # 2. loop over the urls, send requests, get responses
        for url in url_list:
            html_str = self.pass_url(url)
            # 3. extract the data
            content_list = self.get_content_list(html_str)
            # 4. save the data
            self.save_content_list(content_list)


if __name__ == '__main__':
    qiubai = QiubaSpider()
    qiubai.run()
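One caveat about the output: save_content_list appends one pretty-printed JSON array per page, so qiuba.txt as a whole is not a single valid JSON document. If you want a file that is easy to re-read programmatically, one common alternative (my suggestion, not what the code above does) is JSON Lines, one record per line:

import json

def save_content_list_jsonl(content_list, path='qiuba.jsonl'):
    # One JSON object per line; the file can later be re-read record by record
    with open(path, 'a', encoding='utf-8') as f:
        for item in content_list:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')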