Building a simple web crawler with the requests library

Date: 2019-06-09

Author: Sun

We take the quotes site https://www.geyanw.com/ as our target and crawl its content using the requests HTTP library and the bs4 parsing library.

Project steps

  1. Create the project folder

    --geyanwang
       ---spiders  # our crawler code
          ---- geyan.py # the crawler itself
       ---doc   # step-by-step documentation
    
  2. Create a virtual environment

    cd   geyanwang/
    virtualenv spider  --python=python3  # create a Python 3 virtual environment named spider
    
  3. Install the dependencies

    $ source spider/bin/activate
    (spider) $ pip install requests
    (spider) $ pip install lxml
    (spider) $ pip install bs4
    
  4. Write the code spiders/geyan.py

# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 2:22 PM'

from bs4 import BeautifulSoup as BSP4

import requests

g_set = set()

def store_file(file_name, r):
	html_doc = r.text
	# write with an explicit encoding so non-ASCII pages don't crash on some platforms
	with open("geyan_%s.html" % file_name, "w", encoding="utf-8") as f:
		f.write(html_doc)

def download(url, filename='index'):
	'''
	:param url: address of the page to download
	:return: page content
	'''
	r = requests.get(url)   # send the request and fetch the page content

	store_file(filename, r)
	return r


def parse_tbox(tbox, base_domain):
	'''
	Parse one category block
	:param tbox:
	:param base_domain:
	:return:
	'''
	tbox_tag = tbox.select("dt a")[0].text
	print(tbox_tag)

	index = 0
	li_list = tbox.find_all("li")
	for li in li_list:
		link = base_domain + li.a['href']
		print("index:%s, link:%s" % (index, link))
		index += 1
		if link not in g_set:
			g_set.add(link)
			filename = "%s_%s" % (tbox_tag, index)
			download(link, filename)


def parse(response):
	'''
	Parse the page
	:param response: the response returned for the page
	:return:
	'''
	base_domain = response.url[:-1]
	g_set.add(base_domain)
	html_doc = response.content
	soup = BSP4(html_doc, "lxml")
	tbox_list = soup.select("#p_left   dl.tbox")  # one dl.tbox per category
	for tbox in tbox_list:
		parse_tbox(tbox, base_domain)



def main():
	base_url = "https://www.geyanw.com/"
	response = download(base_url)
	parse(response)


if __name__ == "__main__":
	main()
  5. Run the code above; it writes a batch of HTML files to the local directory
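One caveat with the code above: the category title `tbox_tag` goes straight into the file name, so a title containing a character such as `/` would make `open()` fail. A small sanitizer could guard against this; `safe_name` is a hypothetical helper, not part of the original script:

```python
import re

def safe_name(text):
    # Replace every run of characters that is not a word character
    # (letters, digits, underscore, including CJK) or a hyphen with '_'
    return re.sub(r"[^\w-]+", "_", text).strip("_")
```

It could then be applied when building the file name, e.g. `filename = "%s_%s" % (safe_name(tbox_tag), index)`.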

Homework

The geyan.py file above only handles the home page.

How would you crawl the content of each category, page by page, using multiple threads?

eg:

https://www.geyanw.com/lizhimingyan/

https://www.geyanw.com/renshenggeyan/

Save the crawled pages locally, using a separate folder per category.
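The multi-threaded part of the homework can be sketched with the standard library's ThreadPoolExecutor. This is a minimal sketch, not the author's solution: it only fetches each category's index page (real pagination would need the site's page-URL pattern, which is not shown here), and `category_slug` / `fetch_category` are hypothetical helper names:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

CATEGORY_URLS = [
    "https://www.geyanw.com/lizhimingyan/",
    "https://www.geyanw.com/renshenggeyan/",
]

def category_slug(url):
    # Last non-empty path segment, used as the folder name
    return urlparse(url).path.strip("/").split("/")[-1]

def fetch_category(url):
    # requests is third-party; imported lazily so the URL helper stays stdlib-only
    import requests
    slug = category_slug(url)
    os.makedirs(slug, exist_ok=True)          # one folder per category
    r = requests.get(url, timeout=10)
    with open(os.path.join(slug, "index.html"), "w", encoding="utf-8") as f:
        f.write(r.text)
    return slug

def main():
    # Each category is fetched in its own worker thread
    with ThreadPoolExecutor(max_workers=4) as pool:
        for slug in pool.map(fetch_category, CATEGORY_URLS):
            print("saved:", slug)
```

Calling `main()` downloads each category index concurrently into `lizhimingyan/` and `renshenggeyan/`.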

Original post: https://www.cnblogs.com/sunBinary/p/11055662.html