[Python3 Crawler] U20: Scraping 古诗文网 (gushiwen.org) with Regular Expressions


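This article uses regular expressions to scrape 古诗文网 (gushiwen.org). The fields collected for each poem are the title, dynasty, author, and content.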

1. Site analysis


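In the screenshot above, I have colour-coded each piece of information to be scraped together with the tag that holds it. The title is in the b tag under the div with class="cont". The dynasty and the author are both in a tags under the p with class="source". The poem text is in the div with class="contson". Knowing this, we can write regular expressions that match exactly the information we need.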
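To make the mapping from tags to patterns concrete, here is a minimal offline sketch that runs this kind of pattern against a hand-written HTML fragment. `SAMPLE_HTML` is an assumption modelled on the structure described above, not a copy of the live page, whose markup may differ.

```python
import re

# Hypothetical fragment modelled on the structure described above;
# the real page markup may differ.
SAMPLE_HTML = '''
<div class="cont">
  <p><a href="/shiwenv_1.aspx"><b>静夜思</b></a></p>
  <p class="source"><a href="/dynasty">唐代</a><span>:</span><a href="/author">李白</a></p>
  <div class="contson" id="contson1">
    床前明月光，疑是地上霜。<br />举头望明月，低头思故乡。
  </div>
</div>
'''

# re.DOTALL lets '.' cross line breaks; '.*?' is non-greedy, so each
# pattern stops at the first closing tag it can reach.
title = re.search(r'<div\sclass="cont">.*?<b>(.*?)</b>', SAMPLE_HTML, re.DOTALL).group(1)
dynasty = re.search(r'<p\sclass="source">.*?<a.*?>(.*?)</a>', SAMPLE_HTML, re.DOTALL).group(1)
author = re.search(r'<span>:</span>.*?<a.*?>(.*?)</a>', SAMPLE_HTML, re.DOTALL).group(1)
raw = re.search(r'<div\sclass="contson".*?>(.*?)</div>', SAMPLE_HTML, re.DOTALL).group(1)

# The captured body still contains inline tags such as <br />; strip them.
content = re.sub(r'<.*?>', '', raw).strip()

print(title, dynasty, author, content)
```

Note that `<div\sclass="cont">` does not accidentally match the `contson` div, because the pattern requires the closing quote and `>` immediately after `cont`.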

2. Scraping code

# Author:Logan
import requests
import re

HEADERS = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def parse_url(url):
    # Fetch one listing page and extract the fields with regular expressions.
    response = requests.get(url, headers=HEADERS)
    text = response.text

    # re.DOTALL lets '.' match newlines, since a tag pair may span several lines.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p\sclass="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<span>:</span>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    contents = re.findall(r'<div\sclass="contson".*?>(.*?)</div>', text, re.DOTALL)
    # Strip any remaining inline tags (e.g. <br />, <p>) from each poem body.
    poems = []
    for content in contents:
        x = re.sub('<.*?>', "", content).strip()
        poems.append(x)

    result = []
    for value in zip(titles, dynasties, authors, poems):
        title, dynasty, author, poem = value
        ret = {
            "title": title,
            "dynasty": dynasty,
            "author": author,
            "poem": poem
        }
        result.append(ret)

    for gsw in result:
        print(gsw)
        print("=" * 30)




def main():
    base_url = 'https://so.gushiwen.org/shiwen/default_2A9cb3b7c0e4a0A{}.aspx'
    for i in range(1,12):
        url = base_url.format(i)
        print(url)
        parse_url(url)


if __name__ == '__main__':
    main()
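
One caveat with the script above: zip stops at the shortest of its inputs. If any poem on a page lacks one field, for example an author link, the four lists fall out of step and later poems may be dropped or paired with the wrong values. A possible alternative, sketched below, is to cut the page into one block per poem first and then search inside each block. The block pattern, the helper names `first_group` and `parse_blocks`, and the sample HTML are all assumptions for illustration and would need checking against the live markup.

```python
import re

# Hypothetical markup: two poems, the second one without an author link.
SAMPLE_HTML = '''
<div class="cont"><b>Poem A</b>
<p class="source"><a href="#">Tang</a><span>:</span><a href="#">Li Bai</a></p>
<div class="contson" id="c1">line one, <br />line two.</div></div>
<div class="cont"><b>Poem B</b>
<p class="source"><a href="#">Song</a></p>
<div class="contson" id="c2">only line.</div></div>
'''


def first_group(pattern, text):
    """Return the first captured group of pattern in text, or None."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_blocks(html):
    """Extract one record per poem block; missing fields become None."""
    # Each block runs from one '<div class="cont">' up to the next one
    # (or to the end of the input).
    blocks = re.findall(
        r'<div\sclass="cont">.*?(?=<div\sclass="cont">|\Z)', html, re.DOTALL
    )
    records = []
    for block in blocks:
        raw = first_group(r'<div\sclass="contson".*?>(.*?)</div>', block)
        records.append({
            "title": first_group(r'<b>(.*?)</b>', block),
            "dynasty": first_group(r'<p\sclass="source">.*?<a.*?>(.*?)</a>', block),
            "author": first_group(r'<span>:</span>.*?<a.*?>(.*?)</a>', block),
            "poem": re.sub(r'<.*?>', '', raw).strip() if raw is not None else None,
        })
    return records


# Page-wide lists, as in the script above: zip keeps only one pair here.
titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', SAMPLE_HTML, re.DOTALL)
authors = re.findall(r'<span>:</span>.*?<a.*?>(.*?)</a>', SAMPLE_HTML, re.DOTALL)
zipped = list(zip(titles, authors))

records = parse_blocks(SAMPLE_HTML)
for record in records:
    print(record)
```

With per-block parsing, the second poem is still reported, with its author set to None, instead of being silently dropped.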

Screenshot of the scraped output:

Original post: https://www.cnblogs.com/OliverQin/p/12626836.html