[Learning Python] 01 - My First Novel Scraper

Back when I was building my site, I wrote an image-scraping API in C# and it took a big pile of code. Recently I watched a friend write a crawler and was struck by how little code it needed, so I spent some time learning Python and put together the simplest possible novel scraper. There are no advanced features and no multithreading, just a very basic crawler. Since I had never studied Python before and jumped straight in with the 菜鸟教程 (Runoob) tutorials, I did run into a few small problems along the way, so this post is a quick record of the process.

Analysis Stage

First, pick a victim... er, a target. Drawing on my years of experience reading pirated novels, the choice came quickly: 某趣阁 (the xquge site).

Pick any book and click into it and you land on its table of contents. The site layout is actually pretty tidy: the catalog is split into two parts, recent updates and the full catalog, and the full catalog isn't even paginated, everything is rendered on a single page, so there is no need to crawl through pages. Almost no challenge here.

After the catalog, click through to a chapter. Inspecting with F12 shows that every sentence sits in its own <p> tag. Well... okay, this really is on the easy side.

With the catalog and content pages analyzed, it's time to get to work: the data to grab is clear, the only question is how. Since the tool is Python, the basic syntax comes first: no braces, no semicolons, blocks are delimited by indentation... fine! A couple of lines of practice and it starts to feel natural, and honestly, once you get used to it, it really is quite clean.
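Just to illustrate the point (a throwaway snippet, not part of the scraper): indentation alone marks where each block starts and ends.

    # indentation, not braces, delimits the loop body and the if/else branches
    for i in range(3):
        if i % 2 == 0:
            print(i, 'is even')
        else:
            print(i, 'is odd')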

Thinking through the workflow, it breaks down into roughly six steps:

  1. Find a book and get its catalog URL
  2. Fake the request headers so we don't get banned outright (in practice this site doesn't seem to check them, or maybe I just wasn't hitting it hard enough)
  3. Scrape the chapter links and titles from the catalog page
  4. Loop over the catalog and scrape each chapter's content
  5. Join each chapter's content and save it to a file
  6. Done, start writing this post

Coding Stage

  1. Pick a book: https://www.xquge.com/book/1771.html

  2. Fake the request headers

    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36','Referer':'https://www.xquge.com'}
    
  3. Fetch the catalog page

    catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
    

    The decode() call is there because .content returns raw bytes, which look like gibberish when printed, so they need to be decoded into a proper string (response.text would also work). At this point I ran into a problem:

    Traceback (most recent call last):
      File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
        self._prepare_proxy(conn)
      File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
        conn.connect()
      File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 359, in connect
        conn = self._connect_tls_proxy(hostname, conn)
      File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 496, in _connect_tls_proxy
        return ssl_wrap_socket(
      File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 432, in ssl_wrap_socket
        ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
      File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 474, in _ssl_wrap_socket_impl
        return ssl_context.wrap_socket(sock)
      File "D:\Program Files\Python39\lib\ssl.py", line 500, in wrap_socket
        return self.sslsocket_class._create(
      File "D:\Program Files\Python39\lib\ssl.py", line 1040, in _create
        self.do_handshake()
      File "D:\Program Files\Python39\lib\ssl.py", line 1309, in do_handshake
        self._sslobj.do_handshake()
    FileNotFoundError: [Errno 2] No such file or directory

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 439, in send
        resp = conn.urlopen(
      File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
        retries = retries.increment(
      File "D:\Program Files\Python39\lib\site-packages\urllib3\util\retry.py", line 573, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "f:\NewOneDrive\OneDrive\Python\BookReptile.py", line 20, in <module>
        catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
      File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 510, in send
        raise ProxyError(e, request=request)
    requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))
    

    Seeing the word Proxy all over the error pretty much gives the cause away: after I shut down my VPN/proxy client, the problem disappeared. But right after that a new issue showed up. Not an error this time, just a warning:

    D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.xquge.com'. Adding certificate verification is strongly advised.
    

    A bit of searching shows this warning appears because the request goes over HTTPS with verification turned off. We are not supplying any certificate and don't need one here, so it's fine to just silence the warning with a couple of extra lines:

    import urllib3
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
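    As a side note on the earlier ProxyError: instead of shutting down the proxy client, requests can also be told to ignore the system/environment proxy settings. This is only a sketch of that alternative, not something the original script does:

    import requests

    session = requests.Session()
    session.trust_env = False  # ignore HTTP(S)_PROXY environment variables and other environment settings
    catalog = session.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False).content.decode()

    Whether this helps depends on how the proxy is set up; a system-wide client may still intercept the traffic, in which case simply closing it remains the fix.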
    

    With those tweaks the content comes back without any errors. Once the HTML is in hand, it gets parsed into an XPath-queryable object, which makes picking out nodes very convenient; the browser dev tools can even copy the XPath of any node for you. Then grab all the chapter nodes from the full catalog:

    from lxml import etree
    # parse the HTML into an lxml tree so it can be queried with XPath
    html=etree.HTML(catalog)
    # grab every chapter <a> node from the full catalog list
    chapters=html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')
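    By the way, the absolute path above was copied straight from the browser dev tools; it works, but it breaks the moment the page layout changes. A query anchored on an id or class is usually more tolerant. The selector below is only a hypothetical illustration, since the real attribute names on the catalog page would need to be checked first:

    # hypothetical: select the chapter links via a container id instead of an absolute path
    chapters = html.xpath('//div[@id="catalog"]//ul/li/a')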
    

    Printing chapters outputs the list below. It's not exactly human-readable, but that's fine; at least it proves something was fetched.

    [<Element a at 0x26c538cca40>, <Element a at 0x26c5387dd00>, <Element a at 0x26c539c1280>, <Element a at 0x26c539c19c0>, <Element a at 0x26c539c1a00>, <Element a at 0x26c539c1440>, <Element a at 0x26c539c1b40>, <Element a at 0x26c539c18c0>, <Element a at 0x26c539c17c0>, <Element a at 0x26c539c1800>, <Element a at 0x26c539c1780>, <Element a at 0x26c539c1740>, <Element a at 0x26c539c1640>, <Element a at 0x26c539c16c0>, <Element a at 0x26c539c1dc0>, <Element a at 0x26c539c1e40>, <Element a at 0x26c539e1e80>, <Element a at 0x26c539e1e40>, <Element a at 0x26c539ccec0>, <Element a at 0x26c539ccdc0>, <Element a at 0x26c539ccf40>, <Element a at 0x26c539ccf80>, <Element a at 0x26c52dd10c0>, <Element a at 0x26c539f5b40>, <Element a at 0x26c539f5d80>, <Element a at 0x26c539f5cc0>, <Element a at 0x26c539f5e00>]
    

    With the chapter nodes in hand, it's time to process each chapter page. I prefer wrapping this in a function, and Python's function definitions turn out to be very simple too, so here is the chapter handler:

    amount=len(chapters)
    nowIndex=0
    # handle one chapter page: fetch it, extract the text and save it to a file
    def processingChapter(url,title):
        content=requests.get(url,headers=headers, verify=False).content.decode()
        html=etree.HTML(content) # parse into an lxml tree for XPath
        lines=html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()') # collect the chapter's paragraphs
        finalStr='\n'.join(lines) # join the paragraphs into one string, one per line
        fileName='files/'+title+'.txt' # build the file name
        fileWriter=codecs.open(fileName,'w','utf-8') # open the file for writing
        fileWriter.writelines(finalStr) # write the text
        fileWriter.flush()
        fileWriter.close()
        global nowIndex # assigning to a module-level variable inside a function requires global, otherwise Python raises UnboundLocalError
        nowIndex+=1
        print(fileName+' 已保存'+str(nowIndex)+'/'+str(amount))
        pass
    

    The code saves each chapter as a separate file. This example was written to learn the basics rather than to actually rip the book, so I didn't bother merging everything into one file.
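    If you did want the whole book in a single file, a minimal variation (hypothetical, not part of the original script) would be to append each chapter to the same file instead:

    # hypothetical variant: append every chapter to one book file instead of one file per chapter
    def saveChapterToBook(title, text, bookFile='files/book.txt'):
        with open(bookFile, 'a', encoding='utf-8') as f:
            f.write(title + '\n' + text + '\n\n')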

    With the chapter handler done, all that's left is to walk the catalog and call it for each entry, so here is one more function:

    # walk the catalog nodes and request each chapter
    def processingDirectory(chapters):
        for chapter in chapters:
            url=chapter.xpath('./@href')[0] # chapter link
            title=chapter.xpath('./text()')[0] # chapter title
            processingChapter(url,title) # fetch and save this chapter
            time.sleep(0.8) # requesting too fast risks a ban and sometimes skips chapters
        pass
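    If a chapter request does fail now and then (the skipped-chapter symptom mentioned above), a simple retry with a small back-off is an easy hardening step. Again, this is only a sketch and not something the original script includes:

    # hypothetical: retry a flaky request a few times before giving up
    def getWithRetry(url, attempts=3):
        for i in range(attempts):
            try:
                return requests.get(url, headers=headers, verify=False, timeout=10)
            except requests.RequestException:
                time.sleep(2 * (i + 1))  # back off a little longer after each failure
        raise RuntimeError('giving up on ' + url)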
    

    Finally, just call it:

    # kick things off
    processingDirectory(chapters)
    
    

Execution Results

files/001 划重点?.txt 已保存1/1304
files/002 既治病,也要命!.txt 已保存2/1304
files/003 我都要.txt 已保存3/1304
files/004 我祝福你.txt 已保存4/1304
files/005 我再祝福你.txt 已保存5/1304
files/006 一年有三百六十五个日出.txt 已保存6/1304
files/007 减肥.txt 已保存7/1304
files/008 你会恨我的.txt 已保存8/1304
files/009 皮.txt 已保存9/1304
files/010 闯祸?.txt 已保存10/1304
files/011 出发!皮卡皮!.txt 已保存11/1304

And that's it: the simplest possible little crawler is done. There is nothing technically difficult here; I just had never touched Python before, so this was purely for learning. My understanding of some parts of the program may well be off, so corrections are very welcome.

Final Source Code

# a beginner's first crawler
# scrapes a novel from 笔趣阁 (xquge)
# 0. fake a request header first
# 1. take a book's catalog URL: https://www.xquge.com/book/1771.html
# 2. scrape the chapter links and titles from the catalog
# 3. walk the catalog and request each chapter's content
# 4. process each chapter and write its content to a file
# 5. done, time to write this post

import requests
import urllib3
from lxml import etree
import time
import codecs
# silence the HTTPS certificate warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# fake a request header to avoid being banned outright (this site doesn't actually seem to block requests without one)
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36','Referer':'https://www.xquge.com'}
# fetch the catalog page
catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
# parse into an lxml tree so it can be queried with XPath
html=etree.HTML(catalog)
# grab all chapter nodes from the catalog
chapters=html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')
amount=len(chapters)
nowIndex=0
# handle one chapter page: fetch it, extract the text and save it to a file
def processingChapter(url,title):
    content=requests.get(url,headers=headers, verify=False).content.decode()
    html=etree.HTML(content) # parse into an lxml tree for XPath
    lines=html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()') # collect the chapter's paragraphs
    finalStr='\n'.join(lines) # join the paragraphs into one string, one per line
    fileName='files/'+title+'.txt' # build the file name
    fileWriter=codecs.open(fileName,'w','utf-8') # open the file for writing
    fileWriter.writelines(finalStr) # write the text
    fileWriter.flush()
    fileWriter.close()
    global nowIndex # assigning to a module-level variable inside a function requires global, otherwise Python raises UnboundLocalError
    nowIndex+=1
    print(fileName+' 已保存'+str(nowIndex)+'/'+str(amount))
    pass
# walk the catalog nodes and request each chapter
def processingDirectory(chapters):
    for chapter in chapters:
        url=chapter.xpath('./@href')[0] # chapter link
        title=chapter.xpath('./text()')[0] # chapter title
        processingChapter(url,title) # fetch and save this chapter
        time.sleep(0.8) # requesting too fast risks a ban and sometimes skips chapters
    pass
# kick things off
processingDirectory(chapters)

Original post: https://www.cnblogs.com/LiuDanK/p/14055600.html