Using MongoDB as storage for a Scrapy novel crawler

I. Background: while learning MongoDB, I decided to convert my Scrapy novel crawler, which originally used MySQL for storage, to use MongoDB instead.

II. Process:

1. Install MongoDB

(1) Configure the yum repo

(python) [root@DL ~]# vi /etc/yum.repos.d/mongodb-org-4.0.repo

[mongodb-org]
name=MongoDB Repository
baseurl=http://mirrors.aliyun.com/mongodb/yum/redhat/7Server/mongodb-org/4.0/x86_64/
gpgcheck=0
enabled=1

(2) Install with yum

(python) [root@DL ~]# yum -y install mongodb-org

(3) Start the mongod service

(python) [root@DL ~]# systemctl start mongod

(4) Enter the MongoDB shell

(python) [root@DL ~]# mongo
MongoDB shell version v4.0.20

...

To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---
>
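
The shell can also be used later, once the spider has run, to inspect what was stored; a quick check (the database name novels and the collection name sancun come from the code further below) might look like:

> show dbs
> use novels
> show collections
> db.sancun.count()      // number of chapter documents
> db.sancun.findOne()    // look at one stored chapter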

(5) Install the pymongo module

(python) [root@DL ~]# pip install pymongo
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pymongo
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/d0/819074b92295149e1c677836d72def88f90814d1efa02199370d8a70f7af/pymongo-3.11.0-cp38-cp38-manylinux2014_x86_64.whl (530kB)
     |████████████████████████████████| 532kB 833kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.11.0
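
Before wiring MongoDB into the Scrapy project, it is worth confirming that pymongo can reach the mongod service. A minimal check (not part of the original post), assuming the default localhost:27017:

from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
print(conn.server_info()['version'])    # server version, should be 4.0.x here
print(conn.list_database_names())       # a fresh install shows admin, config and local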

2. Modify pipelines.py

(python) [root@localhost xbiquge_w]# vi xbiquge/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import time
from pymongo import MongoClient

class XbiqugePipeline(object):
    conn = MongoClient('localhost', 27017)
    db = conn.novels    # handle for the "novels" database

    # empty the novel's collection so a fresh crawl does not duplicate chapters
    def clearcollection(self, name_collection):
        myset = self.db[name_collection]
        myset.delete_many({})

    def process_item(self, item, spider):
        self.name_novel = item['name']
        self.url_firstchapter = item['url_firstchapter']
        self.name_txt = item['name_txt']
        self.db[self.name_novel].insert_one(dict(item))    # one document per chapter
        return item

    # read the chapter contents back from the database and write them to a txt file
    def content2txt(self, dbname, firsturl, txtname):
        myset = self.db[dbname]
        record_num = myset.count_documents({})    # number of chapters stored
        print(record_num)
        counts = record_num
        url_c = firsturl
        start_time = time.time()    # start time of the text-extraction run
        f = open(txtname + ".txt", mode='w', encoding='utf-8')    # "<novel name>.txt", opened for writing
        for i in range(counts):    # one iteration per chapter
            record_m = myset.find({"url": url_c}, {"content": 1, "by": 1, "_id": 0})
            record_content_c2a0 = ''
            for item_content in record_m:
                record_content_c2a0 = item_content["content"]    # chapter content
            #record_content = record_content_c2a0.replace(u'\xa0', u'')    # strip the special character \xc2\xa0
            record_content = record_content_c2a0
            #print(record_content)
            f.write('\n')
            f.write(record_content + '\n')
            f.write('\n\n')
            url_ct = myset.find({"url": url_c}, {"next_page": 1, "by": 1, "_id": 0})    # query for the next-chapter link
            for item_url in url_ct:
                url_c = item_url["next_page"]    # the next-chapter URL becomes the lookup key for the next iteration
        f.close()
        print(time.time() - start_time)
        print(txtname + ".txt has been generated!")
        return

    # when the spider closes, call content2txt to generate the txt file
    def close_spider(self, spider):
        self.content2txt(self.name_novel, self.url_firstchapter, self.name_txt)
        return
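
As the boilerplate comment at the top of the file notes, the pipeline only runs if it is enabled in the project's settings.py. That file is not shown in the original post, but with the default Scrapy project layout the entry would look like:

# xbiquge/settings.py
ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}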

3. Modify the spider program

(python) [root@localhost xbiquge_w]# vi xbiquge/spiders/sancun.py

# -*- coding: utf-8 -*-
import scrapy
from xbiquge.items import XbiqugeItem
from xbiquge.pipelines import XbiqugePipeline

class SancunSpider(scrapy.Spider):
    name = 'sancun'
    allowed_domains = ['www.xbiquge.la']
    #start_urls = ['http://www.xbiquge.la/10/10489/']
    url_ori = "http://www.xbiquge.la"
    url_firstchapter = "http://www.xbiquge.la/10/10489/4534454.html"
    name_txt = "./novels/三寸人间"

    pipeline = XbiqugePipeline()
    pipeline.clearcollection(name)    # empty the novel's collection; a MongoDB collection is the counterpart of a MySQL table
    item = XbiqugeItem()
    item['id'] = 0    # extra id field to make querying easier
    item['name'] = name
    item['url_firstchapter'] = url_firstchapter
    item['name_txt'] = name_txt

    def start_requests(self):
        start_urls = ['http://www.xbiquge.la/10/10489/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dl = response.css('#list dl dd')    # chapter-link entries
        for dd in dl:
            self.url_c = self.url_ori + dd.css('a::attr(href)').extract()[0]    # build the full URL of each chapter
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
            yield scrapy.Request(self.url_c, callback=self.parse_c)    # parse_c extracts each chapter's URL, previous/next links and content
            #print(self.url_c)

    def parse_c(self, response):
        #item = XbiqugeItem()
        #item['name'] = self.name
        #item['url_firstchapter'] = self.url_firstchapter
        #item['name_txt'] = self.name_txt
        self.item['id'] += 1
        self.item['url'] = response.url
        self.item['preview_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[1]
        self.item['next_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[3]
        title = response.css('.con_top::text').extract()[4]
        contents = response.css('#content::text').extract()
        text = ''
        for content in contents:
            text = text + content
        #print(text)
        self.item['content'] = title + "\n" + text.replace('\015', '\n')    # join the chapter title and body; '\015' (octal escape for ^M, i.e. carriage return) is replaced with a newline
        yield self.item    # yield the item to the pipeline
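
Each chapter that parse_c yields reaches the pipeline as a plain item whose shape is roughly the following (the field values are illustrative placeholders, not real crawl output); process_item inserts it unchanged into the novels.sancun collection, where MongoDB adds an automatic _id:

{
    'id': 1,
    'name': 'sancun',
    'url_firstchapter': 'http://www.xbiquge.la/10/10489/4534454.html',
    'name_txt': './novels/三寸人间',
    'url': 'http://www.xbiquge.la/10/10489/...',
    'preview_page': 'http://www.xbiquge.la/10/10489/...',
    'next_page': 'http://www.xbiquge.la/10/10489/...',
    'content': '...'
}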

4. Modify items.py

(python) [root@DL xbiquge_w]# vi xbiquge/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    url_firstchapter = scrapy.Field()
    name_txt = scrapy.Field()
    url = scrapy.Field()
    preview_page = scrapy.Field()
    next_page = scrapy.Field()
    content = scrapy.Field()
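
With the pipeline, spider and item definition in place, the crawl is started from the project directory in the usual Scrapy way; since name_txt points at ./novels/, that directory has to exist before close_spider writes the txt file:

mkdir -p novels
scrapy crawl sancun

When the spider closes, content2txt walks the stored chapters via their next_page links and produces ./novels/三寸人间.txt.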

III. Summary

Compared with MySQL, using MongoDB as crawler storage is noticeably simpler: there is no table schema to create or keep in sync with the item, and each scraped item is inserted directly as a document.
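
Schematically, the whole storage step reduces to a single call in the pipeline, whereas the MySQL version needs a table definition plus a column-by-column INSERT (the SQL line below is illustrative, not the original MySQL code):

# MongoDB: the item dict becomes the document as-is
self.db[self.name_novel].insert_one(dict(item))

# MySQL (schematic): a table must exist and every field is listed explicitly
# cursor.execute("INSERT INTO sancun (id, url, next_page, content) VALUES (%s, %s, %s, %s)",
#                (item['id'], item['url'], item['next_page'], item['content']))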

Original article: https://www.cnblogs.com/sfccl/p/13827422.html