Scrapy Framework, Part 3: Persistent Storage

Persistent storage in Scrapy

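- Terminal-based storage

  - Can only store the return value of the parse method in a local file
  - Only these export formats are supported: json, jsonlines, jl, csv, xml, marshal, pickle
  - Advantages: concise and efficient
  - Disadvantage: quite limited (data can only be written to a file with one of the supported extensions)
  - Command: scrapy crawl <spider name> -o <output path>
  - Example:
  scrapy crawl name -o ./filePath.csv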
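
As a rough illustration of what the -o export does with the value returned by parse, the sketch below writes a list of dicts to CSV text using only Python's standard csv module. It is not Scrapy's exporter code; the sample records and the rows_to_csv helper are made up for this example.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, one record per line."""
    if not rows:
        return ''
    buffer = io.StringIO()
    # Column names come from the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical return value of parse: one dict per scraped post.
all_data = [
    {'author': 'alice', 'content': 'first post'},
    {'author': 'bob', 'content': 'second post'},
]

csv_text = rows_to_csv(all_data)
print(csv_text)
```

Each dict becomes one row and its keys become the column headers, which is why parse has to return (or yield) structured records for this kind of export to be useful.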

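- Pipeline-based storage

  - Workflow:
    1. Parse the data
    2. Define the corresponding fields in the item class
    3. Wrap the parsed data in an item object
    4. Submit the item object to the pipeline for persistent storage
    5. In the pipeline class's process_item method, persist the data stored in the item object it receives
    6. Enable the pipeline in the settings file
  - Advantage:
    - Highly general-purpose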

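Interview question: how do you store one copy of the scraped data locally and another copy in a database?

  - Each pipeline class in the pipelines file stores the data to one kind of destination
  - An item submitted by the spider is received only by the first pipeline class to run
  - return item in process_item passes the item on to the next pipeline class to run
  - So: define one pipeline class per destination, enable both in ITEM_PIPELINES, and return item from each process_item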
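
The hand-off between pipeline classes can be modelled in a few lines of plain Python. This is a simplified sketch, not Scrapy's actual implementation: FilePipeline, DatabasePipeline and run_pipelines are hypothetical names, and the "storage" is just a shared list so the call order is visible.

```python
class FilePipeline:
    """Stand-in for a pipeline that writes items to a local file."""

    def __init__(self, log):
        self.log = log

    def process_item(self, item, spider):
        self.log.append(('file', item['author']))
        return item  # hand the item to the next pipeline


class DatabasePipeline:
    """Stand-in for a pipeline that writes items to a database."""

    def __init__(self, log):
        self.log = log

    def process_item(self, item, spider):
        self.log.append(('database', item['author']))
        return item  # hand the item to the next pipeline


def run_pipelines(item, pipelines, spider=None):
    """Pass one item through the pipelines, lowest priority number first."""
    for _priority, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        # Each pipeline receives whatever the previous one returned.
        item = pipeline.process_item(item, spider)
    return item


log = []
pipelines = [(300, FilePipeline(log)), (200, DatabasePipeline(log))]
result = run_pipelines({'author': 'alice', 'content': 'hello'}, pipelines)
print(log)
```

In this model the pipeline registered with 200 sees the item before the one registered with 300, and the second pipeline only gets a usable item because the first one returned it.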

Qiushibaike example: persistent storage

first.py in the spiders folder

# -*- coding: utf-8 -*-
import scrapy
from TestOne.items import TestoneItem


class FirstSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider source file
    name = 'first'
    # Allowed domains: restricts which of the start_urls may actually be requested
    # allowed_domains = ['www.baidu.com','https://www.sogou.com/',]
    # Start URL list: Scrapy automatically sends requests to the URLs in this list
    start_urls = ['https://www.qiushibaike.com/text/',]

    # Parses the data: response is the response object of a successful request
    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list whose elements are always Selector objects
            # extract pulls out the string stored in a Selector object's data attribute
            # First way
            author = div.xpath('./div[1]/a[2]/h2//text()')[0].extract()
            # Second way
            # author = div.xpath('./div[1]/a[2]/h2//text()').extract_first()

            # Calling extract on the list extracts the data string of every Selector in it
            content = div.xpath('./a/div/span//text()').extract()

            # Clean up the formatting
            content = ' '.join(content).replace('\n', '')
            item = TestoneItem()
            item['author'] = author
            item['content'] = content
            yield item  # Submit the item to the pipeline

items.py file

# Item definitions
import scrapy


class TestoneItem(scrapy.Item):
    author=scrapy.Field()
    content=scrapy.Field()

pipelines.py file

import pymysql


# Remember to create the table in the database beforehand
class TestonePipeline(object):
    fp = None

    # Hook called by Scrapy only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started....')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # Dedicated to handling item objects:
    # receives each item submitted by the spider file,
    # and is called once per item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished....')
        self.fp.close()


class mysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Start writing to the database')
        # host: database IP address, port: port, user: account, password: password, db: database name
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='root', db='qiubai')
        # Create the cursor once here, so close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Parameterized query: the driver escapes quotes inside the values
            self.cursor.execute('insert into qiubai values (%s, %s)', (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Database finished!!!!!!')
        self.cursor.close()
        self.conn.close()
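
The database pipeline expects the qiubai table to exist before the crawl starts. The sketch below shows one possible shape for that table and a placeholder-based insert. Note the swap: it uses the standard-library sqlite3 module with an in-memory database instead of MySQL so it can run anywhere, and the two TEXT columns are an assumption based on the item fields, not a schema given in these notes.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Assumed schema: two text columns matching the item fields.
cursor.execute('CREATE TABLE qiubai (author TEXT, content TEXT)')

# Sample item whose content contains double quotes.
item = {'author': 'alice', 'content': 'He said "hi" and left'}

# Placeholders let the driver handle quoting; sqlite3 uses ?, pymysql uses %s.
cursor.execute('INSERT INTO qiubai VALUES (?, ?)', (item['author'], item['content']))
conn.commit()

rows = cursor.execute('SELECT author, content FROM qiubai').fetchall()
print(rows)
conn.close()
```

Passing the values separately from the SQL text matters for scraped content, which often contains quotes that would break a statement assembled with string formatting.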

settings.py configuration file

BOT_NAME = 'TestOne'

SPIDER_MODULES = ['TestOne.spiders']
NEWSPIDER_MODULE = 'TestOne.spiders'

ITEM_PIPELINES = {
   # The number is the priority: the lower the value, the higher the priority (runs earlier)
   'TestOne.pipelines.TestonePipeline': 300,
   'TestOne.pipelines.mysqlPileLine': 200,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# User-Agent spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'

# Obey robots.txt rules
# Whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False
# Only show log messages of the specified level
LOG_LEVEL = "ERROR"
