Scrapy Framework 1: Basic Usage

I. Setup and Writing the Spider

Open cmd and change to the directory where the project should live.

1. Create a project: `scrapy startproject projectname`

d:\爬虫\11.scrapy>scrapy startproject testproject

2. Generate a spider template: `scrapy genspider testspider www.xxx.com`

d:\爬虫\11.scrapy\testproject>scrapy genspider testspider www.xxx.com

3. Configuration

3.1. Open testspider.py

# -*- coding: utf-8 -*-
import scrapy


class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    # allowed_domains = ['www.xxx.com']  # Only URLs under these domains are crawled, so this is usually left commented out
    start_urls = ['https://xueqiu.com/']  # List of start URLs

    def parse(self, response):  # Callback that parses or stores data; response is the response object for the request
        title = response.xpath('//*[@id="app"]/div[3]/div[1]/div[2]/div[2]/div[1]/div/h3/a/text()').extract()
        author = response.xpath('//*[@id="app"]/div[3]/div[1]/div[2]/div[2]/div[1]/div/div/div[1]/a[2]/text()').extract()
        print(title)
        return dict(zip(author, title))

# The xpath() method returns a list of Selector objects. The parsed content is wrapped inside those Selector objects, so call extract() to pull it out as plain strings.
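
The `dict(zip(author, title))` return value in `parse` pairs the two extracted lists by position. A minimal sketch with made-up values (the names and titles below are placeholders, not real scraped data) suggests two things worth keeping in mind: `zip` stops at the shorter list, and a repeated author overwrites the earlier entry because dict keys are unique.

```python
# Hypothetical extracted values; real ones come from response.xpath(...).extract()
authors = ['alice', 'bob', 'alice']
titles = ['first post', 'second post', 'third post', 'unpaired post']

# zip pairs items by position and stops at the shorter list,
# so 'unpaired post' is silently dropped.
pairs = list(zip(authors, titles))

# A dict keeps one value per key, so the second 'alice' replaces the first.
by_author = dict(pairs)

print(pairs)
print(by_author)
```

If every title has to be kept, yielding one dict per pair (for example `{'author': a, 'title': t}`) avoids the key collision.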

3.2. Open settings.py

# Change the User-Agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36'

ROBOTSTXT_OBEY = False  # Changed from True to False; with True, Scrapy obeys robots.txt

4. Run from cmd

scrapy crawl testspider: runs the spider and shows the log output
scrapy crawl testspider --nolog: runs the spider without showing the log output

d:\爬虫\11.scrapy\testproject>scrapy crawl testspider --nolog
['一家真正懂金融的金融科技公司，成功转型助贷业务….', '大江奔涌，日夜前行，看好中国，做多中国股市！', '最好的时代！最好的地方！', '在A股年化收益达到100%后对股市的一些思考', '大国制造从芯片 说起', '【悬赏】拼多多又双叒叕新高，他能站稳电商第二极吗？', '童装霸主巴拉巴拉的爸爸：森马服饰解析（上）', '不了解股票的强相关，你将永远陷入股票投机的怪圈', '充电5分钟、续航150公里 宁德时代推出动力电池新技术', '选择困难，重庆农商行发行询价该报多少？']

II. Persistent Storage

1. Persistent storage via the command line

Make sure the parse method returns an iterable object.

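 scrapy crawl <spider name> -o xxx.json
 Supported file formats: json, jsonlines, csv, xml, pickle, marshal
 When saving as JSON, non-ASCII text such as Chinese is written as \uXXXX escape sequences by default rather than as readable characters; setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps it readable.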

D:\爬虫\11.scrapy\testproject\testproject>scrapy crawl testspider -o title.json --nolog
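
The note above about how JSON output looks most likely refers to non-ASCII text being written as `\uXXXX` escapes. The sketch below uses the standard `json` module, not Scrapy's exporter itself, to show the two forms side by side. The item values are placeholders. In Scrapy, the `FEED_EXPORT_ENCODING = 'utf-8'` setting is the documented way to get the readable form in the exported file.

```python
import json

# Hypothetical item; the values are placeholders, not real scraped data
item = {'author': 'someone', 'title': '雪球'}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)

# ensure_ascii=False keeps the original characters
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data; only the on-disk representation differs.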

2. Persistent storage via pipelines

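Workflow:

Parse the data in the spider file
Declare fields in the item class to hold the parsed data
Wrap the parsed data in an instance of the item class
Submit the item object to the pipeline
The item object is passed to process_item in the pipeline class as an argument
Write the item-based persistence logic in the process_item method
Enable the pipeline in settings.py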
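
The workflow above can be sketched in plain Python before looking at the real files. This is only a simulation under stated assumptions: `CollectPipeline`, `UppercaseTitlePipeline`, `run`, and `pipeline_order` are hypothetical names, and the driver loop is a simplified stand-in for what the Scrapy engine does, not its actual implementation.

```python
# A plain-Python simulation of the flow; none of these classes are Scrapy's own.

class CollectPipeline:
    """Hypothetical pipeline that stores items in a list instead of a file."""

    def open_spider(self, spider):
        self.items = []          # runs once, before any item arrives

    def process_item(self, item, spider):
        self.items.append(item)
        return item              # returning the item passes it to the next pipeline

    def close_spider(self, spider):
        self.closed = True       # runs once, after the last item


class UppercaseTitlePipeline:
    """Hypothetical pipeline that modifies the item before passing it on."""

    def process_item(self, item, spider):
        item['title'] = item['title'].upper()
        return item


def run(parsed_items, pipeline_order):
    """Drive the pipelines in ascending order of their number."""
    ordered = sorted(pipeline_order.items(), key=lambda kv: kv[1])
    pipelines = [cls() for cls, _ in ordered]
    spider = object()  # placeholder for the spider argument
    for p in pipelines:
        if hasattr(p, 'open_spider'):
            p.open_spider(spider)
    for item in parsed_items:
        for p in pipelines:
            item = p.process_item(item, spider)
    for p in pipelines:
        if hasattr(p, 'close_spider'):
            p.close_spider(spider)
    return pipelines


# Lower number runs first, mirroring the ITEM_PIPELINES setting
pipeline_order = {UppercaseTitlePipeline: 300, CollectPipeline: 301}
pipelines = run([{'author': 'a', 'title': 'hello'}], pipeline_order)
collector = pipelines[1]
print(collector.items)
```

The point of the sketch is the ordering: each item visits every enabled pipeline in turn, and whatever `process_item` returns is what the next pipeline receives.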

items.py: define a class, based on the scraped data, whose instances hold that data

import scrapy
class TestprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author  = scrapy.Field()
    title = scrapy.Field()
# ------------ Submit to MySQL -------------------

testspider.py: parse the data, wrap it in items, and submit them to the pipeline

# -*- coding: utf-8 -*-
import scrapy

from testproject.items import TestprojectItem
class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.xxx.com']
    start_urls = ['https://xueqiu.com/']

    def parse(self, response):
        title = response.xpath('//*[@id="app"]/div[3]/div[1]/div[2]/div[2]/div[1]/div/h3/a/text()').extract()
        author = response.xpath('//*[@id="app"]/div[3]/div[1]/div[2]/div[2]/div[1]/div/div/div[1]/a[2]/text()').extract() 
        for i, j in zip(title, author):
            item = TestprojectItem()
            item['author'] = j
            item['title'] = i
            yield item  # Wrap the data in an item object and hand it to the pipeline

pipelines.py: save the scraped data

import pymysql
# ------------ Pipeline: save to a local file ------------------------
class TestprojectPipeline(object):
    def __init__(self):
        self.f = None
    def open_spider(self,spider):
        self.f = open('./雪球.txt','w',encoding='utf8')
        
    def process_item(self, item, spider):
        author = item['author']
        title = item['title']
        self.f.write(author + ':' + title + '\n')
        return item    # Pass the item on to the next pipeline object
    def close_spider(self,spider):
        self.f.close()
# ------------- Pipeline: insert into a MySQL database -------------------
class MysqlPipeline(object):
    conn = None
    cur = None
    def open_spider(self,spider):
        self.conn = pymysql.connect(
            host = '192.168.1.4',
            port = 3306,
            user = 'syx',
            password = '123',
            database = 'spider',
            charset = 'utf8',
        )
        self.cur = self.conn.cursor()
    def process_item(self, item, spider):
        author = item['author']
        title = item['title']
        # Parameterized query: the driver escapes the values, so quotes in a title cannot break the statement
        sql = 'insert into xueqiu values(%s, %s)'
        try:
            self.cur.execute(sql, (author, title))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item
    def close_spider(self,spider):
        print('finish')
        self.cur.close()
        self.conn.close()
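
Building the insert statement with string formatting is fragile when a scraped title contains quote characters. The sketch below illustrates the difference using `sqlite3` as a stand-in so it runs without a MySQL server. That swap is an assumption made for the example, as are the sample values. With pymysql the placeholder is `%s` rather than `?`, passed as `cur.execute(sql, (author, title))`.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table xueqiu (author text, title text)')

# Hypothetical values; the title contains double quotes on purpose
author, title = 'someone', 'a "quoted" title'

# String formatting builds broken SQL when the value contains a quote
broken_sql = 'insert into xueqiu values("%s","%s")' % (author, title)
try:
    cur.execute(broken_sql)
    formatting_failed = False
except sqlite3.Error:
    formatting_failed = True

# Placeholders let the driver handle quoting (sqlite3 uses ?, pymysql uses %s)
cur.execute('insert into xueqiu values (?, ?)', (author, title))
conn.commit()

rows = cur.execute('select author, title from xueqiu').fetchall()
print(formatting_failed, rows)
conn.close()
```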

settings.py: configure ITEM_PIPELINES; a lower number means higher priority, so that pipeline runs first

ITEM_PIPELINES = {
    'testproject.pipelines.TestprojectPipeline': 300,
    'testproject.pipelines.MysqlPipeline': 301,
}
# Log level setting, e.g. INFO, DEBUG, ERROR
LOG_LEVEL = 'ERROR'
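
Scrapy's logging is built on Python's standard `logging` module, so the effect of `LOG_LEVEL = 'ERROR'` can be approximated with a plain logger. This is a sketch only: the logger name `demo_spider` is hypothetical, and Scrapy configures its own loggers from the setting rather than through this code.

```python
import logging

# Hypothetical logger name; Scrapy sets up its own loggers from LOG_LEVEL
logger = logging.getLogger('demo_spider')
logger.setLevel(logging.ERROR)

# Check which standard levels would still be emitted at this threshold
levels = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']
shown = [name for name in levels if logger.isEnabledFor(getattr(logging, name))]
print(shown)
```

With the threshold at ERROR, only ERROR and CRITICAL messages pass, which is why the crawl output becomes much quieter.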