Basic usage of the Scrapy framework

Installing Scrapy:

Windows:

      a. pip3 install wheel

      b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      c. Change into the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl

      d. pip3 install pywin32

      e. pip3 install scrapy

I. Basic usage of Scrapy

1. Create a project
  scrapy startproject firstBlood
2. Change into the project directory
  cd proName
3. Create a spider file
  scrapy genspider first www.example.com
    first: the name of the spider file
    www.example.com: the start URL to crawl

4. Run the project: scrapy crawl spiderName

The generated spider file, explained:

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    # unique identifier of the spider file
    name = 'first'
    # allowed domains, used to restrict which links may be crawled
    #allowed_domains = ['www.example.com']
    # start URL list: may only contain URLs
    # every URL stored in this list is requested by Scrapy automatically
    start_urls = ['http://baidu.com/', 'http://www.sogou.com/']

    # used for data parsing
    def parse(self, response):
        print(response)

Three things are usually changed in the settings.py configuration file:

#1. robots.txt protocol; False means the crawler does not obey robots.txt
ROBOTSTXT_OBEY = False

#2. UA spoofing (set a browser User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'


#3. Set the log level
LOG_LEVEL = 'ERROR'

II. Pipeline-based persistent storage

  1. Parse the data in the spider file

  2. Wrap the parsed data in an Item object

  3. Submit the Item object to the pipeline

  4. Perform any form of persistent storage in the pipeline

  5. Enable the pipeline in the configuration file

2.1 Crawl the titles and summaries of jandan.net's design pages and persist them

Spider file code:

# -*- coding: utf-8 -*-
import scrapy


from JiandanPro.items import JiandanproItem

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://jandan.net/tag/设计']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div')
        for div in div_list:
            title = div.xpath('./div/h2/a/text()').extract_first()
            content = div.xpath('.//div[@class="indexs"]/text()').extract()
            content = ''.join(content)
            if title and content:
                item = JiandanproItem()
                item['title'] = title
                item['content'] = content

                yield item  # submit the item to the pipeline

items.py code:

import scrapy

class JiandanproItem(scrapy.Item):
    # define the fields for your item here like:
    # Field is a generic field type that can hold any kind of value
    title = scrapy.Field()
    content = scrapy.Field()

pipelines.py code:

class JiandanproPipeline(object):
    fp = None
    # override a method of the parent class; called once when the spider opens
    def open_spider(self,spider):
        self.fp = open('./data.txt','w',encoding='utf-8')
        print('I am open_spider; I am called only once!')

    # receives each item and performs any form of persistent storage on it
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']

        self.fp.write(title + ':' + content + '\n')

        return item  # pass the item on to the next pipeline class in line

    def close_spider(self, spider):
        self.fp.close()
        print('I am close_spider; I am called only once!')

settings.py configuration:

ITEM_PIPELINES = {
   'JiandanPro.pipelines.JiandanproPipeline': 300,
}

2.2 Backing up the data to MySQL

  1. Each pipeline class is responsible for writing the data to one platform

  2. The item yielded by the spider is only submitted to the pipeline class with the highest priority

  3. How can every pipeline receive the item?

    Simply return the item from the process_item method; it is then passed on to the next pipeline class (see the settings sketch below)
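
A sketch of how settings.py might register all three pipeline classes used in this section, assuming the MySQL and Redis pipeline classes shown below are added to JiandanPro/pipelines.py (the numbers are priorities; a lower number runs first):

ITEM_PIPELINES = {
   'JiandanPro.pipelines.JiandanproPipeline': 300,  # file storage, highest priority
   'JiandanPro.pipelines.MysqlPipeline': 301,       # MySQL backup
   'JiandanPro.pipelines.RedisPipeline': 302,       # Redis backup
}

Because every process_item returns the item, it flows from the 300 pipeline to 301 and then to 302.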

Create the database and table:

create database spider;
use spider;
create table jiandan(title varchar(300),content varchar(500));

Pipeline code:

import pymysql

class MysqlPipeline(object):
    conn = None    # connection object
    cursor = None
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='127.0.0.1',port=3306,user='root',password='wang',db='spider',charset='utf8')
        print(self.conn)
    def process_item(self,item,spider):
        title = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        # a parameterized query avoids breaking on quotes inside the scraped text
        sql = 'insert into jiandan values (%s,%s)'
        try:
            self.cursor.execute(sql, (title, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

Persisting the data to Redis:

from redis import Redis

class RedisPipeline(object):
    conn = None
    def open_spider(self,spider):
        self.conn = Redis(host='127.0.0.1',port=6379)

    def process_item(self,item,spider):
        # writing a dict into redis raises an error on newer redis-py versions
        # pip install -U redis==2.10.6
        self.conn.lpush('dataList',item)
        print(item)
        return item

PS: the pipeline classes must be registered in settings.py (ITEM_PIPELINES) before they take effect.
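
If you would rather not pin redis to 2.10.6, a minimal sketch that serializes the item to JSON before pushing it (RedisJsonPipeline is a hypothetical name; any redis-py version that accepts strings should work):

import json
from redis import Redis

class RedisJsonPipeline(object):
    conn = None
    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # convert the Item to a plain dict and dump it to a JSON string,
        # which any redis-py version can push onto a list
        self.conn.lpush('dataList', json.dumps(dict(item), ensure_ascii=False))
        return item

Remember to register it in ITEM_PIPELINES as well.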

III. Manual request sending and full-site crawling

  yield scrapy.Request(url, callback)              sends a GET request

  yield scrapy.FormRequest(url, formdata, callback)  sends a POST request
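
A minimal sketch of sending a POST request by overriding start_requests (the URL and form data below are placeholders, not from the original post):

import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'postDemo'
    start_urls = ['https://fanyi.baidu.com/sug']   # placeholder URL

    # Scrapy calls start_requests to build the initial requests;
    # override it so the start URLs are requested with POST instead of the default GET
    def start_requests(self):
        data = {'kw': 'dog'}   # placeholder form data
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)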

Example: crawl the first six pages of http://wz.sun0769.com/index.php/question/questionType?page=

import scrapy

class SunSpider(scrapy.Spider):
    name = 'sun'
    #allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30

    def parse(self, response):
        tr_list= response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            print(title)

        if self.page <= 150:
            new_url = format(self.url%self.page)
            self.page += 30
            # manual request sending
            yield scrapy.Request(new_url,callback=self.parse)

IV. The five core components

  1. Engine:

    Handles the data flow of the whole system and triggers events (the core of the framework)

  2. Scheduler:

    Accepts requests from the engine, puts them into a queue, and returns them when the engine asks again

    - deduplication filter

    - queue

  3. Downloader:

    Downloads page content and returns it; the downloader is built on top of Twisted, an efficient asynchronous model

  4. Spider:

    Extracts the information you need from specific pages

  5. Pipeline:

    Processes the entities the spider extracts from pages; its main job is persistent storage

V. Passing data between requests (request meta)

  Purpose: lets Scrapy do deep crawling

  Deep crawling: the data to scrape is not all stored on a single page

    - pass a dict through the meta argument of scrapy.Request(url, callback, meta)

    - receive that dict in the callback via response.meta

Spider code:

# -*- coding: utf-8 -*-
import scrapy
from RequestSendPro.items import RequestsendproItem

class ParamdemoSpider(scrapy.Spider):
    name = 'paramDemo'
    #allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30
    def parse(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            item = RequestsendproItem()
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            item['title'] = title
            detail_url = tr.xpath('./td[2]/a[2]/@href').extract_first()
            # send a request to the detail-page URL
            # meta is a dict that is passed along to the callback
            yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
        if self.page <= 150:
            new_url = format(self.url%self.page)
            self.page += 30
            yield scrapy.Request(new_url,callback=self.parse)
    # parse the detail-page content
    def parse_detail(self,response):
        # receive the meta dict
        item = response.meta['item']
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td//text()').extract()
        content = ''.join(content)
        item['content'] = content
        yield item

VI. Middleware

  Types:

    1. Downloader middleware

    2. Spider middleware

  Purpose: intercept requests and responses in bulk

  Why intercept requests:

    - to set a proxy

      process_exception():

        request.meta['proxy'] = 'http://ip:port'

    - to tamper with request headers (UA)

      process_request():

        request.headers['User-Agent'] = 'xxx'

middlewares.py code:

# -*- coding: utf-8 -*-
from scrapy import signals
import random
user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

class MiddleproDownloaderMiddleware(object):
    # intercept requests
    # spider: the instantiated spider object
    def process_request(self, request, spider):
        print('I am process_request')
        # UA spoofing based on the UA pool
        request.headers['User-Agent'] = random.choice(user_agent_list)

        # proxy (can also be set here)
        # request.meta['proxy'] = 'https://58.246.228.218:1080'
        return None
    # intercept all responses
    def process_response(self, request, response, spider):

        return response
    # intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('I am process_exception')
        # proxy
        request.meta['proxy'] = 'https://58.246.228.218:1080'
        # re-send the corrected request object so it is scheduled again
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
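
For the middleware to actually run it has to be enabled in settings.py; a sketch assuming the project is named MiddlePro (543 is the priority Scrapy's generated template uses by default):

DOWNLOADER_MIDDLEWARES = {
   'MiddlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}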

Spider file code:

import scrapy

class MiddleSpider(scrapy.Spider):
    name = 'middle'
    #allowed_domains = ['www.xx.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        ips = response.xpath('//*[@id="1"]/div[1]/div[1]/div[2]/table//tr/td/span').extract_first()
        print(ips)
        with open('ip.txt','w',encoding='utf-8') as f:
            f.write(page_text)

VII. CrawlSpider

  Overview: a subclass of Spider

  Purpose: used to implement full-site crawling

  Usage:

    1. Create a project

    2. cd ProName

    3. scrapy genspider -t crawl spiderName start_url

  Link extractor (LinkExtractor): extracts links according to a specified rule

  Rule parser (Rule): sends requests for the links the LinkExtractor extracted, then parses the data as specified

Spider file code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MovieSpider(CrawlSpider):
    name = 'movie'
    #allowed_domains = ['www.xx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    # link extractor
    # purpose: extract links (urls) that match the rule given in allow

    link = LinkExtractor(allow=r'id/1/page/\d+\.html')
    rules = (
        # instantiate a Rule object
        # Rule: the rule parser
        # purpose: sends requests for the links the extractor found and parses the data as specified
        Rule(link, callback='parse_item', follow=True),
    )

    # used for data parsing
    def parse_item(self, response):
        # parsing
        print(response)

VIII. Distributed crawling

   Concept: build a cluster of machines, have them all run the same program, and crawl the same data source together

   Implementation: scrapy + redis (the scrapy and scrapy_redis components)

   Why native Scrapy cannot share the work:

    1. The scheduler cannot be shared

    2. The pipeline cannot be shared

    What the scrapy_redis component provides:

    A shareable pipeline and scheduler

    Environment setup:

    pip install scrapy-redis

    Coding workflow:

    Modify the spider file:

    1. Import: from scrapy_redis.spiders import RedisCrawlSpider

    2. Change the spider's parent class to RedisCrawlSpider

    3. Delete start_urls and allowed_domains

    4. Add a redis_key attribute; any string will do as its value

    5. Write the rest of the spider as usual (a sketch of the modified spider follows this section)

    6. Edit the settings.py configuration file:

      - register the pipeline:

      ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
      }

      - register the scheduler:

      # add a dedupe-container class
      DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

      # use the scheduler provided by scrapy_redis
      SCHEDULER = "scrapy_redis.scheduler.Scheduler"

      # whether the scheduler state should persist
      SCHEDULER_PERSIST = True

      - point Scrapy at the shared redis server:

      REDIS_HOST = '192.168.2.201'
      REDIS_PORT = 6379

    - Modify the redis configuration file redis.windows.conf

    - Start the redis server and client

    - Push the start URL into the shared scheduling queue
      - the queue lives in the redis database
      - use redis-cli
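
A minimal sketch of what the modified spider file might look like, assuming scrapy-redis is installed; the spider name, queue name, and allow pattern are placeholders, not from the original post:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # start_urls and allowed_domains have been removed
    # redis_key names the shared request queue in redis; any string works
    redis_key = 'fbsQueue'

    rules = (
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # normal data parsing goes here
        print(response)

To kick off the crawl, push the start URL into the shared queue from redis-cli, for example: lpush fbsQueue http://wz.sun0769.com/index.php/question/questionType?page=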

To be continued...

Original post: https://www.cnblogs.com/guniang/p/11748416.html