Web Crawlers 10: The Scrapy Framework (Part 3: Deep Crawling)

  • Deep crawling means the data for one record is spread across several pages, typically a list page plus its detail pages

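  • In Scrapy each page is parsed in its own callback, so without passing data between requests the fields from the list page and the detail page cannot be combined into one item and persisted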

  • How to implement it:

    • scrapy.Request(url, callback, meta)
      • meta is a dict; Scrapy makes it available to the callback
    • Read meta inside the callback
      • response.meta

           <details>
           <summary>Click to view code</summary>
        
           ```
        
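                import scrapy

                # Assumption: standard Scrapy project layout, with MovieproItem defined in items.py
                from ..items import MovieproItem


                class MovieSpider(scrapy.Spider):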
                    name = 'movie'
                    # allowed_domains = ['www.baidu.com']
                    start_urls = ['https://bj.5i5j.com/xiaoqu/xichengqu/']
        
                    def parse(self, response):
                        li_lst = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
                        # print(li_lst)
                        for li in li_lst:
                            title = li.xpath('./div[2]/h3/a/text()').extract_first()
                            print(title)
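                            detail_url = li.xpath('./div[1]/a/@href').extract_first()
                            if not detail_url:
                                continue  # skip entries without a detail link
                            # urljoin resolves relative and absolute hrefs alike
                            detail_url = response.urljoin(detail_url)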
        
                            item = MovieproItem()
                            item['title'] = title
        
                            # Request the detail page.
                            # meta is a dict that Scrapy hands to the callback via response.meta
                            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})
        
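                    # Parses the data on the detail page
                    def parse_detail(self, response):
                        # Retrieve the item passed in through meta
                        item = response.meta['item']
                        desc_lst = response.xpath('/html/body/div[5]/div[3]/div[3]/div[1]/div/ul/li')
                        # Collect every description line, then yield once: one item per detail page.
                        # Assumption: MovieproItem declares desc = scrapy.Field()
                        item['desc'] = [li.xpath('./span/text()').extract_first() for li in desc_lst]
                        yield item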
        
           ```
           </details>
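
Once the detail callback yields a complete item, an item pipeline can persist it. The post does not show its pipeline, so the following is only a sketch under assumptions: the class name `JsonLinesPipeline`, the default file name `items.jl`, and the sample item are hypothetical. A Scrapy pipeline is a plain class with `open_spider`, `process_item`, and `close_spider` methods, which lets the sketch run on its own with the methods called by hand.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: write each finished item as one JSON line."""

    def __init__(self, path='items.jl'):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict() accepts both plain dicts and scrapy.Item instances.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Offline demonstration; in a real project Scrapy calls these methods itself.
demo_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'Sample title', 'desc': ['line 1', 'line 2']}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding='utf-8') as f:
    saved = [json.loads(line) for line in f]
print(saved)  # [{'title': 'Sample title', 'desc': ['line 1', 'line 2']}]
```

In a project, the class would be enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300}`, where `myproject` is a placeholder for the actual project package.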
Original post: https://www.cnblogs.com/FGdeHB/p/15506541.html