Scrapy framework usage (2): pagination, storing data in MongoDB, passing data with meta, using items, deeper use of pipelines

### Handling pagination in the spider

import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_5_1_0_2_0_1.html']

    def parse(self, response):
        # Each <a> inside the table cells holds one book title and its link
        ret = response.xpath("//table//td//span/a")
        for i in ret:
            print(i.xpath('./text()').extract_first())
            print(i.xpath('./@href').extract_first())

            yield {
                "text": i.xpath('./text()').extract_first(),
                "href": i.xpath('./@href').extract_first()
            }

        # The second-to-last link in the pager is the "next page" link
        next_url = response.xpath("//div[@class='page']/a[last()-1]/@href").extract_first()
        # On the last page the href is "javascript:void(0)", which ends the "recursion"
        if next_url and next_url != "javascript:void(0)":
            next_url = "https://www.17k.com" + next_url
            print(next_url)
            yield scrapy.Request(next_url, callback=self.parse)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl spider1 --nolog".split())

Understanding this pagination:

1. The callback is the parse method itself, because the next page is handled exactly the same way as the current one. It works a lot like recursion, and it ends when there is no more next_url (on the last page the pager link is "javascript:void(0)").

2. If the next page had to be handled differently, you would define a separate callback function for it.

###### Setting the User-Agent

None of the requests above set a User-Agent or any other headers. How is that configured?

In settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'
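
Besides this project-wide USER_AGENT setting, the same idea can be scoped more narrowly. A minimal sketch (not from the original post) of two other places Scrapy lets you set it:

import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    # Per-spider override: custom_settings takes precedence over settings.py
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36',
    }

    def parse(self, response):
        # Per-request override: headers passed here apply to this single request only
        yield scrapy.Request(
            "https://www.17k.com/",
            headers={"User-Agent": "my-custom-agent"},
            callback=self.parse,
        )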

#### Using MongoDB when processing the data

from pymongo import MongoClient

# Connect to the local MongoDB instance (default host and port)
myclient = MongoClient()
mycollection = myclient["dbname"]["tablename"]


class MyspiderPipeline:
    def process_item(self, item, spider):
        # insert_one() expects a plain dict (Collection.insert() was removed in pymongo 4)
        mycollection.insert_one(dict(item))
        print(item)
        return item

MongoDB is used here to store the data.

1. Start MongoDB locally first: mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork
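
The pipeline only takes effect if it is enabled in settings.py. A short sketch, assuming the project module is called myspider (a hypothetical name; use your own project's path and class name):

# settings.py (hypothetical project name "myspider")
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,  # lower number runs earlier
}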

### Learning to use items

The item code:

import scrapy


class ScrapyDemo1Item(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
    clickNum = scrapy.Field()
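
An Item behaves much like a dict, except that only the declared fields may be assigned. A quick usage sketch (not from the original post):

item = ScrapyDemo1Item()
item["title"] = "some book"   # OK: 'title' is a declared Field
item["href"] = "https://www.17k.com/"
# item["author"] = "x"        # would raise KeyError: 'author' is not declared
print(dict(item))             # convert to a plain dict, e.g. before inserting into MongoDB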

#### The spider code: scrape the list page and the detail page, and merge the data

import scrapy
from scrapy_demo1.items import ScrapyDemo1Item


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_5_1_0_2_0_1.html']

    def parse(self, response):
        ret = response.xpath("//table//td//span/a")
        for i in ret:
            # Create a fresh item for every row; see note 3 below
            item = ScrapyDemo1Item()
            item["title"] = i.xpath('./text()').extract_first()
            item["href"] = "https:" + i.xpath('./@href').extract_first()

            # Request the detail page and carry the partially filled item along in meta
            yield scrapy.Request(
                item["href"],
                callback=self.getDetail,
                meta={"item": item}
            )

        # Pagination
        next_url = response.xpath("//div[@class='page']/a[last()-1]/@href").extract_first()
        if next_url and next_url != "javascript:void(0)":
            next_url = "https://www.17k.com" + next_url
            print(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

    def getDetail(self, response):
        # Take the item back out of meta and complete it with detail-page data
        item = response.meta["item"]
        item["clickNum"] = response.xpath("//td[@id='hb_week']/text()").extract_first()
        yield item


if __name__ == '__main__':
    from scrapy import cmdline

    cmdline.execute("scrapy crawl spider1 --nolog".split())

##### Key points to understand here

1. yield a Request to fetch the next page.

2. Use meta to carry the item into the detail-page request.

3. item = ScrapyDemo1Item() must go inside the for loop. Otherwise every iteration reuses and overwrites the same item object, and the results get mixed up.

Notes:

1. The detail-page handler must yield the item at the end, so that it is passed on to the pipeline.

2. Know the difference between extract_first() and extract(): the first returns only the first match, the second returns a list of all matches (see the sketch after this list).

3. Scrapy is built on the Twisted asynchronous framework, so parse and the detail callback may run interleaved rather than strictly one after the other.
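
A minimal sketch of the extract_first() / extract() difference mentioned in note 2 (the HTML here is made up for illustration):

from scrapy.selector import Selector

body = "<ul><li>a</li><li>b</li></ul>"
sel = Selector(text=body)

print(sel.xpath("//li/text()").extract_first())  # 'a'  -> just the first match
print(sel.xpath("//li/text()").extract())        # ['a', 'b'] -> list of all matches
print(sel.xpath("//p/text()").extract_first())   # None -> no match, no exception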

#### Deeper use of pipelines

class ScrapyDemo1Pipeline:

    def open_spider(self, spider):
        # Runs exactly once, when the spider starts
        print("open")

    def close_spider(self, spider):
        # Runs exactly once, when the spider finishes
        print("close")

    def process_item(self, item, spider):
        # Runs for every item; spider.name tells you which spider produced it
        print(item)
        return item

####

1. open_spider runs exactly once, at startup. Things like a database connection are best placed in open_spider, so the connection is made only once (see the sketch after this list).

2. close_spider runs exactly once, when the spider finishes, and is the place to release such resources.
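
A minimal sketch combining this with the earlier MongoDB pipeline (my own rearrangement, not code from the original post): the connection is opened once in open_spider and closed in close_spider.

from pymongo import MongoClient


class MongoPipeline:

    def open_spider(self, spider):
        # Connect once, when the spider starts
        self.client = MongoClient()
        self.collection = self.client["dbname"]["tablename"]

    def close_spider(self, spider):
        # Close the single connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item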


Original article: https://www.cnblogs.com/andy0816/p/15058709.html