###
Handling pagination in the spider
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_5_1_0_2_0_1.html']

    def parse(self, response):
        ret = response.xpath("//table//td//span/a")
        for i in ret:
            yield {
                "text": i.xpath('./text()').extract_first(),
                "href": i.xpath('./@href').extract_first()
            }
        # the second-to-last <a> in the pager is the "next page" link
        next_url = response.xpath("//div[@class='page']/a[last()-1]/@href").extract_first()
        # stop when there is no next page (on the last page the link is javascript:void(0))
        if next_url and next_url != "javascript:void(0)":
            next_url = "https://www.17k.com" + next_url
            yield scrapy.Request(next_url, callback=self.parse)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl spider1 --nolog".split())
How to think about this pagination
1. The handling: the callback is parse itself, because the next page is processed exactly the same way as the current one. It works much like recursion, and it ends when there is no next_url left.
2. If the next page needed different handling, you would define a separate callback function for it.
######
Setting the User-Agent
None of the requests above set a User-Agent or any other headers. How do you set them?
In settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'
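Other default headers can be added the same way, through Scrapy's standard DEFAULT_REQUEST_HEADERS setting. A sketch (the header values below are just examples, not from the original notes):

# settings.py -- sent with every request unless a request overrides them
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}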
####
Using MongoDB when processing the data
from pymongo import MongoClient

myclient = MongoClient()
mycollection = myclient["dbname"]["tablename"]


class MyspiderPipeline:
    def process_item(self, item, spider):
        # insert() is removed in newer pymongo versions; insert_one() with a plain dict works
        mycollection.insert_one(dict(item))
        print(item)
        return item
MongoDB is used here to store the data.
1. Start MongoDB locally first: mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork
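One step the notes leave out: the pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming the default project layout with the pipeline class above living in scrapy_demo1/pipelines.py (adjust the module path to your own project):

# settings.py -- enable the pipeline; lower numbers run earlier
ITEM_PIPELINES = {
    "scrapy_demo1.pipelines.MyspiderPipeline": 300,
}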
###
Learning how to use items
The item code:
import scrapy


class ScrapyDemo1Item(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
    clickNum = scrapy.Field()
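A quick sketch of how such an item behaves (my own example, not from the notes): it can be used like a dict, but only the declared fields are allowed, so a typo in a field name fails immediately instead of silently producing bad data.

from scrapy_demo1.items import ScrapyDemo1Item

item = ScrapyDemo1Item()
item["title"] = "some book"    # fine: title is a declared field
print(dict(item))              # items convert cleanly to plain dicts
# item["titel"] = "oops"       # would raise KeyError: the field was never declared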
####
The spider code: it collects data from the list page and the detail page and merges them.
import scrapy
from scrapy_demo1.items import ScrapyDemo1Item


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_5_1_0_2_0_1.html']

    def parse(self, response):
        ret = response.xpath("//table//td//span/a")
        for i in ret:
            item = ScrapyDemo1Item()
            item["title"] = i.xpath('./text()').extract_first()
            item["href"] = "https:" + i.xpath('./@href').extract_first()
            # request the detail page, carrying the partly-filled item along via meta
            yield scrapy.Request(
                item["href"],
                callback=self.getDetail,
                meta={"item": item}
            )
        # pagination
        next_url = response.xpath("//div[@class='page']/a[last()-1]/@href").extract_first()
        if next_url and next_url != "javascript:void(0)":
            next_url = "https://www.17k.com" + next_url
            yield scrapy.Request(next_url, callback=self.parse)

    def getDetail(self, response):
        item = response.meta["item"]
        item["clickNum"] = response.xpath("//td[@id='hb_week']/text()").extract_first()
        yield item


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl spider1 --nolog".split())
#####
These points really need to be understood:
1. yield a Request to ask for the next page.
2. meta is used to pass the item along to the detail-page request.
3. item = ScrapyDemo1Item() must go inside the for loop; otherwise every request shares one and the same item object and the results get mixed up (see the sketch below).
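A tiny stand-alone illustration of why (my own example, using plain dicts as a stand-in for ScrapyDemo1Item):

rows = ["book A", "book B", "book C"]

# WRONG: one shared object, mutated on every iteration
shared = {}
collected = []
for title in rows:
    shared["title"] = title
    collected.append(shared)       # every entry points at the same dict
print(collected)                   # three copies of {'title': 'book C'}

# RIGHT: a fresh object per iteration
collected = []
for title in rows:
    collected.append({"title": title})
print(collected)                   # [{'title': 'book A'}, {'title': 'book B'}, {'title': 'book C'}]

In Scrapy the effect is delayed because the detail callbacks run later, which makes the bug harder to spot.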
Note:
1. The detail-page callback has to end with yield item, which is what hands the item over to the pipeline.
2. The difference between extract_first() and extract(): one returns only the first match, the other returns all matches as a list (see the sketch after this list).
3. Scrapy runs on the Twisted async framework, so parse and the detail callback may be processed concurrently; there is no guaranteed before/after order.
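A small sketch of point 2, using a scrapy Selector on an inline HTML snippet (my own example):

from scrapy import Selector

sel = Selector(text="<ul><li>a</li><li>b</li><li>c</li></ul>")
print(sel.xpath("//li/text()").extract())        # ['a', 'b', 'c'] -- every match, as a list
print(sel.xpath("//li/text()").extract_first())  # 'a' -- just the first match
print(sel.xpath("//p/text()").extract_first())   # None -- no match returns None instead of raising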
####
Deeper use of pipelines
class ScrapyDemo1Pipeline:
    def open_spider(self, spider):
        print("open")

    def close_spider(self, spider):
        print("close")

    def process_item(self, item, spider):
        print(item)
        # item["hello"] = "world"
        # print(spider.name)   # the spider argument tells you which spider sent the item
        return item
####
1. open_spider runs exactly once, at startup; things like a database connection can go there, and the benefit is that you connect only once (a sketch follows this list).
2. close_spider runs only once, when the spider finishes.
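A sketch of that idea, reusing the local MongoDB and the placeholder database/collection names from earlier (the class name MongoPipeline is hypothetical and would also need to be registered in ITEM_PIPELINES):

from pymongo import MongoClient


class MongoPipeline:
    def open_spider(self, spider):
        # connect once, when the crawl starts
        self.client = MongoClient()
        self.collection = self.client["dbname"]["tablename"]

    def close_spider(self, spider):
        # close the connection once, when the crawl ends
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item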
####