Straight to the code, recorded here for reference. Date: 20190906.
I started out scraping Beike (贝壳网), but it has anti-scraping measures I couldn't get around, so I switched to Lianjia (链家网). Along the way I learned that Chinese characters can be used as variable names.
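That claim is easy to verify: Python 3 allows non-ASCII identifiers (PEP 3131), so a Chinese name like 单价 (unit price) works both as a variable and as a dict key. The values below are just illustrative, not real listing data:

```python
# PEP 3131: Python 3 identifiers may use non-ASCII letters,
# so Chinese variable names like 单价 are perfectly legal.
单价 = 28000
总价 = 300

# a Chinese key behaves like any other string key
item = {'单价': 单价, 'totalprice': 总价}
print(item['单价'])   # prints 28000
```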
spider.py
```python
# -*- coding: utf-8 -*-
import scrapy

from beike.items import BeikeItem


class BeikewSpider(scrapy.Spider):
    name = 'beikew'
    allowed_domains = ['lianjia.com']
    start_urls = ['https://su.lianjia.com/ershoufang/']
    page = 1

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[1]/ul/li')
        for li in li_list:
            item = BeikeItem()
            name = li.xpath('./div[1]/div[1]/a/text()').extract_first()
            单价 = li.xpath('./div[1]/div[6]/div[2]/span/text()').extract_first()
            totalprice = li.xpath('./div[1]/div[6]/div[1]/span/text()').extract_first()
            xiaoqu = li.xpath('./div[1]/div[2]/div/a/text()').extract_first()
            local = li.xpath('./div[1]/div[3]/div/a/text()').extract_first()
            item['name'] = name
            item['单价'] = 单价  # tried Chinese here, and learned it works as a variable name
            item['totalprice'] = totalprice
            item['xiaoqu'] = xiaoqu
            item['local'] = local
            yield item

        if self.page <= 50:  # scrapes 50 pages of data; change as you like
            self.page += 1
            new_page_url = 'https://su.lianjia.com/ershoufang/pg' + str(self.page)
            yield scrapy.Request(url=new_page_url, callback=self.parse)
```
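The paging branch at the bottom can be pulled out as a plain function to see exactly which URLs it generates. Note that with `page` starting at 1 and the `<= 50` check, the spider actually requests pg2 through pg51 on top of the start URL. The function name below is hypothetical, not part of the project:

```python
def next_page_url(page, last_page=50):
    # mirrors the spider's paging logic: bump the counter, build the
    # pgN listing URL, and stop once the counter passes last_page
    if page <= last_page:
        page += 1
        return page, 'https://su.lianjia.com/ershoufang/pg' + str(page)
    return page, None

page, url = next_page_url(1)
print(url)   # https://su.lianjia.com/ershoufang/pg2

page, url = next_page_url(50)
print(url)   # https://su.lianjia.com/ershoufang/pg51

page, url = next_page_url(51)
print(url)   # None: paging stops here
```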
items.py
```python
import scrapy


class BeikeItem(scrapy.Item):
    xiaoqu = scrapy.Field()
    name = scrapy.Field()
    单价 = scrapy.Field()
    totalprice = scrapy.Field()
    local = scrapy.Field()
```
settings.py
```python
BOT_NAME = 'beike'  # these settings were enabled or added in settings.py
SPIDER_MODULES = ['beike.spiders']
NEWSPIDER_MODULE = 'beike.spiders'
# the original set this twice ('utf-8', then 'gb18030'); only the last
# assignment takes effect, so the export comes out as gb18030
FEED_EXPORT_ENCODING = 'gb18030'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
```
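Both encodings round-trip Chinese text losslessly; gb18030 is mainly a convenience for opening the exported CSV in Excel on a Chinese-locale Windows without mojibake. A quick comparison (the sample string is just an illustration):

```python
s = '苏州二手房'  # sample Chinese text, not real scraped data

# both encodings round-trip Chinese losslessly
assert s.encode('utf-8').decode('utf-8') == s
assert s.encode('gb18030').decode('gb18030') == s

# utf-8 needs 3 bytes per common CJK character, gb18030 needs 2
print(len(s.encode('utf-8')), len(s.encode('gb18030')))   # 15 10
```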
Only these three .py files were touched; everything else was generated by the scrapy commands and left at its defaults.
Execution result: