The Scrapy Framework: Automatic Site-Wide Crawling with CrawlSpider

Ways to crawl a whole site

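　　1. Crawl the whole site recursively, depth-first or breadth-first, issuing each request by hand with scrapy.Request (see the related post on site-wide image crawling).

　　2. For sites whose URLs follow a regular pattern, use CrawlSpider to crawl the whole site automatically.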

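CrawlSpider is a subclass of Spider. Like Spider, it is one of Scrapy's spider classes, but as a subclass it adds a feature of its own: link extraction through LinkExtractor. Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls. When the crawl has to continue with URLs extracted from the pages already fetched, CrawlSpider is the better fit.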
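To make the idea of a link extractor concrete, here is a minimal standard-library sketch of what such a component does conceptually: collect the href values on a page, resolve them to absolute URLs, drop duplicates, and keep only those matching an allow regex. This is not Scrapy's implementation; HrefCollector, extract_links, and the sample HTML are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated URLs that match the allow regex."""
    collector = HrefCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    seen, links = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if pattern.search(url) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Made-up page content, for illustration only
sample_html = """
<a href="/Items/1.html">first</a>
<a href="/about.html">about</a>
<a href="Items/2.html">second</a>
<a href="/Items/1.html">first again</a>
"""
links = extract_links(sample_html, "http://www.xxx.com/", r"Items/")
print(links)
```

In a real CrawlSpider, LinkExtractor plays this role and the resulting URLs are scheduled as requests automatically.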

Creating the project

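# Create the project; the name CrawlSpiderPro is arbitrary
scrapy startproject CrawlSpiderPro
# Change into the project directory
cd CrawlSpiderPro
# Create the spider file; compared with an ordinary spider, this adds the "-t crawl" option
scrapy genspider -t crawl crawlSpiderTest www.xxx.com
# Run the spider
scrapy crawl crawlSpiderTest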

The generated spider file explained

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlspidertestSpider(CrawlSpider):
    name = 'crawlSpiderTest'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']
    # rules holds the crawl rules; each Rule is one rule, and several can be defined
    rules = (
        # Rule pairs a link extractor with a callback;
        # LinkExtractor extracts the full URLs that match the allow regex;
        # callback names the method that parses the responses fetched for this rule;
        # follow controls whether links are extracted again from the pages this rule fetched;
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
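The effect of follow is easier to see on a toy example. The sketch below is a simplified model, not Scrapy's scheduler: SITE is a hypothetical five-page site held in a dict, and crawl is a made-up function that applies a single rule. The start page is always scanned for links; pages reached through the rule are scanned again only when follow is True.

```python
import re
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "/list?page=1": ["/list?page=2", "/list?page=3"],
    "/list?page=2": ["/list?page=3", "/list?page=4"],
    "/list?page=3": ["/list?page=4", "/list?page=5"],
    "/list?page=4": ["/list?page=5"],
    "/list?page=5": [],
}


def crawl(start_url, allow, follow):
    """Return the URLs visited, in order, for one rule with the given follow flag."""
    pattern = re.compile(allow)
    visited = [start_url]
    seen = {start_url}  # duplicate filter, so each URL is requested once
    queue = deque([(start_url, True)])  # the start page is always scanned
    while queue:
        url, scan_links = queue.popleft()
        if not scan_links:
            continue  # fetched and parsed, but its links are not extracted
        for link in SITE[url]:
            if pattern.search(link) and link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, follow))
    return visited


no_follow = crawl("/list?page=1", r"page=\d+", follow=False)
with_follow = crawl("/list?page=1", r"page=\d+", follow=True)
print(no_follow)    # only the pages linked from the start page
print(with_follow)  # every page reachable through pagination links
```

With follow=False the crawl stops at the pages linked directly from the start page; with follow=True it walks the pagination until no new links appear.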

Full-site crawl example: 东莞阳光网 (Dongguan Sunshine Net, http://wz.sun0769.com/index.php/question/report?page=)

　　1. The spider script crawlSpiderTest.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CrawlSpiderPro.items import CrawlspiderproItem


class CrawlspidertestSpider(CrawlSpider):
    name = 'crawlSpiderTest'
    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://wz.sun0769.com/index.php/question/report?page=']
    # rules holds the crawl rules; each Rule is one rule, and several can be defined
    rules = (
        # Rule pairs a link extractor with a callback;
        # LinkExtractor extracts the full URLs that match the allow regex;
        # callback names the method that parses the responses fetched for this rule;
        # follow controls whether links are extracted again from the pages this rule fetched;
        # when follow is omitted it defaults to False if a callback is given, otherwise True;
        # with follow=False only links found on the start page are crawled; use follow=True to keep paging
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=False),  # pagination links
        Rule(LinkExtractor(allow=r'question/\d+/\d+\.shtml'), callback='parse_detail'),  # detail pages
    )

    def parse_item(self, response):
        # Browsers add tbody when rendering; the raw HTML usually lacks it, so it is left out of the XPath
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]/tr/td/table/tr')

        for tr in tr_list:
            # Create a fresh item per row so that rows already yielded are not overwritten
            item = CrawlspiderproItem()
            item['identifier'] = tr.xpath('./td[1]/text()').extract_first()  # record number
            item['title'] = tr.xpath('./td[2]/a[2]/text()').extract_first()  # title (relative path, note the leading dot)
            yield item

    def parse_detail(self, response):
        item = CrawlspiderproItem()
        # tbody is omitted here as well, for the same reason as above
        number = response.xpath('/html/body/div[9]/table[1]/tr/td[2]/span[2]/text()').extract_first()
        # Guard against a missing node before splitting off the number
        item['identifier'] = number.split(':')[-1] if number else None
        item['content'] = "".join(response.xpath('/html/body/div[9]/table[2]//text()').extract())

        yield item
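The backslashes in the allow patterns matter: in a regular expression, \d matches a digit, while a bare d matches only the letter "d", and an unescaped dot matches any character. The quick check below uses Python's re module directly. The detail URL is a hypothetical sample in the shape the pattern expects, not one taken from the site.

```python
import re

# The listing URL follows the start URL; the detail URL is a made-up sample.
page_url = "http://wz.sun0769.com/index.php/question/report?page=30"
detail_url = "http://wz.sun0769.com/html/question/201909/426989.shtml"

# Without the backslash, "d+" means "one or more letters d".
broken_page = re.compile(r"page=d+")
fixed_page = re.compile(r"page=\d+")

broken_detail = re.compile(r"question/d+/d+.shtml")
fixed_detail = re.compile(r"question/\d+/\d+\.shtml")

print(bool(broken_page.search(page_url)))      # False: no letter d after "page="
print(bool(fixed_page.search(page_url)))       # True
print(bool(broken_detail.search(detail_url)))  # False
print(bool(fixed_detail.search(detail_url)))   # True

# An unescaped dot accepts any character in that position.
loose_dot = re.compile(r"question/\d+/\d+.shtml")
print(bool(loose_dot.search("question/1/2xshtml")))     # True: "." matched "x"
print(bool(fixed_detail.search("question/1/2xshtml")))  # False
```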


　　2. Field definitions in items.py

import scrapy


# Alternatively, define two item classes and store them separately,
# then match them up in the pipeline by the identifier field before persisting
class CrawlspiderproItem(scrapy.Item):

    # record number
    identifier = scrapy.Field()
    # title
    title = scrapy.Field()
    # content
    content = scrapy.Field()
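The comment about matching list-page and detail-page data by identifier can be sketched as a pipeline that keeps one record per identifier and fills in fields as items arrive. This is only an illustration under assumptions: MergeByIdentifierPipeline and its records attribute are hypothetical, plain dicts stand in for the items so the snippet runs without Scrapy, and it presumes both pages yield the same identifier string.

```python
class MergeByIdentifierPipeline:
    """Hypothetical pipeline: combines list-page and detail-page items per identifier."""

    def open_spider(self, spider):
        # One merged record per identifier, built up as items arrive
        self.records = {}

    def process_item(self, item, spider):
        data = dict(item)  # works for dicts and for scrapy.Item instances
        identifier = data.get("identifier")
        if identifier is None:
            return item  # nothing to match on, pass the item through unchanged
        key = str(identifier).strip()
        record = self.records.setdefault(key, {"identifier": key})
        for field, value in data.items():
            if field != "identifier" and value is not None:
                record[field] = value
        return item


# Exercise the pipeline by hand with made-up data
pipeline = MergeByIdentifierPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"identifier": "1001", "title": "Road repair"}, spider=None)
pipeline.process_item({"identifier": "1001", "content": "Details of the request."}, spider=None)
merged = pipeline.records["1001"]
print(merged)
```

The merged records could then be written out in a close_spider method once the crawl has finished.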


　　3. Pipeline configuration in pipelines.py

# Custom persistence handling
class CrawlspiderproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
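The pipeline above only prints each item. As one possible way to persist them, the sketch below writes every item as a line of JSON, using the open_spider and close_spider hooks that Scrapy calls on a pipeline at the start and end of a crawl. JsonLinesPipeline, its path argument, and the file name items.jl are assumptions made for illustration. The pipeline is driven by hand here so the snippet runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes every item to a JSON Lines file."""

    def __init__(self, path="items.jl"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()


# Exercise the pipeline by hand with a made-up item
output_path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(output_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"identifier": "1001", "title": "示例标题"}, spider=None)
pipeline.close_spider(spider=None)

with open(output_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(saved)
```

To use a pipeline like this in the project, it would also need its own entry in ITEM_PIPELINES in settings.py.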


　　4. settings.py configuration

# Spoof the User-Agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL = 'ERROR'

# Enable the item pipeline
ITEM_PIPELINES = {
   'CrawlSpiderPro.pipelines.CrawlspiderproPipeline': 300,
}