A quick guide to Scrapy's link extractors

1. Environment

macOS

Scrapy 1.3.3

conda env

2. Using dushu.com (读书网) as the example:
   https://www.dushu.com/book/1107.html

2.1 Extracting with an XPath rule

# open an interactive Scrapy shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it with an XPath rule
link = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass in the response and extract
link.extract_links(response)

# this outputs the extracted links

Note: the XPath rule passed to the extractor must select elements, not attributes.

Wrong: //div[@class="pages"]/a/@href

Right: //div[@class="pages"]/a

2.2 Extracting with a CSS rule

# open an interactive Scrapy shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it with a CSS rule
link = LinkExtractor(restrict_css=r'.pages > a')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass in the response and extract
link.extract_links(response)

# this outputs the extracted links


Note: when the CSS rule drills down to a tag, mind the spaces around the child combinator:

Wrong: r'.pages>a'

Right: r'.pages > a'


2.3 Extracting with a regular expression

# open an interactive Scrapy shell
scrapy shell https://www.dushu.com/book/1107.html

# import the link extractor
from scrapy.linkextractors import LinkExtractor

# instantiate it with a regex rule (note the escaped \d and \.)
link = LinkExtractor(allow=r'/book/1107_\d+\.html')
# inspect the object
link

<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x7f8d05d42278>

# pass in the response and extract
link.extract_links(response)

# this outputs the extracted links
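Mind the escaping in the pattern: `\d+` matches the page number and `\.` a literal dot. A quick sanity check with the standard `re` module (the sample paths are invented for illustration):

```python
import re

pattern = r'/book/1107_\d+\.html'

# matches paginated listing pages
assert re.search(pattern, '/book/1107_2.html')
assert re.search(pattern, '/book/1107_13.html')
# does not match the first page, which has no "_N" suffix
assert not re.search(pattern, '/book/1107.html')
```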

Because the XPath and CSS rules select whole <a> elements rather than the href attribute itself, the resulting Link objects carry some extra data, but don't worry: Scrapy cleans this up automatically.

Personally, I recommend the regex approach, paired with multi-page downloading.

Original article: https://www.cnblogs.com/cheflone/p/13639814.html