[Python crawler] Scrapy basics 5: chaining a regex after xpath and other selectors

For example, suppose we want to debug this page: https://g.widora.cn/

The Scrapy shell does not depend on a project environment, so it can be started from any directory:

scrapy shell https://g.widora.cn/

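This is similar to pressing F12 on a page in the browser. All available objects are listed, and response is the one you will normally use.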

(earlier output omitted)

2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1118626d0>
[s]   item       {}
[s]   request    <GET https://g.widora.cn/>
[s]   response   <200 https://g.widora.cn/>
[s]   settings   <scrapy.settings.Settings object at 0x111bd7890>
[s]   spider     <DefaultSpider 'default' at 0x112103250>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector

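Looking up a group number: xpath() and similar selector methods can be followed by re(), but extract(), get() and the like cannot, because they return plain strings (or lists of strings) rather than selectors.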

In [1]: response.xpath("/html/body/div/div[5]/p/a").extract()                   

Out[1]: ['<a target="_blank" href="//shang.qq.com/wpa/qunwpa?idkey=f65cb90612db81ef9bee771440adb40c004933a18b7c0466a279486936aedc79" src="title=" style="color:#00a1d6">G.widora.cn 群(1031687050)</a>']

In [2]: response.xpath("/html/body/div/div[5]/p/a/text()").extract()            

Out[2]: ['G.widora.cn 群(1031687050)']

In [3]: response.xpath("/html/body/div/div[5]/p/a/text()")                      

Out[3]: [<Selector xpath='/html/body/div/div[5]/p/a/text()' data='G.widora.cn 群(1031687050)'>]

In [4]: response.xpath("/html/body/div/div[5]/p/a/text()").re(r'\d+')

Out[4]: ['1031687050']
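
If extract() or get() has already been called, the result is ordinary Python data, so the regex has to be applied with the standard re module instead. The following is only a minimal sketch: the extracted list is copied by hand from Out[2] above, and the helper name digits_from_strings is made up for illustration and is not part of Scrapy.

```python
import re

# Assumption: this list is copied by hand from Out[2] above; it stands in for
# what response.xpath(...).extract() returns (a plain list of str).
extracted = ['G.widora.cn 群(1031687050)']

# A plain list has no .re() method, which is why .extract().re(...) fails.
assert not hasattr(extracted, 're')


def digits_from_strings(strings):
    """Hypothetical helper: collect every run of digits from a list of strings."""
    pattern = re.compile(r'\d+')
    found = []
    for s in strings:
        found.extend(pattern.findall(s))
    return found


group_numbers = digits_from_strings(extracted)
print(group_numbers)
```

For this input the helper gives the same list as In [4]. Calling .re() directly on the selector is shorter, so the fallback is mainly useful when the strings are already extracted.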

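Typing all this in the terminal is tedious, so it is easier to get the XPath working in the browser first and then write the code.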

 

Original post: https://www.cnblogs.com/hightech/p/12853158.html