第七篇 css选择器实现字段解析

CSS选择器的作用实际和xpath的一样，都是为了定位具体的元素

举例我要爬取下面这个页面的标题

In [20]: title = response.css(".entry-header h1")

In [21]: title
Out[21]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' entry-header ')]/descendant-or-self::*/h1" data='<h1>谷歌用两年时间研究了 180 个团队，发现高效团队有这五个特征</h1>'>]

In [22]: title = response.css(".entry-header h1").extract()

In [23]: title
Out[23]: ['<h1>谷歌用两年时间研究了 180 个团队，发现高效团队有这五个特征</h1>']

In [24]: ##可以使用css的::text取到内容

In [25]: title = response.css(".entry-header h1::text").extract()

In [26]: title
Out[26]: ['谷歌用两年时间研究了 180 个团队，发现高效团队有这五个特征']

获取文章创建日期：

In [38]: date_text = response.css(".entry-meta-hide-on-mobile").extract()

In [39]: date_text
Out[39]: ['<p class="entry-meta-hide-on-mobile">

            2017/08/23 ·  <a href="http://blog.jobbole.com/category/career/" rel="category tag">职场</a>
            
                            · <a href="#article-comment"> 7 评论 </a>
            

            
             ·  <a href="http://blog.jobbole.com/tag/google/">Google</a>, <a href="http://blog.jobbole.com/tag/%e5%9b%a2%e9%98%9f/">团队</a>
            
</p>']

In [40]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()

In [41]: date_text
Out[41]: 
['

            2017/08/23 ·  ',
 '
            
                            · ',
 '
            

            
             ·  ',
 ', ',
 '
            
']

In [42]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
    ...: 0]

In [43]: date_text
Out[43]: '

            2017/08/23 ·  '

In [44]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
    ...: 0].strip()

In [45]: date_text
Out[45]: '2017/08/23 ·'

In [46]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
    ...: 0].strip().replace("·","").strip()

In [47]: date_text
Out[47]: '2017/08/23'

获取评论数

In [49]: comment_num = response.css("a[href='#article-comment']")

In [50]: comment_num
Out[50]: 
[<Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"> 7 评论 </a>'>,
 <Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"><span class="'>]

In [51]: comment_num = response.css("a[href='#article-comment'] span::text").ext
    ...: ract()

In [52]: comment_num
Out[52]: [' 7 评论']

In [53]: comment_num = response.css("a[href='#article-comment'] span::text").ext
    ...: ract().strip()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-53-18ae8761867f> in <module>()
----> 1 comment_num = response.css("a[href='#article-comment'] span::text").extract().strip()

AttributeError: 'list' object has no attribute 'strip'

In [54]: comment_num = response.css("a[href='#article-comment'] span::text").ext
    ...: ract()[0]

In [55]: comment_num
Out[55]: ' 7 评论'

In [56]:

View Code

PS:css选择器里，不同标签使用空格隔开