Web Scraping using Python Scrapy_BS4

Scrapy Architecture

 Creating a Spider.

  Spiders are classes that you define that Scrapy uses to scrape(extract) information from a website(s).

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quote"
    start_urls = [
        'https://bluelimelearning.github.io/my-fav-quotes/'
    ]

    def parse(self, response):
        for quote in response.css('div.quotes'):
            yield{
                'quote':quote.css('p.aquote::text').extract(),
                'author':quote.css('p.author::text').extract_first(),
            }

Running your spider and saving scrapped data.

scrapy runspider quotes_spiders.py -o quotes.xml

 

https://www.cleancss.com/strip-xml/

Scraping data with Scrapy Shell

scrapy shell "https://bluelimelearning.github.io/my-fav-quotes/"

 

response.css('title')

response.css('title::text').extract()

response.css('h1::text').extract()

quote = response.css("div.quotes")[0]
aquote = quote.css("p.aquote::text").extract()
aquote

相信未来 - 该面对的绝不逃避,该执著的永不怨悔,该舍弃的不再留念,该珍惜的好好把握。
原文地址:https://www.cnblogs.com/keepmoving1113/p/11795043.html