The Scrapy framework

Scrapy is a comprehensive, batteries-included web crawling framework.

Installation:

    - Win:
        Download the Twisted wheel from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

        pip3 install wheel
        pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl

        pip3 install pywin32

        pip3 install scrapy
    - Linux:
        pip3 install scrapy

Usage

    # Create a project
    scrapy startproject xdb

    cd xdb

    # Create spiders
    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com

    # Start a spider
    scrapy crawl chouti

Workflow

1. Create a project
    scrapy startproject project_name

    project_name
       project_name/
            - spiders                # spider files
                - chouti.py
                - cnblogs.py
                ....
            - items.py               # persistence (item definitions)
            - pipelines.py           # persistence (item pipelines)
            - middlewares.py         # middleware
            - settings.py            # configuration (crawling)
       scrapy.cfg                    # configuration (deployment)

2. Create spiders
    cd project_name

    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com

3. Start a spider
    scrapy crawl chouti
    scrapy crawl chouti --nolog
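items.py and pipelines.py are where persistence lives: spiders yield items, and each enabled pipeline's `process_item` receives them in turn. A pipeline is a plain Python class that Scrapy calls by convention, so a minimal sketch looks like this (the class name `NewsPipeline` and the `'href'` field are illustrative assumptions, not part of the generated project):

```python
# A minimal item-pipeline sketch. NewsPipeline and the 'href' field are
# hypothetical names; Scrapy calls these hooks by convention -- no base
# class is required.
class NewsPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.f = open('news.log', mode='a+', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.f.write(item['href'] + '\n')
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()
```

To enable it, register the class under `ITEM_PIPELINES` in settings.py with a priority number, e.g. `{'xdb.pipelines.NewsPipeline': 300}`.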
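Before `scrapy crawl` will actually fetch much from real sites, a few values in settings.py usually need adjusting. The lines below are common tweaks shown as an illustration, not the defaults Scrapy generates:

```python
# settings.py (excerpt) -- common tweaks, shown as an illustration
ROBOTSTXT_OBEY = False      # generated projects default to True; many targets disallow bots
DEPTH_LIMIT = 2             # stop following links beyond this depth (0 = unlimited)
USER_AGENT = 'Mozilla/5.0'  # some sites reject the default Scrapy user agent
```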

A concrete example: crawling chouti

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import sys, io

# Work around the Windows console encoding so Chinese text prints correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']  # focused crawler: only crawl this site
    start_urls = ['http://chouti.com/']  # start URLs

    def parse(self, response):  # callback
        f = open('news.log', mode='a+', encoding='utf-8')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # print(item_list)
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href.strip())
            print(text.strip())
            f.write(href + '\n')
        f.close()

        # Follow the pagination links, e.g. https://dig.chouti.com/all/hot/recent/2
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)
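Concatenating the base URL by hand works here because the hrefs are absolute paths, but the usual way to build absolute URLs in a Scrapy callback is `response.urljoin(page)`, which resolves the href against the page that was just downloaded. The stdlib `urllib.parse.urljoin` implements the same resolution rules and shows the behavior:

```python
from urllib.parse import urljoin

# An absolute path ('/all/...') replaces the old path on the same host,
# which is exactly what the manual concatenation above was emulating.
base = 'https://dig.chouti.com/all/hot/recent/1'
print(urljoin(base, '/all/hot/recent/2'))  # https://dig.chouti.com/all/hot/recent/2
```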
Original post: https://www.cnblogs.com/qinghuani/p/9235537.html