CSV

Scraping CSV-format data works much the same way as scraping XML and other formats.

We will use the following table:

name    sex    addr          email
Alex    Boy    Los Angeles   alex@hotstone.com
Coy     Girl   Los Angeles   coy@hotstone.com
Couch   Boy    California    couch@hotstone.com
Tom     Girl   New York      tom@hotstone.com

Create a project:

$ scrapy startproject mycsv
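
This generates the standard Scrapy project skeleton, roughly as follows (the exact set of files varies slightly by Scrapy version):

mycsv/
    scrapy.cfg
    mycsv/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py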

Create a spider from the CSV template:

$ cd mycsv
$ scrapy genspider -t csvfeed mycsvspider localhost

Write the items code (only name and sex will be extracted):

import scrapy
 
class MycsvItem(scrapy.Item):
    name = scrapy.Field()
    sex = scrapy.Field()

Write the spider file:

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider
from mycsv.items import MycsvItem
 
class MycsvspiderSpider(CSVFeedSpider):
    name = 'mycsvspider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/feed.csv']
    # Define the CSV column headers
    headers = ['name', 'sex', 'addr', 'email']
    # Define the field delimiter
    delimiter = ','

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    # Called once per CSV row; `row` is a dict keyed by the headers above
    def parse_row(self, response, row):
        i = MycsvItem()
        i['name'] = row['name']
        i['sex'] = row['sex']
        print("Name:")
        print(i['name'])
        print("Sex:")
        print(i['sex'])
        print("---------------------------")
        return i

Save the table above in the project directory as a file named feed.csv, with fields separated by commas.
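
With the sample table above, feed.csv looks like this:

name,sex,addr,email
Alex,Boy,Los Angeles,alex@hotstone.com
Coy,Girl,Los Angeles,coy@hotstone.com
Couch,Boy,California,couch@hotstone.com
Tom,Girl,New York,tom@hotstone.com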

Start a local HTTP server with Docker; its sole purpose is to make the csv file reachable over HTTP:

$ cd mycsv
$ docker run -d -w /data -p 80:8080 -v ${PWD}:/data slzcc/java-webserver:jenkins-java-webserver-14 java -jar /usr/src/app/app.jar 8080
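
If Docker is not available, any static file server can stand in. A lighter-weight alternative is Python's built-in server; note that it listens on port 8000 here, so start_urls would have to point at http://localhost:8000/feed.csv instead:

$ python3 -m http.server 8000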

Once the server is up, verify that the file is reachable:
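
For example, request the file with curl; it should echo back the contents of feed.csv:

$ curl http://localhost/feed.csv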

Create a main.py file so the crawl can be launched from a script:

from scrapy import cmdline
cmdline.execute("scrapy crawl mycsvspider".split())
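
Run it with Python from the project root; this is equivalent to running scrapy crawl mycsvspider on the command line:

$ python main.py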

The result is as follows:
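
Given the print calls in parse_row, each row produces a block like the one below in the crawl output, interleaved with Scrapy's own log lines (note that because headers is set explicitly in the spider, Scrapy may also pass the file's header line through parse_row as an ordinary data row):

Name:
Alex
Sex:
Boy
---------------------------
Name:
Coy
Sex:
Girl
---------------------------

and so on for the remaining rows.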

Original article: https://www.cnblogs.com/dalton/p/11353857.html