scrapy

Windows 10

Here I am using Python 3.7.

Create a virtual environment

pip install virtualenv
pip install virtualenvwrapper-win
# create the virtual environment first_pro
mkvirtualenv first_pro

If this reports "'mkvirtualenv' is not recognized as an internal or external command", add the directory containing mkvirtualenv.bat to your PATH environment variable.

# delete the virtual environment
rmvirtualenv first_pro
# exit the virtual environment
deactivate
# activate the virtual environment first_pro
workon first_pro
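
To confirm the environment is actually active, a quick check from inside Python shows which interpreter is running (my addition, not in the original post):

import sys

# with first_pro active, this prints a path inside the virtualenv,
# e.g. ...\Envs\first_pro\Scripts\python.exe
print(sys.executable)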


Install the scrapy package

pip install scrapy
If that fails, follow these steps:
1. Download a Twisted wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
In the directory containing the downloaded wheel: pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl
2. Install scrapy
pip install scrapy
If you hit the error "Consider using the `--user` option or check the permissions":
pip install --user scrapy
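
A quick way to confirm the install worked before going further:

import scrapy

# prints the installed version, e.g. 1.6.x; an ImportError here means
# the Twisted/scrapy install above did not complete
print(scrapy.__version__)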

Create a scrapy project

Step 1:
scrapy startproject mySpider
Step 2:
cd mySpider
Step 3 (the two arguments are the spider name and the domain to crawl):
scrapy genspider lvbo s.hc360.com
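
genspider only writes a skeleton into mySpider/spiders/lvbo.py; with the scrapy 1.x default template it looks roughly like this, ready to be filled in:

# -*- coding: utf-8 -*-
import scrapy


class LvboSpider(scrapy.Spider):
    name = 'lvbo'
    allowed_domains = ['s.hc360.com']
    start_urls = ['http://s.hc360.com/']

    def parse(self, response):
        # every response from start_urls is handed to this callback
        pass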

scrapy directory structure

mySpider
  scrapy.cfg
  -mySpider
    -spiders
      __init__.py
      lvbo.py
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py

Example: crawling hc360.com (慧聪网) for aluminum foil (铝箔) listings

lvbo.py

import scrapy
from mySpider.items import MyspiderItem


class LvboSpider(scrapy.Spider):
    name = 'lvbo'
    # left empty so detail pages on other hc360 subdomains are not filtered out
    allowed_domains = []
    start_urls = ['https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94']

    def parse(self, response):
        li_list = response.xpath("//div[@class='s-layout']//div[@class='wrap-grid']//li")
        for li in li_list:
            item = MyspiderItem()
            url = li.xpath(".//div[@class='NewItem']//a/@href").extract_first()
            if url:
                # hrefs are protocol-relative, so prepend the scheme
                url = 'https:' + url
                yield scrapy.Request(
                    url,
                    callback=self.parse_detail,
                    meta={"item": item}
                )
        # do not call time.sleep() here: it blocks the Twisted reactor;
        # throttling is handled by DOWNLOAD_DELAY in settings.py

        # pagination: follow the 下一页 (next page) link
        next_url = response.xpath("//a[text()='下一页']/@href").extract_first()
        if next_url:
            next_url = 'https:' + next_url
            print(next_url)
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):
        # the item created in parse() travels here via the request meta
        item = response.meta["item"]

        company_name = response.xpath("//div[@class='word-box']/div/div[@class='p sate']/em/text()").extract_first()
        name = response.xpath("//div[@class='word-box']/div/div[@class='p name']/em/text()").extract_first()
        phone = response.xpath("//div[@class='word-box']/div/div[@class='p tel2']/em/text()").extract_first()
        if company_name and name and phone:
            item["company_name"] = company_name.strip()
            # replace non-breaking spaces before trimming
            item["name"] = name.replace(u'\xa0', u' ').strip()
            item["phone"] = phone.strip()
            print(item)
            yield item
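The XPath expressions above are easiest to work out interactively in scrapy shell before writing them into the spider; the >>> lines run at the prompt it opens:

scrapy shell "https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94"
>>> response.xpath("//div[@class='s-layout']//div[@class='wrap-grid']//li")
>>> response.xpath("//a[text()='下一页']/@href").extract_first()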

settings.py

# log level
LOG_LEVEL = "WARNING"
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# wait 3 seconds between downloads
DOWNLOAD_DELAY = 3
# enable the item pipeline
ITEM_PIPELINES = {
   'mySpider.pipelines.MyspiderPipeline': 300,
}
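
The post keeps Scrapy's default user agent, but many sites block it; if requests start coming back empty, setting a browser-like USER_AGENT in settings.py is a common first fix (the string below is just an example, not from the original post):

# example only: any recent browser UA string will do
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0 Safari/537.36'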

pipelines.py

import json


class MyspiderPipeline(object):
    # open the output file when the spider starts
    def open_spider(self, spider):
        self.file = open('lvbo.txt', 'w', encoding='utf-8')

    # close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()

    # write each item to the file as one JSON object per line
    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the file
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
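
The same result can be had with Scrapy's built-in exporters instead of hand-rolled json calls; a minimal sketch (my addition, class name JsonLinesPipeline is hypothetical), which would be registered in ITEM_PIPELINES the same way as MyspiderPipeline:

from scrapy.exporters import JsonLinesItemExporter


class JsonLinesPipeline(object):
    def open_spider(self, spider):
        # exporters expect a file opened in binary mode
        self.file = open('lvbo.jl', 'wb')
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item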

items.py

import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    company_name = scrapy.Field()
    name = scrapy.Field()
    position = scrapy.Field()
    phone = scrapy.Field()
    site = scrapy.Field()
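
An Item behaves like a dict but only accepts the fields declared above, which catches field-name typos early; a small illustration (my addition, not from the original post):

from mySpider.items import MyspiderItem

item = MyspiderItem()
item['name'] = 'some contact'   # fine: 'name' is a declared field
# item['names'] = '...'         # would raise KeyError: undeclared field
print(dict(item))               # {'name': 'some contact'}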

Run

In a DOS command window, change into the mySpider directory: cd mySpider

...mySpider> scrapy crawl lvbo
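
The pipeline is not the only way to capture output; Scrapy's -o switch writes scraped items straight to a file (adding FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps the Chinese text readable):

...mySpider> scrapy crawl lvbo -o lvbo.json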

To debug with PyCharm, create a start.py file in the same directory as settings.py.

start.py

# -*- coding:utf-8 -*-
from scrapy import cmdline

# equivalent to typing "scrapy crawl lvbo" on the command line
cmdline.execute("scrapy crawl lvbo".split())
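
An alternative that stays on Scrapy's public API instead of emulating the CLI is running the crawl in-process; a sketch (my addition), assuming the script runs from inside the project so scrapy.cfg can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# loads settings.py via scrapy.cfg, then runs the lvbo spider
process = CrawlerProcess(get_project_settings())
process.crawl('lvbo')
process.start()  # blocks until the crawl finishes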

In PyCharm's Run/Debug Configuration, set Script path to: C:\code\mySpider\mySpider\start.py

Original post: https://www.cnblogs.com/aqiuboke/p/11132754.html