Scrapy Crawler Day 2: Running a Simple Spider

Configuring settings.py

Disable robots.txt compliance

ROBOTSTXT_OBEY = False

Set a User-Agent

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3294.99 Safari/537.36',
}
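If only the User-Agent needs changing, Scrapy also reads a top-level `USER_AGENT` setting in settings.py, which is a lighter-weight alternative to overriding the whole default header dict. A minimal sketch (the UA string is just an example value):

```python
# settings.py -- USER_AGENT is a built-in Scrapy setting; the string below
# is an example value, substitute any browser UA you like
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3294.99 Safari/537.36'
```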

Add start.py

To make it easy to run the spider from an IDE, create a start.py file in the same directory as the other project modules (settings.py, items.py, etc.):

from scrapy import cmdline

# The argument after "crawl" must match the spider's `name` attribute
# defined inside the spider class, not the .py file name
cmdline.execute("scrapy crawl wx_spider".split())
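`cmdline.execute()` expects an argv-style list, the same tokens the shell would pass to the `scrapy` command; calling `str.split()` on the command string produces exactly that, which is why the one-liner works. A quick check of the split:

```python
# str.split() turns the command string into the argv list that
# cmdline.execute() (and the shell) ultimately receive
cmd = "scrapy crawl wx_spider".split()
print(cmd)  # ['scrapy', 'crawl', 'wx_spider']
```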

Directory tree

E:.
│  scrapy.cfg
│
└─BookSpider
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  start.py 
    │  __init__.py
    │
    ├─spiders
    │  │  biqubao_spider.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          biqubao_spider.cpython-36.pyc
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

Add the following code to the spider to print out the page content:

# biqubao_spider.py
def parse(self, response):
    print("*" * 50)
    print(response.text)
    print("*" * 50)
Original article: https://www.cnblogs.com/luocodes/p/11794113.html