Pitfalls I ran into with Scrapy

1. I originally wanted to use Scrapy together with Selenium to crawl 什么值得买 (smzdm.com) and ran into a strange problem. Straight to the code:

    # Imports this snippet relies on (module level, not shown in the original):
    # import time
    # from selenium import webdriver
    # from scrapy.selector import Selector

    def start_requests(self):
        self.logger.info("starting")
        browser = webdriver.Firefox()
        browser.get(self.start_url)
        last_height = browser.execute_script("return document.body.scrollHeight")
        print(last_height)
        count = 0
        while True:
            print(count)
            if count == 2:
                break
            # Scroll to the bottom to trigger the infinite-scroll loading, then wait.
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            new_height = browser.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
            time.sleep(1.2)
            count = count + 1
        source = browser.page_source
        browser.close()

        scrapy_selector = Selector(text=source)
        items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
        self.logger.info('There is a total of ' + str(len(items_selector)) + ' links.')
        try:
            s = 0
            for item_selector in items_selector:
                print(s)
                print(item_selector.getall())
                # Wrong ("Keep in mind that if you are nesting selectors and use an XPath that
                # starts with /, that XPath will be absolute to the document and not relative
                # to the Selector you're calling it from."):
                # url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')

                # Wrong (multiple classes should be matched with
                # *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]):
                # url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

                url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")

                # assert isinstance(url_selector, scrapy.selector.Selector)
                print(url_selector.extract())
                # self.logger.info("sss" + url)

                url = url_selector.get()
                s = s + 1
                # self.logger.info("sss" + url)
        except Exception as e:
            self.logger.info('Reached last iteration #' + str(e) + str(s))

        return
browser.page_source is the full HTML of the page loaded in the browser, scrapy_selector is a selector built on that whole page, and items_selector is the list of selectors for the div blocks that represent the product entries on the homepage. All of that works fine. The problem is with this line:
url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')
item_selector is the selector for a single product div block, and print(item_selector.getall()) prints the expected content. The problem is url_selector, the selector for the product link inside that block: print(url_selector.extract()) turns out to be a list of 18 URLs:
['https://www.smzdm.com/p/20610761/#hfeeds', 'https://www.smzdm.com/p/20601553/#hfeeds', 'https://www.smzdm.com/p/20597500/#hfeeds', 'https://www.smzdm.com/p/20603303/#hfeeds', 'https://www.smzdm.com/p/20613198/#hfeeds', 'https://www.smzdm.com/p/20601438/#hfeeds', 'https://www.smzdm.com/p/20615602/#hfeeds', 'https://www.smzdm.com/p/20596520/#hfeeds', 'https://www.smzdm.com/p/20617429/#hfeeds', 'https://www.smzdm.com/p/20607426/#hfeeds', 'https://www.smzdm.com/p/20615296/#hfeeds', 'https://www.smzdm.com/p/20618224/#hfeeds', 'https://www.smzdm.com/p/20603149/#hfeeds', 'https://www.smzdm.com/p/20604376/#hfeeds', 'https://www.smzdm.com/p/20603224/#hfeeds', 'https://www.smzdm.com/p/20615599/#hfeeds', 'https://www.smzdm.com/p/20615846/#hfeeds', 'https://www.smzdm.com/p/20586712/#hfeeds']
There are 60 item_selectors in total, and the url_selector inside every single one of them is the same 18-element list (why it is exactly the first 18 counted from the top, I have no idea).
The cause turned out to be this sentence in the official documentation:
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you're calling it from.

The cause is now clear: if an XPath starts with / or //, it is evaluated against the whole document instead of relative to the item_selector it is called from. So I changed it to:

url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

and found it matched nothing. Then I came across another passage on the official site:

Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose: *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')].

If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements that you want, if they have a different class name that shares the string someclass.

Still rather cumbersome, but after changing it to the following, the URLs came out correctly:
url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")
Since this is verbose, the official docs suggest selecting by class with CSS first and then chaining XPath on top:
>>> sel.css('.shout').xpath('./time/@datetime').getall()

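To make both pitfalls concrete, here is a minimal, self-contained sketch; the markup and class names are made up for illustration and are not from the real page:

from scrapy.selector import Selector

html = '''
<div class="item"><h5 class="title has-price"><a href="/p/1"></a></h5></div>
<div class="item"><h5 class="title"><a href="/p/2"></a></h5></div>
'''
sel = Selector(text=html)
for item in sel.xpath('//div[@class="item"]'):
    # Absolute XPath: evaluated against the whole document, so every item sees both links.
    print(item.xpath('//a/@href').getall())          # ['/p/1', '/p/2'] every time
    # Relative, but @class="title" requires an exact attribute match,
    # so the first h5 (class "title has-price") is missed.
    print(item.xpath('.//h5[@class="title"]/a/@href').getall())
    # Relative plus tolerant class matching: picks up exactly this item's link.
    print(item.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), ' title ')]/a/@href").getall())

As far as I know, newer parsel/Scrapy versions also ship a has-class() XPath extension (e.g. .//h5[has-class('title')]) that reads better than the contains/concat idiom, but check the version you are on.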
2. The page HTML was:

<h1 class="item-name">
                                <span class="edit_interface"></span>
                                闲鱼出售全新ipad pro 2018翻车日记                            </h1>
but goods_scrapy_selector.xpath("//article/h1/text()") comes back as a two-element list:

['\n                                ', '\n                                闲鱼出售全新ipad pro 2018翻车日记                            ']
At first I guessed that every line break produced a separate entry, but the XPath spec says:

text() selects all text node children of the context node. (from https://www.w3.org/TR/1999/REC-xpath-19991116/#section-String-Functions)

In other words, text() returns every text node that is a direct child of h1. Because h1 also contains a span, its direct text is split into two nodes: the whitespace before the span, and the node holding "闲鱼出售全新ipad pro 2018翻车日记" (padded with more whitespace) after the span. Hence the two-element list.
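If all you want is the title string itself, a small sketch (assuming the same //article/h1 structure shown above) could collapse the text nodes like this:

# Join the text nodes and strip the surrounding whitespace...
parts = goods_scrapy_selector.xpath('//article/h1/text()').getall()
title = ''.join(parts).strip()    # '闲鱼出售全新ipad pro 2018翻车日记'

# ...or let XPath do it: normalize-space() takes the string value of the whole
# h1 (the span is empty) and collapses the whitespace in one go.
title = goods_scrapy_selector.xpath('normalize-space(//article/h1)').get()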

3. scrapy 2.0.1
Scrapy's console log originally looked like this:
2020-05-28 22:56:06,765 - smzdm_jingxuan - INFO - smzdm_jingxuan spider starting

I wanted to change the output format to include the line number. The first thing I tried was logging.basicConfig:
import logging
import scrapy

class SmzdmSpider(scrapy.Spider):

    name = 'smzdm_jingxuan'
    allowed_domains = ['spider.smzdm']
    start_urls = ("http://books.toscrape.com/",)
    # logging.basicConfig(level=logging.INFO,
    #     format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s',
    #     datefmt='%Y-%m-%d %H:%M')

    logging.basicConfig(
        format='%(asctime)s,%(msecs)d %(levelname)-8s [%(pathname)s:%(lineno)d in function %(funcName)s] %(message)s',
        datefmt='%Y-%m-%d:%H:%M:%S',
        level=logging.INFO)

    logger = logging.getLogger(__name__)

With that, the console log became:

2020-05-28 22:53:25,197 - smzdmCrawler.spiders.smzdm_jingxuan - INFO - smzdm_jingxuan spider starting

A slight change, but still no line number, and the output does not match the configured format either. (I don't know why.)

Searching around, I found that settings.py accepts logging settings such as LOG_FILE, LOG_ENABLED and LOG_FORMAT, so I set:

IMAGES_STORE = '/Users/gaoxianghu/temp/image'

LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'

LOG_ENABLED = False

LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'

It turned out that LOG_ENABLED = False had no effect: logs were still written to both the console and the log file. The log file did switch to the LOG_FORMAT format, but the console output stayed the same. (I don't know why.)

As for LOG_ENABLED = False not taking effect, someone online suggested doing this instead:


logging.getLogger('scrapy').propagate = False
The format in the log file became:
[2020-05-28 22:16:12] p41288 {/Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py:38} INFO - smzdm_jingxuan spider starting
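Going back to the propagate suggestion above: a sketch of where such a line could live and what it actually does; I have not verified that it is the right fix for the LOG_ENABLED issue.

import logging

# Records emitted by the 'scrapy' logger (and its children such as
# scrapy.core.engine) stop propagating to the root logger's handlers,
# which effectively silences Scrapy's own output while leaving this
# project's loggers untouched.
logging.getLogger('scrapy').propagate = False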

Later I found that putting the logging config inside settings.py does change the console log:

settings.py:
logging.basicConfig(level=logging.DEBUG,
    format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s')

Console output:

2020-05-28 23:30:45,008 /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py smzdm_jingxuan.py parse 34 INFO - smzdm_jingxuan spider starting
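For reference, the logging-related pieces of my settings.py collected in one place; this is a sketch of my own setup (paths included), not a general recommendation:

# settings.py
import logging

LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
# Format used by Scrapy's own handler, i.e. what ends up in LOG_FILE:
LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'

# The basicConfig call here is what finally changed the console format for me;
# apparently settings.py is imported early enough for it to take effect.
logging.basicConfig(level=logging.DEBUG,
    format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s')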

4. When I used a relative import in a Scrapy spider, running scrapy crawl smzdm_jingxuan failed with: attempted relative import with no known parent package. The original absolute import, from smzdmCrawler.items import SmzdmItem, had worked fine. The relative import was:

from .. import items

The project layout:

smzdmCrawler
|--model
|--spiders
|--|--smzdm_jingxuan.py
|--items.py
|--__init__.py
|--main.py

Since there is already an __init__.py under smzdmCrawler, smzdmCrawler is a package. Online answers say that whether relative imports work is determined by __name__; in my case it came down to the directory scrapy crawl smzdm_jingxuan was run from. Running it from /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler failed; running it from /Users/gaoxianghu/git/cheap/smzdmCrawler worked.
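For context, the two import styles side by side as they appeared in the spider file (a sketch; only one of them would be used at a time):

# smzdmCrawler/spiders/smzdm_jingxuan.py

# Absolute import -- worked fine all along:
from smzdmCrawler.items import SmzdmItem

# Relative import -- only worked when "scrapy crawl" was run from
# /Users/gaoxianghu/git/cheap/smzdmCrawler (the outer project directory,
# presumably the one holding scrapy.cfg):
from .. import items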

5. When deploying with scrapyd, be aware that the program cannot read environment variables at that point. After scrapyd-deploy uploads the project, the code gets interpreted once, and if that pass fails because an environment variable cannot be read, deployment errors out. For example, with code like the following:

import os
from logging.handlers import TimedRotatingFileHandler

# today_str (today's date string) and LOG_FILE are defined elsewhere in
# settings.py; as noted below, LOG_FILE is only passed in production.
SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)

# Only production passes LOG_FILE here
if LOG_FILE:
    log_file = LOG_FILE
    image_file = '/data/image/' + today_str
else:
    if SCRAPY_ENV is None:
        log_file = '/Users/gaoxianghu/temp/scraping.log'
        image_file = '/Users/gaoxianghu/temp/image/' + today_str
    else:
        log_file = '/data/log/scrapy/scraping.log'
        image_file = '/data/image/' + today_str

logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)

Although I had set the SCRAPY_ENV environment variable on the server, it cannot be read here because the code is not running in a shell environment, so execution fell through to log_file = '/Users/gaoxianghu/temp/scraping.log', a path that does not exist on the server, and it errored. I then scheduled the spider with

curl http://david_scrapyd:david_2021@42.192.51.99:6801/schedule.json -d project=smzdmCrawler -d spider=smzdm_single -d setting=LOG_FILE=/data/log/scrapy/scraping.log

Even though LOG_FILE is passed as a setting here, so by the logic of the code log_file should not fall back to '/Users/gaoxianghu/temp/scraping.log', the same error still appeared, except that the log shows the code running from a temporary egg file. Why it still fails I'm not sure yet, since when running the service locally I verified that the LOG_FILE passed in is picked up. My guess is that because the deploy-time interpretation pass failed, the code still gets interpreted once before running, which is where the error comes from.

File "/tmp/smzdmCrawler-1614340245-de7610pr.egg/smzdmCrawler/settings.py"
FileNotFoundError: [Errno 2] No such file or directory: '/Users/gaoxianghu/temp/scraping.log'
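One way to keep that deploy-time import from blowing up is to make the path handling defensive. A hedged sketch (not what the project currently does):

import os
from logging.handlers import TimedRotatingFileHandler

# Create the log directory if it is missing, instead of letting the handler
# fail when settings.py is imported on a machine where the path does not exist.
os.makedirs(os.path.dirname(log_file), exist_ok=True)
logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)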

6. One more thing to note with scrapyd. According to https://github.com/scrapy/scrapyd-client#scrapyd-deploy:

You may want to keep certain settings local and not have them deployed to Scrapyd. To accomplish this you can create a local_settings.py file at the root of your project, where your scrapy.cfg file resides, and add the following to your project's settings:

try:
    from local_settings import *
except ImportError:
    pass
scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.

From my own testing: when scrapyd runs locally and the project is deployed to that local scrapyd, local_settings.py is still readable even though it does not end up inside the egg, which is a bit odd. Deployed to a remote scrapyd it is indeed inaccessible, so the statement above presumably refers to the remote case.
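For illustration, the kind of machine-specific values that might live in such a local_settings.py (hypothetical contents, reusing paths from this post):

# local_settings.py -- sits next to scrapy.cfg, pulled in by the try/except
# above, and never included in the egg that scrapyd-deploy builds.
LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
IMAGES_STORE = '/Users/gaoxianghu/temp/image'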

喜欢艺术的码农
Original post: https://www.cnblogs.com/zjhgx/p/12771271.html