Pitfalls I ran into with Scrapy

1. I originally wanted to use Scrapy together with Selenium to crawl 什么值得买 (smzdm.com) and ran into a strange problem. Straight to the code:

    # Imports this snippet relies on (module level, not shown in the original):
    # import time
    # from selenium import webdriver
    # from scrapy.selector import Selector

    def start_requests(self):
        self.logger.info("starting")
        browser = webdriver.Firefox()
        browser.get(self.start_url)
        last_height = browser.execute_script("return document.body.scrollHeight")
        print(last_height)
        count = 0
        while True:
            print(count)
            if count == 2:
                break
            # Scroll to the bottom to trigger the infinite-scroll loading, then wait.
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            new_height = browser.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
            time.sleep(1.2)
            count = count + 1
        source = browser.page_source
        browser.close()

        scrapy_selector = Selector(text=source)
        items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
        self.logger.info('There is a total of ' + str(len(items_selector)) + ' links.')
        try:
            s = 0
            for item_selector in items_selector:
                print(s)
                print(item_selector.getall())
                # Wrong ("Keep in mind that if you are nesting selectors and use an XPath that
                # starts with /, that XPath will be absolute to the document and not relative
                # to the Selector you're calling it from."):
                # url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')

                # Wrong (multiple classes should be matched with
                # *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]):
                # url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

                url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")

                # assert isinstance(url_selector, scrapy.selector.Selector)
                print(url_selector.extract())
                # self.logger.info("sss" + url)

                url = url_selector.get()
                s = s + 1
                # self.logger.info("sss" + url)
        except Exception as e:
            self.logger.info('Reached last iteration #' + str(e) + str(s))

        return
browser.page_source is the full HTML of the page loaded in the browser, scrapy_selector is a selector built on that whole page, and items_selector is the list of selectors for the div blocks that represent the product entries on the homepage. All of that works fine. The problem is with this line:
url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')
item_selector is the selector for a single product div block, and print(item_selector.getall()) prints the expected content. The problem is url_selector, the selector for the product link inside that block: print(url_selector.extract()) turns out to be a list of 18 URLs:
['https://www.smzdm.com/p/20610761/#hfeeds', 'https://www.smzdm.com/p/20601553/#hfeeds', 'https://www.smzdm.com/p/20597500/#hfeeds', 'https://www.smzdm.com/p/20603303/#hfeeds', 'https://www.smzdm.com/p/20613198/#hfeeds', 'https://www.smzdm.com/p/20601438/#hfeeds', 'https://www.smzdm.com/p/20615602/#hfeeds', 'https://www.smzdm.com/p/20596520/#hfeeds', 'https://www.smzdm.com/p/20617429/#hfeeds', 'https://www.smzdm.com/p/20607426/#hfeeds', 'https://www.smzdm.com/p/20615296/#hfeeds', 'https://www.smzdm.com/p/20618224/#hfeeds', 'https://www.smzdm.com/p/20603149/#hfeeds', 'https://www.smzdm.com/p/20604376/#hfeeds', 'https://www.smzdm.com/p/20603224/#hfeeds', 'https://www.smzdm.com/p/20615599/#hfeeds', 'https://www.smzdm.com/p/20615846/#hfeeds', 'https://www.smzdm.com/p/20586712/#hfeeds']
There are 60 item_selectors in total, and the url_selector inside every single one of them is the same 18-element list (why it is exactly the first 18 counted from the top, I have no idea).
The cause turned out to be this sentence in the official documentation:
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you're calling it from.

The cause is now clear: if an XPath starts with / or //, it is evaluated against the whole document instead of relative to the item_selector it is called from. So I changed it to:

url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

and found it matched nothing. Then I came across another passage on the official site:

Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose: *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')].

If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements that you want, if they have a different class name that shares the string someclass.

Still rather cumbersome, but after changing it to the following, the URLs came out correctly:
url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")
Since this is verbose, the official docs suggest selecting by class with CSS first and then chaining XPath on top:
>>> sel.css('.shout').xpath('./time/@datetime').getall()

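To make both pitfalls concrete, here is a minimal, self-contained sketch; the markup and class names are made up for illustration and are not from the real page:

from scrapy.selector import Selector

html = '''
<div class="item"><h5 class="title has-price"><a href="/p/1"></a></h5></div>
<div class="item"><h5 class="title"><a href="/p/2"></a></h5></div>
'''
sel = Selector(text=html)
for item in sel.xpath('//div[@class="item"]'):
    # Absolute XPath: evaluated against the whole document, so every item sees both links.
    print(item.xpath('//a/@href').getall())          # ['/p/1', '/p/2'] every time
    # Relative, but @class="title" requires an exact attribute match,
    # so the first h5 (class "title has-price") is missed.
    print(item.xpath('.//h5[@class="title"]/a/@href').getall())
    # Relative plus tolerant class matching: picks up exactly this item's link.
    print(item.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), ' title ')]/a/@href").getall())

As far as I know, newer parsel/Scrapy versions also ship a has-class() XPath extension (e.g. .//h5[has-class('title')]) that reads better than the contains/concat idiom, but check the version you are on.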
2. The page HTML was:

<h1 class="item-name">
                                <span class="edit_interface"></span>
                                闲鱼出售全新ipad pro 2018翻车日记                            </h1>
but goods_scrapy_selector.xpath("//article/h1/text()") comes back as a two-element list:

['\n                                ', '\n                                闲鱼出售全新ipad pro 2018翻车日记                            ']
At first I guessed that every line break produced a separate entry, but the XPath spec says:

text() selects all text node children of the context node. (from https://www.w3.org/TR/1999/REC-xpath-19991116/#section-String-Functions)

In other words, text() returns every text node that is a direct child of h1. Because h1 also contains a span, its direct text is split into two nodes: the whitespace before the span, and the node holding "闲鱼出售全新ipad pro 2018翻车日记" (padded with more whitespace) after the span. Hence the two-element list.
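If all you want is the title string itself, a small sketch (assuming the same //article/h1 structure shown above) could collapse the text nodes like this:

# Join the text nodes and strip the surrounding whitespace...
parts = goods_scrapy_selector.xpath('//article/h1/text()').getall()
title = ''.join(parts).strip()    # '闲鱼出售全新ipad pro 2018翻车日记'

# ...or let XPath do it: normalize-space() takes the string value of the whole
# h1 (the span is empty) and collapses the whitespace in one go.
title = goods_scrapy_selector.xpath('normalize-space(//article/h1)').get()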

3. scrapy 2.0.1
Scrapy's console log originally looked like this:
2020-05-28 22:56:06,765 - smzdm_jingxuan - INFO - smzdm_jingxuan spider starting

I wanted to change the output format to include the line number. The first thing I tried was logging.basicConfig:
import logging
import scrapy

class SmzdmSpider(scrapy.Spider):

    name = 'smzdm_jingxuan'
    allowed_domains = ['spider.smzdm']
    start_urls = ("http://books.toscrape.com/",)
    # logging.basicConfig(level=logging.INFO,
    #     format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s',
    #     datefmt='%Y-%m-%d %H:%M')

    logging.basicConfig(
        format='%(asctime)s,%(msecs)d %(levelname)-8s [%(pathname)s:%(lineno)d in function %(funcName)s] %(message)s',
        datefmt='%Y-%m-%d:%H:%M:%S',
        level=logging.INFO)

    logger = logging.getLogger(__name__)

With that, the console log became:

2020-05-28 22:53:25,197 - smzdmCrawler.spiders.smzdm_jingxuan - INFO - smzdm_jingxuan spider starting

A slight change, but still no line number, and the output does not match the configured format either. (I don't know why.)

Searching around, I found that settings.py accepts logging settings such as LOG_FILE, LOG_ENABLED and LOG_FORMAT, so I set:

IMAGES_STORE = '/Users/gaoxianghu/temp/image'

LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'

LOG_ENABLED = False

LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'

It turned out that LOG_ENABLED = False had no effect: logs were still written to both the console and the log file. The log file did switch to the LOG_FORMAT format, but the console output stayed the same. (I don't know why.)

As for LOG_ENABLED = False not taking effect, someone online suggested doing this instead:


logging.getLogger('scrapy').propagate = False
The format in the log file became:
[2020-05-28 22:16:12] p41288 {/Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py:38} INFO - smzdm_jingxuan spider starting
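Going back to the propagate suggestion above: a sketch of where such a line could live and what it actually does; I have not verified that it is the right fix for the LOG_ENABLED issue.

import logging

# Records emitted by the 'scrapy' logger (and its children such as
# scrapy.core.engine) stop propagating to the root logger's handlers,
# which effectively silences Scrapy's own output while leaving this
# project's loggers untouched.
logging.getLogger('scrapy').propagate = False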

Later I found that putting the logging config inside settings.py does change the console log:

settings.py:
logging.basicConfig(level=logging.DEBUG,
    format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s')

Console output:

2020-05-28 23:30:45,008 /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py smzdm_jingxuan.py parse 34 INFO - smzdm_jingxuan spider starting
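For reference, the logging-related pieces of my settings.py collected in one place; this is a sketch of my own setup (paths included), not a general recommendation:

# settings.py
import logging

LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
# Format used by Scrapy's own handler, i.e. what ends up in LOG_FILE:
LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'

# The basicConfig call here is what finally changed the console format for me;
# apparently settings.py is imported early enough for it to take effect.
logging.basicConfig(level=logging.DEBUG,
    format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s')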

4. When I used a relative import in a Scrapy spider, running scrapy crawl smzdm_jingxuan failed with: attempted relative import with no known parent package. The original absolute import, from smzdmCrawler.items import SmzdmItem, had worked fine. The relative import was:

from .. import items

The project layout:

smzdmCrawler
|--model
|--spiders
|--|--smzdm_jingxuan.py
|--items.py
|--__init__.py
|--main.py

Since there is already an __init__.py under smzdmCrawler, smzdmCrawler is a package. Online answers say that whether relative imports work is determined by __name__; in my case it came down to the directory scrapy crawl smzdm_jingxuan was run from. Running it from /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler failed; running it from /Users/gaoxianghu/git/cheap/smzdmCrawler worked.
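For context, the two import styles side by side as they appeared in the spider file (a sketch; only one of them would be used at a time):

# smzdmCrawler/spiders/smzdm_jingxuan.py

# Absolute import -- worked fine all along:
from smzdmCrawler.items import SmzdmItem

# Relative import -- only worked when "scrapy crawl" was run from
# /Users/gaoxianghu/git/cheap/smzdmCrawler (the outer project directory,
# presumably the one holding scrapy.cfg):
from .. import items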

5. When deploying with scrapyd, be aware that the program cannot read environment variables at that point. After scrapyd-deploy uploads the project, the code gets interpreted once, and if that pass fails because an environment variable cannot be read, deployment errors out. For example, with code like the following:

import os
from logging.handlers import TimedRotatingFileHandler

# today_str (today's date string) and LOG_FILE are defined elsewhere in
# settings.py; as noted below, LOG_FILE is only passed in production.
SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)

# Only production passes LOG_FILE here
if LOG_FILE:
    log_file = LOG_FILE
    image_file = '/data/image/' + today_str
else:
    if SCRAPY_ENV is None:
        log_file = '/Users/gaoxianghu/temp/scraping.log'
        image_file = '/Users/gaoxianghu/temp/image/' + today_str
    else:
        log_file = '/data/log/scrapy/scraping.log'
        image_file = '/data/image/' + today_str

logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)

Although I had set the SCRAPY_ENV environment variable on the server, it cannot be read here because the code is not running in a shell environment, so execution fell through to log_file = '/Users/gaoxianghu/temp/scraping.log', a path that does not exist on the server, and it errored. I then scheduled the spider with

curl http://david_scrapyd:david_2021@42.192.51.99:6801/schedule.json -d project=smzdmCrawler -d spider=smzdm_single -d setting=LOG_FILE=/data/log/scrapy/scraping.log

Even though LOG_FILE is passed as a setting here, so by the logic of the code log_file should not fall back to '/Users/gaoxianghu/temp/scraping.log', the same error still appeared, except that the log shows the code running from a temporary egg file. Why it still fails I'm not sure yet, since when running the service locally I verified that the LOG_FILE passed in is picked up. My guess is that because the deploy-time interpretation pass failed, the code still gets interpreted once before running, which is where the error comes from.

File "/tmp/smzdmCrawler-1614340245-de7610pr.egg/smzdmCrawler/settings.py"
FileNotFoundError: [Errno 2] No such file or directory: '/Users/gaoxianghu/temp/scraping.log'
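One way to keep that deploy-time import from blowing up is to make the path handling defensive. A hedged sketch (not what the project currently does):

import os
from logging.handlers import TimedRotatingFileHandler

# Create the log directory if it is missing, instead of letting the handler
# fail when settings.py is imported on a machine where the path does not exist.
os.makedirs(os.path.dirname(log_file), exist_ok=True)
logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)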

6. One more thing to note with scrapyd. According to https://github.com/scrapy/scrapyd-client#scrapyd-deploy:

You may want to keep certain settings local and not have them deployed to Scrapyd. To accomplish this you can create a local_settings.py file at the root of your project, where your scrapy.cfg file resides, and add the following to your project's settings:

try:
    from local_settings import *
except ImportError:
    pass
scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.

From my own testing: when scrapyd runs locally and the project is deployed to that local scrapyd, local_settings.py is still readable even though it does not end up inside the egg, which is a bit odd. Deployed to a remote scrapyd it is indeed inaccessible, so the statement above presumably refers to the remote case.
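For illustration, the kind of machine-specific values that might live in such a local_settings.py (hypothetical contents, reusing paths from this post):

# local_settings.py -- sits next to scrapy.cfg, pulled in by the try/except
# above, and never included in the egg that scrapyd-deploy builds.
LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
IMAGES_STORE = '/Users/gaoxianghu/temp/image'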

喜欢艺术的码农
Original post: https://www.cnblogs.com/zjhgx/p/12771271.html