Scrapy爬虫框架之爬取校花网图片

Scrapy

Scrapy是一个为了爬取网站数据，提取结构性数据而编写的应用框架。其可以应用在数据挖掘，信息处理或存储历史数据等一系列的程序中。
其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的，也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫。Scrapy用途广泛，可以用于数据挖掘、监测和自动化测试。

一、安装

注：windows平台需要依赖pywin32，请根据自己系统32/64位选择下载安装

https://sourceforge.net/projects/pywin32/files/pywin32/

linux:

yum install libxml2-devel libxslt-devel sqlite-devel

pip install lxml

pip install pyOpenSSL

pip install pysqlite

pip install Scrapy

二、基本使用

1、创建项目

运行命令:

scrapy startproject your_project_name

自动创建目录：

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py

文件说明：

scrapy.cfg 项目的配置信息，主要为Scrapy命令行工具提供一个基础的配置信息。（真正爬虫相关的配置信息在settings.py文件中）
items.py 设置数据存储模板，用于结构化数据，如：Django的Model
pipelines 数据处理行为，如：一般结构化的数据持久化
settings.py 配置文件，如：递归的层数、并发数，延迟下载等
spiders 爬虫目录，如：创建文件，编写爬虫规则

注意：一般创建爬虫文件时，以网站域名命名

2、编写爬虫

在spiders目录中新建 xiaohuar_spider.py 文件

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import scrapy
 
class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"
    allowed_domains = ["xiaohuar.com"]
    start_urls = [
        "http://www.xiaohuar.com/hua/",
    ]
 
    def parse(self, response):
        # print(response, type(response))
        # from scrapy.http.response.html import HtmlResponse
        # print(response.body_as_unicode())
 
        current_url = response.url
        body = response.body
        #unicode_body = response.body_as_unicode()
        print body  #爬取结果

3、运行

进入project_name目录，运行命令

scrapy crawl spider_name --nolog
 

4、递归的访问

以上的爬虫仅仅是爬去初始页，而我们爬虫是需要源源不断的执行下去，直到所有的网页被执行完毕

爬取页面中所有的图片

注：可以修改settings.py 中的配置文件，以此来指定“递归”的层数，如： DEPTH_LIMIT = 1

#!/usr/bin/env python
#encoding: utf-8
import scrapy
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
import re
import urllib
import os

class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"
allowed_domains = ["xiaohuar.com"]
    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
]

    def parse(self, response):
        # 分析页面
# 找到页面中符合规则的内容（校花图片），保存
# 找到所有的a标签，再访问其他a标签，一层一层的搞下去
hxs = HtmlXPathSelector(response)
        # 如果url是 http://www.xiaohuar.com/list-1-d+.html
if re.match('http://www.xiaohuar.com/list-1-d+.html', response.url):
            items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
            for i in range(len(items)):
                src = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/a/img/@src' % i).extract()
                name = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/span/text()' % i).extract()
                school = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/div[@class="btns"]/a/text()' % i).extract()
                if src:
                    ab_src = "http://www.xiaohuar.com" + src[0]
                    #file_name = "%s_%s.jpg" % (school[0].encode('utf-8'), name[0].encode('utf-8'))
                    #file_path = os.path.join("/Users/wupeiqi/PycharmProjects/beauty/pic", file_name)
file_name = '%d_pic.jpg'%i
                    urllib.urlretrieve(ab_src, file_name)

        # 获取所有的url，继续访问，并在其中寻找相同的url
all_urls = hxs.select('//a/@href').extract()
        for url in all_urls:
            if url.startswith('http://www.xiaohuar.com/list-1-'):
                yield Request(url, callback=self.parse)

开始爬取

#scrapy crawl xiaohuar --nolog

打包传到windows打开