A First Try at Scrapy: Crawling Images from topit.me, with a Quick GitHub Upload Guide

After getting started with Scrapy, I found that crawling became much more efficient. Borrowing from articles by more experienced developers, I put together a small spider for practice.

My environment: Ubuntu 14.04 + Python 2.7 + Scrapy 0.24

Target: topit.me

1. Create the project

scrapy startproject topit
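
If the command succeeds, it generates a project skeleton roughly like the following (the exact layout can vary slightly between Scrapy versions):

topit/
    scrapy.cfg
    topit/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py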

2. Define the Item

import scrapy


class TopitItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
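
This is not part of the project files, just a quick sketch showing that a TopitItem behaves like a dict; the image URL here is a made-up placeholder:

from topit.items import TopitItem

item = TopitItem()
item['url'] = 'http://example.com/sample.jpg'  # placeholder URL, not a real topit.me address
print item['url']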

3. Create topit_spider.py in the spiders folder

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File name: topit_spider.py
# Author: Mellcap


from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from topit.items import TopitItem
from scrapy.selector import Selector


class TopitSpider(CrawlSpider):
    name = "topit"
    allowed_domains = ["topit.me"]
    start_urls=["http://www.topit.me/"]
    rules = (Rule(SgmlLinkExtractor(allow=(r'/item/\d+',)), callback='parse_img', follow=True),)

    def parse_img(self, response):
        sel = Selector(response)
        # every image on an item page sits inside an <a rel="lightbox"> link
        for link in sel.xpath('//a[@rel="lightbox"]'):
            urlItem = TopitItem()
            urlItem['url'] = link.xpath('.//img/@src').extract()[0]
            yield urlItem
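
Before launching the full crawl, the XPath can be tried out in the Scrapy shell. The item URL below is hypothetical; replace it with any real item page from topit.me. In Scrapy 0.24 the shell exposes a ready-made selector named sel:

scrapy shell "http://www.topit.me/item/12345678"

>>> sel.xpath('//a[@rel="lightbox"]//img/@src').extract()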

4. Define the pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html



from topit.items import TopitItem

class TopitPipeline(object):
    def __init__(self):
        self.mfile = open('test_topit.html', 'w')

    def process_item(self, item, spider):
        # write one <img> tag per scraped URL so the result can be viewed in a browser
        text = '<img src="' + item['url'] + '" alt="" />\n'
        self.mfile.write(text)
        return item

    def close_spider(self, spider):
        self.mfile.close()
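
To check what the pipeline writes without running a crawl, here is a minimal sketch that feeds it a single hand-made item (run it from the project root so the topit package is importable; the image URL is a placeholder):

from topit.items import TopitItem
from topit.pipelines import TopitPipeline

pipeline = TopitPipeline()
item = TopitItem()
item['url'] = 'http://example.com/sample.jpg'  # placeholder, not a real image address
pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)
# note: this overwrites any existing test_topit.html; it should now contain one <img> tag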

5. Configure settings.py

Add the following line at the end (the value is the pipeline's order, between 0 and 1000; lower values run earlier):

ITEM_PIPELINES = {'topit.pipelines.TopitPipeline': 1}

Save the file and you're done.

Then open a terminal and run:

cd topit
scrapy crawl topit

You will then find a test_topit.html file in the topit folder.

Open it in a browser and you can see the images.
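
If you would rather open it from Python than from a file manager, a one-off sketch like this also works:

import os
import webbrowser

webbrowser.open('file://' + os.path.abspath('test_topit.html'))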

Next, push the project to GitHub:

The spider is done; all that remains is to set up an empty repository on the remote side.

1. Create an empty repository named scrapy_topit on GitHub

2. Create a local repository

theone@Mellcap:~$ cd topit
theone@Mellcap:~/topit$ git init
Initialized empty Git repository in /home/theone/topit/.git/
theone@Mellcap:~/topit$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    scrapy.cfg
    test_topit.html
    topit/

nothing added to commit but untracked files present (use "git add" to track)
theone@Mellcap:~/topit$ git add scrapy.cfg
theone@Mellcap:~/topit$ git add topit/
theone@Mellcap:~/topit$ git commit -m'scrapy_topit'
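
Before committing, it can also help to add a .gitignore so that generated files never show up as untracked (only scrapy.cfg and topit/ are added above, so this just keeps git status clean); a minimal example for this project:

*.pyc
test_topit.html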

3. Connect to the remote repository

theone@Mellcap:~/topit$ git remote add origin git@github.com:Mellcap/scrapy_topit.git
theone@Mellcap:~/topit$ git push -u origin master

4. Done

You can now see your spider on GitHub.

Original article: https://www.cnblogs.com/Mellcap/p/4442960.html