爬虫

1、使用scrapy_redis分布式爬取全国658城市房源信息


 打开房天下网,进入全国658房页面:http://www.fang.com/SoufunFamily.htm

 

分析发现,所有省份/城市是存于一个table标签内,一个tr标签是上面中的一行数据。

 

我们要爬取的是新房及二手房信息,需要处理的:

 1、北京的新房/二手房,跟其他城市的新房/二手房不一样 ,因为北京被默认为首城市,在新房/二手房的url中是没有带城市拼音字段的,而其他城市的都有,如下:

  北京:二手房 → http://esf.fang.com/

       新房 → http://newhouse.fang.com/house/s/

  其他城市:如广州二手房:http://gz.esf.fang.com    (带有gz字样)

         新房:http://gz.newhouse.fang.com/house/s/

 2、在上述tr标签中,每个一行,在省份的位置数据是为空的,这个也需要处理

 3、我们从table中获取到的每个城市的url链接都类似这样:http://wuhu.fang.com/  ,我们需要跟据新房/二手房的url特征拼凑成新房/二手房的正确url

 4、最后一行省份署名为‘其它’,这是海外的房源信息,我们不爬取,在代码中也需要将其排除。具体实现如下:

import scrapy
import re


class FangtianxiaSpider(scrapy.Spider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    start_urls = ['http://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.css("#c02 > .table01 tr")
        province = None
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]   # 省份的标签
            province_text = province_td.xpath(".//text()").get("")  # 获取省份的内容
            province_text = re.sub(r"s","",province_text)  # 将空格替换成空
            if province_text:  # 如果为true,表示保存的是个省份
                province = province_text
            if province == '其它':  # 不爬取海外的房源
                continue
            city_td =  tds[1]
            city_links = city_td.xpath(".//a")  # 省份下面的所有a标签,存有多个市
            for city_link in city_links:
                city = city_link.xpath(".//text()").get('') # 城市名称
                city_url = city_link.xpath(".//@href").get('')  # 城市对应的链接 http://bj.fang.com/
                # 构建新房/二手房链接
                scheme,domains = city_url.split("//")
                domain = domains.split('.',1)[0]
                if "bj." in domains:  # 如果是北京的新房/二手房链接,直接用下面的url
                    newhouse_url = "http://newhouse.fang.com/house/s/"
                    esf_url = "http://esf.fang.com/"
                else:
                    # 构建新房的url链接 http://gz.newhouse.fang.com/house/s/
                    newhouse_url = scheme + '//' + domain + '.newhouse.fang.com/house/s/'
                    # 构建二手房的url链接  http://gz.esf.fang.com/
                    esf_url = scheme + '//' + domain + '.esf.fang.com/'

                    print("省份:",province)
                    print("城市:",city)
                    print("新房url:",newhouse_url)
                    print("二手房url:",esf_url)

 运行结果:

   →→→   


上面已经构建好新房/二手房url链接,但在实际操作中,北京二手房链接会重定向到你所在城市的二手房链接,如下。目前还未找到方法解决这个问题,只能暂时剔除掉北京二手房源数据的爬取

 

** 在开始新房/二手房房源数据分析及爬取前,需要先设置随机user-agent。实现代码如下:

1)安装:pip install fake-useragent

2)settings.py中配置:

DOWNLOADER_MIDDLEWARES = {
   # 'fantianxia.middlewares.FantianxiaDownloaderMiddleware': 543,
   'fantianxia.middlewares.RandomUserAgentMiddlware': 100,
  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # 将scrapy useragent设置为Nome,使用我们自定义的随机user-agent
}

RANDOM_UA_TYPE = "random"  # 用于download middleware中选择随机user-agent的方式

3)middlewares.py中新建类:RandomUserAgentMiddlware ,代码实现:

from fake_useragent import UserAgent   # 引入fake-useragent的UserAgent

class RandomUserAgentMiddlware(object):
    #随机更换user-agent
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()  # 实例化UserAgent
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")  # 从setting中获取useragent的类型(Firefox、Chrome、IE或random)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # 根据setting中获取的useragent类型,映射真正方法

        request.headers.setdefault('User-Agent', get_ua())  # 添加到headers中

** 新房房源数据爬取实现代码:

    def parse_newhouse(self,response):
        province,city = response.meta.get('info')
        try:  # 可能某些城市并没有新房或二手房的房源信息,这种pass不处理
            lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li[not(@id)]" and "//div[contains(@class,'nl_con')]/ul/li[not(@class)]")
            for li in lis:
                name = li.xpath(".//div[@class='nlcd_name']/a/text()").get()
                if not name:  # 如果匹配不到名称,说明是广告
                    continue
                name = name.strip()
                rooms = "/".join(li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall())
                area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
                area = re.sub(r"s|-|/","",area)
                address = li.xpath(".//div[@class='address']/a/@title").get().replace(",","-")
                district_text = "".join(li.xpath(".//div[@class='address']/a//text()").getall())
                try:
                    district = re.search(r".*[(.+)].*",district_text).group(1)
                except:
                    district = None
                sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()
                price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
                price = re.sub(r"s|广告",'',price)
                origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get("")
                url = response.urljoin(origin_url)
                img = li.xpath(".//div[@class='nlc_img']/a/img[2]/@src").get("")
                img = response.urljoin(img)
                # print(img)
                item = NewHouseItem(province=province,city=city,name=name,price=price,rooms=rooms,area=area,address=address,
                                    district=district,sale=sale,origin_url=url,img=img
                                    )
                yield item

            next_url = response.xpath("//a[@class='next']/@href").get()  # 下一页
            if next_url:
                yield scrapy.Request(url=response.urljoin(next_url),callback=self.parse_newhouse,meta={"info":(province,city)})
        except:
            pass

** 二手房房源数据爬取实现代码: 

    def parse_esf(self,response):
        province, city = response.meta.get('info')
        try:
            dls = response.xpath("//div[contains(@class,'shop_list')]/dl[@dataflag]")
            for dl in dls:
                info_list = []
                item = ESFHouseItem(province=province,city=city)
                item['name'] = dl.xpath(".//p[@class='add_shop']/a/@title").get()
                infos =dl.xpath(".//p[@class='tel_shop']/text()").getall()
                infos = list(map(lambda x:re.sub(r"s","",x),infos))
                item['rooms'] = None
                item['floor'] = None
                item['toward'] = None
                item['area'] = None
                item['year'] = None
                for info in infos:
                    if "" in info:
                        item['rooms'] = info
                    elif "" in info:
                        item['floor'] = info
                    elif "" in info:
                        item['toward'] = info
                    elif "" in info:
                        item['year'] = info
                    elif "" in info:
                        item['area'] = info

                item['address'] = dl.xpath(".//p[@class='add_shop']//span/text()").get().replace(",","-")
                item['price'] = "".join(dl.xpath(".//dd[@class='price_right']/span[1]//text()").getall())
                item['unit'] = "".join(dl.xpath(".//dd[@class='price_right']/span[2]//text()").getall())
                detail_url = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
                item['origin_url'] = response.urljoin(detail_url)
                yield item

            next_url = response.xpath("//div[@class='page_al']//p[1]/a/@href").get()
            yield scrapy.Request(url=response.urljoin(next_url),callback=self.parse_esf,meta={"info":(province,city)})
        except:
            pass 

item.py代码: 

class NewHouseItem(scrapy.Item):
    province = scrapy.Field()  # 省份
    city = scrapy.Field()     # 城市
    name = scrapy.Field()    # 小区名称
    price = scrapy.Field()   # 价格
    rooms = scrapy.Field()   # 几居,列表类型
    area = scrapy.Field()    # 面积
    address = scrapy.Field()  # 地址
    district = scrapy.Field()  # 行政区
    sale = scrapy.Field()    # 是否在售
    origin_url = scrapy.Field()   #房天下每个城市每个小区的详情页url
    img = scrapy.Field()  # 封面图


class ESFHouseItem(scrapy.Item):
    province = scrapy.Field()  # 省份
    city = scrapy.Field()     # 城市
    name = scrapy.Field()    # 小区名称
    rooms = scrapy.Field()   #几室几厅
    floor = scrapy.Field()   #
    toward = scrapy.Field()  # 朝向
    year = scrapy.Field()    # 年代
    address = scrapy.Field()  # 地址
    area = scrapy.Field()    # 建筑面积
    price = scrapy.Field()   # 总价
    unit = scrapy.Field()    # 单价
    origin_url = scrapy.Field()  # 原始url

pipelines.py实现数据保存到csv文件中:

import codecs

class FangPipeline(object):
    def __init__(self):
        self.newHouse_fp = codecs.open("新房源信息.csv",'w',encoding='utf-8')
        self.esfHouse_fp = codecs.open("二手房源信息.csv",'w',encoding='utf-8')
        self.newHouse_fp.write("省份,城市,小区,价格,几居,面积,地址,行政区,是否在售,origin_url,封面图url
")
        self.esfHouse_fp.write("省份,城市,小区,几室几厅,层,朝向,年代,地址,建筑面积,总价,单价,origin_url
")

    def process_item(self, item, spider):
        if 'floor' not in item:
            self.newHouse_fp.write("{},{},{},{},{},{},{},{},{},{},{}
".format(item['province'],item['city'],item['name'],item['price'],item['rooms'],
            item['area'],item['address'],item['district'],item['sale'],item['origin_url'],item['img'])
                               )
        else:
            self.esfHouse_fp.write("{},{},{},{},{},{},{},{},{},{},{},{}
".format(item['province'], item['city'], item['name'], item['rooms'], item['floor'],
                                   item['toward'], item['year'], item['address'], item['area'], item['price'],
                                   item['unit'],item['origin_url'])
                                   )
        return item

settings.py中配置pipeline: 

ITEM_PIPELINES = {
   'fantianxia.pipelines.FangPipeline': 100,
}

 fangtianxia.py完整代码:

import scrapy
import re
from items import NewHouseItem,ESFHouseItem


class FangtianxiaSpider(scrapy.Spider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    start_urls = ['http://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.css("#c02 > .table01 tr")
        province = None
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]   # 省份的标签
            province_text = province_td.xpath(".//text()").get("")  # 获取省份的内容
            province_text = re.sub(r"s","",province_text)  # 将空格替换成空
            if province_text:  # 如果为true,表示保存的是个省份
                province = province_text
            if province == '其它':  # 不爬取海外的房源
                continue
            city_td =  tds[1]
            city_links = city_td.xpath(".//a")  # 省份下面的所有a标签,存有多个市
            for city_link in city_links:
                city = city_link.xpath(".//text()").get('') # 城市名称
                city_url = city_link.xpath(".//@href").get('')  # 城市对应的链接 http://bj.fang.com/
                # 构建新房/二手房链接
                scheme,domains = city_url.split("//")
                domain = domains.split('.',1)[0]
                if "bj." in domains:  # 如果是北京的新房/二手房链接,直接用下面的url
                    newhouse_url = "http://newhouse.fang.com/house/s/"
                    esf_url = "http://esf.fang.com/"

                    # print("省份:", province)
                    # print("城市:", city)
                    # print("新房url:", newhouse_url)
                    # print("二手房url:", esf_url)

                else:
                    # 构建新房的url链接 http://gz.newhouse.fang.com/house/s/
                    newhouse_url = scheme + '//' + domain + '.newhouse.fang.com/house/s/'
                    # 构建二手房的url链接  http://gz.esf.fang.com/
                    esf_url = scheme + '//' + domain + '.esf.fang.com/'

                    # print("省份:",province)
                    # print("城市:",city)
                    # print("新房url:",newhouse_url)
                    # print("二手房url:",esf_url)
                yield scrapy.Request(url=newhouse_url,callback=self.parse_newhouse,meta={"info":(province,city)})

                if esf_url == 'http://esf.fang.com/':  # 北京url,重定向问题,未解决
                    continue
                yield scrapy.Request(url=esf_url,callback=self.parse_esf,meta={"info":(province,city)})
            #     break
            # break



    def parse_newhouse(self,response):
        province,city = response.meta.get('info')
        try:  # 可能某些城市并没有新房或二手房的房源信息,这种pass不处理
            lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li[not(@id)]" and "//div[contains(@class,'nl_con')]/ul/li[not(@class)]")
            for li in lis:
                name = li.xpath(".//div[@class='nlcd_name']/a/text()").get()
                if not name:  # 如果匹配不到名称,说明是广告
                    continue
                name = name.strip()
                rooms = "/".join(li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall())
                area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
                area = re.sub(r"s|-|/","",area)
                address = li.xpath(".//div[@class='address']/a/@title").get().replace(",","-")
                district_text = "".join(li.xpath(".//div[@class='address']/a//text()").getall())
                try:
                    district = re.search(r".*[(.+)].*",district_text).group(1)
                except:
                    district = None
                sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()
                price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
                price = re.sub(r"s|广告",'',price)
                origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get("")
                url = response.urljoin(origin_url)
                img = li.xpath(".//div[@class='nlc_img']/a/img[2]/@src").get("")
                img = response.urljoin(img)
                # print(img)
                item = NewHouseItem(province=province,city=city,name=name,price=price,rooms=rooms,area=area,address=address,
                                    district=district,sale=sale,origin_url=url,img=img
                                    )
                yield item

            next_url = response.xpath("//a[@class='next']/@href").get()  # 下一页
            if next_url:
                yield scrapy.Request(url=response.urljoin(next_url),callback=self.parse_newhouse,meta={"info":(province,city)})
        except:
            pass



    def parse_esf(self,response):
        province, city = response.meta.get('info')
        try:
            dls = response.xpath("//div[contains(@class,'shop_list')]/dl[@dataflag]")
            for dl in dls:
                info_list = []
                item = ESFHouseItem(province=province,city=city)
                item['name'] = dl.xpath(".//p[@class='add_shop']/a/@title").get()
                infos =dl.xpath(".//p[@class='tel_shop']/text()").getall()
                infos = list(map(lambda x:re.sub(r"s","",x),infos))
                item['rooms'] = None
                item['floor'] = None
                item['toward'] = None
                item['area'] = None
                item['year'] = None
                for info in infos:
                    if "" in info:
                        item['rooms'] = info
                    elif "" in info:
                        item['floor'] = info
                    elif "" in info:
                        item['toward'] = info
                    elif "" in info:
                        item['year'] = info
                    elif "" in info:
                        item['area'] = info

                item['address'] = dl.xpath(".//p[@class='add_shop']//span/text()").get().replace(",","-")
                item['price'] = "".join(dl.xpath(".//dd[@class='price_right']/span[1]//text()").getall())
                item['unit'] = "".join(dl.xpath(".//dd[@class='price_right']/span[2]//text()").getall())
                detail_url = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
                item['origin_url'] = response.urljoin(detail_url)
                yield item

            next_url = response.xpath("//div[@class='page_al']//p[1]/a/@href").get()
            yield scrapy.Request(url=response.urljoin(next_url),callback=self.parse_esf,meta={"info":(province,city)})
        except:
            pass
View Code

至此,简单版的房天下新房/二手房房源数据爬取及持久化便完成了。运行项目,打开csv文件,可以发现(中间手动停止运行):

 新房房源数据成功爬取3万多条数据,二手房房源数据成功爬取接近4万条数据


 ** 将我们上述项目改造成scrapy-redis分布式爬虫项目

1、安装:pip install scrapy-redis

2、fangtianxia.py代码改动:

from scrapy_redis.spiders import RedisSpider

class FangtianxiaSpider(RedisSpider):  # 继承RedisSpider
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    # start_urls = ['http://www.fang.com/SoufunFamily.htm']  # 注释掉
    redis_key = "fang:start_url"  # 从redis中推入

3、settings.py配置: 

# 使用scrapy-redis里的去重组件,不使用scrapy默认的去重方式
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 使用scrapy-redis里的调度器组件,不使用默认的调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 允许暂停,redis请求记录不丢失
SCHEDULER_PERSIST = True
REDIS_HOST = "192.168.1.145"
# 指定数据库的端口号
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': 'nan****',
}

ITEM_PIPELINES = {
   # 'fantianxia.pipelines.FangPipeline': 100,
   'scrapy_redis.pipelines.RedisPipeline': 110,
}

 4、进入我们项目的虚拟环境中,再进入爬虫所在的路径,需要注意的是:

 1)爬虫的执行文件命名为与项目名称不同的名字,本项目重命名为sfw.py

 2)从终端进入爬虫所在路径运行爬虫时,如果报错找不到某些模块,则需要添加模块路径到我们的执行文件中:

# sfw.py文件下
import os,sys 
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0,BASE_DIR)
from items import NewHouseItem,ESFHouseItem  # 报错:items模块未找到

 路径:

 

上述问题解决后,我们开始运行爬虫文件:

scrapy runspider sfw.py

看到爬虫项目运行起来,并在等待我们的start_url输入时,再在redis-cli客户端推入我们的start_url:

lpush fang:start_url http://www.fang.com/SoufunFamily.htm

至此,scrapy-redis分布式爬虫项目,我们就算完成了。 

 查看运行结果:

 共爬取了2828页,77768个item数据

 从csv数据得知,共爬取二手房数据接近60万数据,共爬取新房数据30来万条数据

 


我们将项目打包,在linux/Ubuntu系统中构建好环境后,使用上面的方法,运行爬虫执行文件,就可以实现多个系统分布式爬取数据了。

切记运行爬虫执行文件前,需要启动redis服务器



原文地址:https://www.cnblogs.com/Eric15/p/10072033.html