Using pymongo, plus a case study: crawling Tencent job postings

I. Working with MongoDB in Python 3

  1. Connection prerequisites

  • Have the pymongo library installed
  • Start the MongoDB server (if it was started in the foreground, keep that window open; closing it shuts the server down as well). A quick connectivity check is sketched below.
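
Once both are in place, a quick way to confirm that the server is actually reachable is to send a ping command to the admin database. A minimal sketch, assuming MongoDB is listening on the default 127.0.0.1:27017:

import pymongo
from pymongo.errors import ServerSelectionTimeoutError

client = pymongo.MongoClient(host='127.0.0.1', port=27017, serverSelectionTimeoutMS=2000)
try:
    client.admin.command('ping')  # raises if no server answers within the timeout
    print('MongoDB server is up')
except ServerSelectionTimeoutError as e:
    print('cannot reach MongoDB:', e)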

  2. Usage

import pymongo
# To connect to MongoDB we use MongoClient; normally passing MongoDB's host and port is enough.
# The first parameter is host, the second is port (the default port is 27017).
client=pymongo.MongoClient(host='127.0.0.1',port=27017)
# This gives us a client object.
# Alternatively, the first argument host can be a MongoDB connection string starting with mongodb://,
# e.g. client = MongoClient('mongodb://localhost:27017/') establishes the same connection.
# print(client)

################### Specify the database
db=client.test
# This can also be written as:
# db=client['test']


################## Specify the collection
collections=db.student
# This can also be written as:
# collections=db['student']

################### Insert data
# student={
#     'id':'1111',
#     'name':'xiaowang',
#     'age':20,
#     'sex':'boy',
# }
#
# res=collections.insert(student)
# print(res)
# In MongoDB every document has an _id field that uniquely identifies it.
# If _id is not given explicitly, MongoDB generates an ObjectId-type _id automatically.
# The value returned by insert is exactly that _id, e.g. 5c7fb5ae35573f14b85101c0


# Multiple documents can also be inserted at once
# student1={
#     'name':'xx',
#     'age':20,
#     'sex':'boy'
# }
#
# student2={
#     'name':'ww',
#     'age':21,
#     'sex':'girl'
# }
# student3={
#     'name':'xxx',
#     'age':22,
#     'sex':'boy'
# }
#
# result=collections.insert_many([student1,student2,student3])
# print(result)
# Here the return value is not an _id but an InsertManyResult object;
# the generated _ids can be read from its inserted_ids attribute.

# There are two recommended insert methods:
# insert_one for a single document and insert_many for several documents passed in as a list.
# The legacy insert() also accepts either a single dict or a list of dicts, but it is deprecated (see the sketch below).
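
# A sketch of the two recommended methods (the demo_* documents below are made up for
# illustration): insert_one returns an InsertOneResult whose inserted_id holds the new _id,
# while insert_many returns an InsertManyResult whose inserted_ids holds them as a list.
# res_one = collections.insert_one({'name': 'demo_one', 'age': 18, 'sex': 'boy'})
# print(res_one.inserted_id)
#
# res_many = collections.insert_many([
#     {'name': 'demo_a', 'age': 19, 'sex': 'girl'},
#     {'name': 'demo_b', 'age': 23, 'sex': 'boy'},
# ])
# print(res_many.inserted_ids)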


################### Find: a single document
# re=collections.find_one({'name':'xx'})
# print(re)
# print(type(re))
#{'_id': ObjectId('5c7fb8d535573f13f85a6933'), 'name': 'xx', 'age': 20, 'sex': 'boy'}
# <class 'dict'>


##################### Find: multiple documents
# re=collections.find({'name':'xx'})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
# The result is a Cursor object (an iterable); looping over it yields each matching document:
# <pymongo.cursor.Cursor object at 0x000000000A98E630>
# <class 'pymongo.cursor.Cursor'>


# re=collections.find({'age':{'$gt':20}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
# Here the query value is no longer a plain number but a dict whose key is the comparison operator
# $gt (greater than) and whose value is 20; this returns every document whose age is greater than 20.

# The comparison operators are summarized in the table below:
"""
Symbol   Meaning                       Example
$lt      less than                     {'age': {'$lt': 20}}
$gt      greater than                  {'age': {'$gt': 20}}
$lte     less than or equal to         {'age': {'$lte': 20}}
$gte     greater than or equal to      {'age': {'$gte': 20}}
$ne      not equal to                  {'age': {'$ne': 20}}
$in      within the given values       {'age': {'$in': [20, 23]}}
$nin     not within the given values   {'age': {'$nin': [20, 23]}}
"""

# Find with a regular expression
# re = collections.find({'name': {'$regex': '^x.*'}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)

# Some further query operators are summarized below:
"""
Symbol   Meaning                      Example                                              Example meaning
$regex   regular-expression match     {'name': {'$regex': '^M.*'}}                         name starts with M
$exists  field existence check        {'name': {'$exists': True}}                          the name field exists
$type    type check                   {'age': {'$type': 'int'}}                            age is of type int
$mod     modulo operation             {'age': {'$mod': [5, 0]}}                            age mod 5 equals 0
$text    text search                  {'$text': {'$search': 'Mike'}}                       a text-indexed field contains the string Mike
$where   advanced (JavaScript) query  {'$where': 'obj.fans_count == obj.follows_count'}    the document's fans_count equals its follows_count
"""

################ Count
# count=collections.find({'age':{'$gt':20}}).count()
# print(count)
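
# Note: Cursor.count() is deprecated in newer PyMongo releases and removed in PyMongo 4;
# on current versions the equivalent query is count_documents on the collection:
count = collections.count_documents({'age': {'$gt': 20}})
print(count)
print(collections.count_documents({}))  # total number of documents in the collection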


################# Sort
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING)
# print([re['name'] for re in result])


########### Offset: sometimes only a slice of the results is wanted; skip() offsets the cursor by the given number of positions, e.g. skip(2) ignores the first 2 results and returns the third onward.
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING).skip(1)
# print([re['name'] for re in result])


################## limit() can additionally cap the number of results returned, for example:
# results = collections.find().sort('age', pymongo.ASCENDING).skip(1).limit(2)
# print([result['name'] for result in results])

# Note that when the collection is very large (tens of millions or billions of documents), do not page through it
# with big offsets, which can easily blow up memory. Instead remember the _id from the previous query and use
# something like find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}}).
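
# A sketch of that _id-based paging pattern: remember the largest _id seen so far and only
# ask for documents beyond it (PAGE_SIZE is an arbitrary choice; if the remembered _id were
# stored as a hex string, wrap it with bson.objectid.ObjectId before comparing).
PAGE_SIZE = 100
last_id = None  # _id of the last document from the previous page, if any
page_query = {} if last_id is None else {'_id': {'$gt': last_id}}
for doc in collections.find(page_query).sort('_id', pymongo.ASCENDING).limit(PAGE_SIZE):
    last_id = doc['_id']  # remember where this page ended, for the next query
    print(doc)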


################################ Update data
# Data can be updated with the update method
# condition={'name':'xx'}
# student=collections.find_one(condition)
# student['age']=100
# result=collections.update(condition,student)
# print(result)

# Here we update the age of the document whose name is xx: first define the query condition, fetch the document,
# modify its age, then call update with the original condition and the modified document to apply the change.
# {'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}
# The result is a dict: ok means the call succeeded and nModified is the number of documents modified.

# update() itself is another method the official driver no longer recommends; there are also update_one() and
# update_many(), which are stricter: the second argument must use a $ operator as the dict key. For example:

# condition={'name':'xx'}
# student=collections.find_one(condition)
# print(student)
# student['age']=112
# result=collections.update_one(condition,{'$set':student})
# print(result)
# print(result.matched_count,result.modified_count)

# Another example:
# condition={'age':{'$gt':20}}
# result=collections.update_one(condition,{'$inc':{'age':1}})
# print(result)
# print(result.matched_count,result.modified_count)
# Here the query condition is age greater than 20 and the update is {'$inc': {'age': 1}},
# so the first document that matches has its age increased by 1.
# <pymongo.results.UpdateResult object at 0x000000000A99AB48>
# 1 1

# Calling update_many() instead updates every document that matches the condition:

condition = {'age': {'$gt': 20}}
result = collections.update_many(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)
# This time more than one document is matched; the output looks like:

# <pymongo.results.UpdateResult object at 0x10c6384c8>
# 3 3
# As you can see, every matched document has been updated.


# ############### Delete
# Deleting is straightforward: call remove() with the condition and every matching document is removed:

# result = collections.remove({'name': 'Kevin'})
# print(result)
# Output:

# {'ok': 1, 'n': 1}
# As with insert and update, there are two newer, recommended methods, delete_one() and delete_many():

# result = collections.delete_one({'name': 'Kevin'})
# print(result)
# print(result.deleted_count)
# result = collections.delete_many({'age': {'$lt': 25}})
# print(result.deleted_count)
# Output:

# <pymongo.results.DeleteResult object at 0x10e6ba4c8>
# 1
# 4
# delete_one() removes the first document that matches the condition, while delete_many() removes all matches.
# Both return a DeleteResult, whose deleted_count attribute gives the number of documents deleted.


# More
# PyMongo also provides combined methods such as find_one_and_delete(), find_one_and_replace() and
# find_one_and_update(), i.e. find-then-delete/replace/update; their usage mirrors the methods above.
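
# A sketch of find_one_and_update and find_one_and_delete (ReturnDocument.AFTER asks for
# the post-update version of the document instead of the pre-update one):
# from pymongo import ReturnDocument
#
# updated = collections.find_one_and_update(
#     {'name': 'xx'},                  # filter
#     {'$inc': {'age': 1}},            # update to apply
#     return_document=ReturnDocument.AFTER,
# )
# print(updated)
#
# # find_one_and_delete removes the matched document and returns it
# removed = collections.find_one_and_delete({'name': 'ww'})
# print(removed)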

II. Crawling Tencent job postings

  The spider

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['www.xxx.com']
    # Base URL used to build the paginated request URLs
    base_url = 'http://hr.tencent.com/position.php?&start='
    page_num = 0
    start_urls = [base_url + str(page_num)]

    def parse(self, response):
        tr_list = response.xpath("//tr[@class='even' ] | //tr[@class='odd']")
        # First grab the list of <tr> tags that hold the job rows, then loop over them
        for tr in tr_list:
            name = tr.xpath('./td[1]/a/text()').extract_first()
            url = tr.xpath('./td[1]/a/@href').extract_first()
            # The job category cell is sometimes empty, which would make extract()[0] raise;
            # in that case a placeholder value can be assigned instead:
            # if len(tr.xpath("./td[2]/text()")):
            #    worktype = tr.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            # else:
            #     worktype = "NULL"
            # If that is not a concern, extract_first() is enough (it returns None when nothing matches):
            worktype = tr.xpath('./td[2]/text()').extract_first()
            num = tr.xpath('./td[3]/text()').extract_first()
            location = tr.xpath('./td[4]/text()').extract_first()
            publish_time = tr.xpath('./td[5]/text()').extract_first()

            item = TencentItem()
            item['name'] = name
            item['worktype'] = worktype
            item['url'] = url
            item['num'] = num
            item['location'] = location
            item['publish_time'] = publish_time
            print('----', name)
            print('----', url)
            print('----', worktype)
            print('----', location)
            print('----', num)
            print('----', publish_time)

            yield item

        # Pagination, method one:
        # used when the page numbers are known in advance; suitable when there is no
        # "next page" link and the URL can only be built by concatenation.
        # if self.page_num<3060:
        #     self.page_num+=10
        #     url=self.base_url+str(self.page_num)
        #     # yield  scrapy.Request(url=url,callback=self.parse)
        #     yield  scrapy.Request(url, callback=self.parse)

        # Method two:
        # extract the "next page" link directly.
        # If the selector below matches nothing (length 0), this is not the last page, so keep going;
        # the next page's URL is built by joining the extracted href with the site root.
        if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
            next_url = response.xpath('//a[@id="next"]/@href').extract_first()
            url = 'https://hr.tencent.com/' + next_url
            yield scrapy.Request(url=url, callback=self.parse)

  pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
import json
from redis import Redis
import pymongo
# Store items in a local text file
class TencentPipeline(object):
    f=None
    def open_spider(self,spider):
        self.f=open('./tencent2.txt','w',encoding='utf-8')
    def process_item(self, item, spider):
        self.f.write(item['name']+':'+item['url']+':'+item['num']+':'+item['worktype']+':'+item['location']+':'+item['publish_time']+'\n')
        return item
    def close_spider(self,spider):
        self.f.close()
# Store items in MySQL
class TencentPipelineMysql(object):

    conn=None
    cursor=None
    def open_spider(self,spider):
        self.conn=pymysql.connect(host='127.0.0.1',port=3306,user='root',password='123',db='tencent')
    def process_item(self,item,spider):
        print('MySQL pipeline received an item')
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into tencent values("%s","%s","%s","%s","%s","%s")'%(item['name'],item['worktype'],item['url'],item['num'],item['publish_time'],item['location']))
            self.conn.commit()
        except Exception as e:
            print('insert error:', e)
            self.conn.rollback()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
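
# A hedged sketch of a safer variant of the insert in TencentPipelineMysql.process_item:
# let pymysql substitute the parameters instead of building the SQL string with % formatting,
# which avoids quoting problems and SQL injection. It assumes the same six-column tencent table
# and would replace the cursor.execute(...) call above:
#
#     sql = 'insert into tencent values(%s, %s, %s, %s, %s, %s)'
#     params = (item['name'], item['worktype'], item['url'],
#               item['num'], item['publish_time'], item['location'])
#     self.cursor.execute(sql, params)
#     self.conn.commit()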


# Store items in Redis
class TencentPipelineRedis(object):
    conn=None
    def open_spider(self,spider):
        self.conn=Redis(host='127.0.0.1',port=6379)

    def process_item(self,item,spider):
        item_dic=dict(item)
        item_json=json.dumps(item_dic)
        self.conn.lpush('tencent',item_json)
        return item

# Store items in MongoDB
class TencentPipelineMongo(object):
    client=None
    def open_spider(self,spider):
        self.client=pymongo.MongoClient(host='127.0.0.1',port=27017)
        self.db=self.client['test']

    def process_item(self,item,spider):
        collection = self.db['tencent']
        item_dic=dict(item)
        collection.insert_one(item_dic)  # insert_one instead of the deprecated insert()

        return item

    def close_spider(self,spider):
        self.client.close()

  settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentPipelineMysql': 301,
    'Tencent.pipelines.TencentPipelineRedis': 302,
    'Tencent.pipelines.TencentPipelineMongo': 303,

}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name=scrapy.Field()
    url=scrapy.Field()
    worktype=scrapy.Field()
    location=scrapy.Field()
    num=scrapy.Field()
    publish_time=scrapy.Field()


Original post: https://www.cnblogs.com/tjp40922/p/10486317.html