Basic MongoDB operations in Python with pymongo, plus exporting to CSV files

1. Environment

Python: 3.6.1
Python IDE: PyCharm
OS: Windows 7

2. A simple example

import pymongo

# Address and port of the MongoDB server
mongo_url = "127.0.0.1:27017"

# Connect to MongoDB; with no argument the default is "localhost:27017"
client = pymongo.MongoClient(mongo_url)

# Select the database myDatabase
DATABASE = "myDatabase"
db = client[DATABASE]

# Select the collection (table) myDatabase.myCollection
COLLECTION = "myCollection"
db_coll = db[COLLECTION]

# Find the records in myCollection whose date field equals 2017-08-29 and sort them by age, largest first
queryArgs = {'date':'2017-08-29'}
search_res = db_coll.find(queryArgs).sort('age',-1)
for record in search_res:
    print(f"_id = {record['_id']}, name = {record['name']}, age = {record['age']}")
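3. Key point

For read operations such as computing statistics, use multiple threads where possible to save time. Keep an eye on the number of threads, though, because each one adds to memory usage.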
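The advice above can be sketched with a bounded thread pool. This is a minimal illustration under stated assumptions: `count_for_key` is a hypothetical stand-in that counts matches in an in-memory list, where real code would run a MongoDB query per key, and `max_workers` is the knob that caps the thread count and therefore the memory footprint.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in data; in real code each worker would query MongoDB instead.
RECORDS = [
    {'key': 'speakers', 'sales': 120},
    {'key': 'speakers', 'sales': 80},
    {'key': 'radios', 'sales': 50},
]

def count_for_key(key):
    """Hypothetical per-key statistic (replace the body with a MongoDB query)."""
    return key, sum(1 for record in RECORDS if record['key'] == key)

keys = ['speakers', 'radios', 'stereos']

# max_workers bounds the number of threads, and with it the memory used
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = dict(pool.map(count_for_key, keys))

print(counts)
```

A single `MongoClient` is thread-safe, so the workers can share one client rather than each opening their own connection.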
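4. MongoDB data types

MongoDB supports many data types:

String - the most commonly used type for storing data. Strings in MongoDB must be valid UTF-8.
Integer - stores numeric values. Integers can be 32-bit or 64-bit, depending on the server.
Boolean - stores a boolean (true / false) value.
Double - stores floating-point values.
Min/Max keys - compare a value against the lowest and highest BSON elements.
Array - stores an array, a list, or multiple values under one key.
Timestamp - ctimestamp; handy for recording when a document was modified or added.
Object - used for embedded documents.
Null - stores a Null value.
Symbol - used like a string, but generally reserved for languages that have a specific symbol type.
Date - stores the current date or time in UNIX time format. You can specify your own date and time by creating a Date object and passing the day, month, and year into it.
Object ID - stores the document's ID.
Binary data - stores binary data.
Code - stores JavaScript code in a document.
Regular expression - stores a regular expression.
Unsupported data types:

Python's set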
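Because pymongo refuses to encode a Python set, one workaround is to convert sets to lists before inserting. The helper below is a sketch, not part of pymongo: `to_bson_safe` is a hypothetical name, and sorting by `repr` is only an assumption made here to get a stable order.

```python
def to_bson_safe(value):
    """Recursively replace sets with lists so the document can be encoded."""
    if isinstance(value, (set, frozenset)):
        return [to_bson_safe(item) for item in sorted(value, key=repr)]
    if isinstance(value, dict):
        return {key: to_bson_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_bson_safe(item) for item in value]
    return value

record = {'_id': 'B01EYCLJ04', 'tags': {'audio', 'pro'}, 'meta': {'colors': {'Yellow'}}}
safe_record = to_bson_safe(record)
print(safe_record)
```

The converted record can then be passed to `insert_one` like any other dictionary.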
5. Operations on a table (collection)

import pymongo

# Address and port of the MongoDB server
mongo_url = "127.0.0.1:27017"

# Connect to MongoDB; with no argument the default is "localhost:27017"
client = pymongo.MongoClient(mongo_url)
# Select the database amazon
DATABASE = "amazon"
db = client[DATABASE]

# Select the collection (table) amazon.galance20170801
COLLECTION = "galance20170801"
db_coll = db[COLLECTION]
5.1. Finding records: find

(5.1.1) Choosing which fields to return
# Example 1: all fields
# select * from galance20170801
searchRes = db_coll.find()
# or: searchRes = db_coll.find({})

# Example 2: use a dictionary to name the fields to show
# select _id,key from galance20170801
queryArgs = {}
projectionFields = {'_id':True, 'key':True} # specified with a dictionary
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01EYCLJ04', 'key': 'pro audio'}

# Example 3: use a dictionary to name the fields to leave out
queryArgs = {}
projectionFields = {'_id':False, 'key':False} # specified with a dictionary
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'activity': False, 'avgStar': 4.3, 'color': 'Yellow & Black', 'date': '2017-08-01'}

# Example 4: use a list to name the fields to show
# select _id,key,date from galance20170801
queryArgs = {}
projectionFields = ['key','date'] # specified with a list; _id is always returned as well
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio'}
(5.1.2) Specifying query conditions
(5.1.2.1). Comparison: =, !=, >, <, >=, <=
$ne: not equal
$gt: greater than
$lt: less than
$lte: less than or equal
$gte: greater than or equal

# Example 1: equal
# select _id,key,sales,date from galance20170801 where key = 'TV & Video'
queryArgs = {'key':'TV & Video'}
projectionFields = ['key','sales','date']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': '0750699973', 'date': '2017-08-01', 'key': 'TV & Video', 'sales': 0}

# Example 2: not equal
# select _id,key,sales,date from galance20170801 where sales != 0
queryArgs = {'sales':{'$ne':0}}
projectionFields = ['key','sales','date']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01M996469', 'date': '2017-08-01', 'key': 'stereos', 'sales': 2}

# Example 3: greater than
# where sales > 100
queryArgs = {'sales':{'$gt':100}}
# Result: {'_id': 'B010OYASRG', 'date': '2017-08-01', 'key': 'Sound Bar', 'sales': 124}

# Example 4: less than
# where sales < 100
queryArgs = {'sales':{'$lt':100}}
# Result: {'_id': 'B011798DKQ', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 5: a range
# where sales > 50 and sales < 100
queryArgs = {'sales':{'$gt':50, '$lt':100}}
# Result: {'_id': 'B008D2IHES', 'date': '2017-08-01', 'key': 'Sound Bar', 'sales': 66}

# Example 6: a range with greater-or-equal and less-or-equal
# where sales >= 50 and sales <= 100
queryArgs = {'sales':{'$gte':50, '$lte':100}}
# Result: {'_id': 'B01M6DHW26', 'date': '2017-08-01', 'key': 'radios', 'sales': 50}
(5.1.2.2). and
# Example 1: different fields, conditions combined with and
# where date = '2017-08-01' and sales = 100
queryArgs = {'date':'2017-08-01', 'sales':100}
# Result: {'_id': 'B01BW2YYYC', 'date': '2017-08-01', 'key': 'Video', 'sales': 100}

# Example 2: the same field, conditions combined with and
# where sales >= 50 and sales <= 100
# Correct: put both operators in one sub-dictionary
queryArgs = {'sales':{'$gte':50, '$lte':100}}
# Wrong: queryArgs = {'sales':{'$gte':50}, 'sales':{'$lte':100}}
# (a Python dict cannot hold the key 'sales' twice, so only the last condition survives)
# Result: {'_id': 'B01M6DHW26', 'date': '2017-08-01', 'key': 'radios', 'sales': 50}
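Why the "wrong" form of the same-field condition fails can be checked in plain Python, without a database. The snippet below is a small sketch; the `$and` form is offered as an alternative spelling and is not used in the original post.

```python
# A dict literal keeps only the last value for a repeated key
wrong = {'sales': {'$gte': 50}, 'sales': {'$lte': 100}}
print(wrong)

# Option 1: both operators in one sub-dictionary
combined = {'sales': {'$gte': 50, '$lte': 100}}

# Option 2: an explicit $and list, one condition per element
explicit = {'$and': [{'sales': {'$gte': 50}}, {'sales': {'$lte': 100}}]}
```

Either `combined` or `explicit` can be passed to `find`; the first is shorter, while the second also works when the two conditions cannot share one sub-dictionary.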
(5.1.2.3). or
# Example 1: different fields, conditions combined with or
# where date = '2017-08-01' or sales = 100
queryArgs = {'$or':[{'date':'2017-08-01'}, {'sales':100}]}
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 2: the same field, conditions combined with or
# where sales = 100 or sales = 120
queryArgs = {'$or':[{'sales':100}, {'sales':120}]}
# Result:
# {'_id': 'B00X5RV14Y', 'date': '2017-08-01', 'key': 'Chargers', 'sales': 120}
# {'_id': 'B0728GGX6Y', 'date': '2017-08-01', 'key': 'Glasses', 'sales': 100}
(5.1.2.4). in, not in, all
# Example 1: in
# where sales in (100,120)
# The value of $in may be a tuple or a list; pymongo encodes both as a BSON array.
# If you see TypeError: unhashable type: 'list', the list has ended up as a dict key
# or a set member somewhere in your own code; {'$in': [100,120]} itself is valid.
queryArgs = {'sales':{'$in': (100,120)}}
# Result:
# {'_id': 'B00X5RV14Y', 'date': '2017-08-01', 'key': 'Chargers', 'sales': 120}
# {'_id': 'B0728GGX6Y', 'date': '2017-08-01', 'key': 'Glasses', 'sales': 100}
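The `TypeError: unhashable type: 'list'` mentioned for `$in` is raised by Python itself, not by the query operator. The sketch below shows one plausible way to trigger it, a comma typed where a colon was meant; this is a guess at the cause, not something the original post confirms.

```python
sales_values = [100, 120]

# A list works as the value of $in: this is an ordinary dict with a list inside
query_with_list = {'sales': {'$in': sales_values}}

# A comma instead of a colon turns the braces into a set literal, and a list
# cannot be a set member, so Python raises TypeError before pymongo is involved
try:
    broken = {'$in', sales_values}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the error appears in real code, checking the traceback for a set literal or a list used as a dictionary key is a reasonable first step.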
# Example 2: not in
# where sales not in (100,120)
queryArgs = {'sales':{'$nin':(100,120)}}
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 3: match every value in the condition: all
# where sales = 100 and sales = 120
queryArgs = {'sales':{'$all':[100,120]}} # every value must match at once
# Result: no results

# Example 4: match every value in the condition: all
# where sales = 100 and sales = 100
queryArgs = {'sales':{'$all':[100,100]}} # every value must match at once
# Result: {'_id': 'B01BW2YYYC', 'date': '2017-08-01', 'key': 'Video', 'sales': 100}
(5.1.2.5). Whether a field exists
# Example 1: the field does not exist
# where rank2 is null
# Comparing with None matches documents where rank2 is absent or explicitly null
queryArgs = {'rank2':None}
projectionFields = ['key','sales','date', 'rank2']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B00ACOKQTY', 'date': '2017-08-01', 'key': '3D TVs', 'sales': 0}

# The corresponding command in the mongo shell, using $exists
db.categoryAsinSrc.find({'isClawered': true, 'avgCost': {$exists: false}})

# Example 2: the field exists
# where rank2 is not null
queryArgs = {'rank2':{'$ne':None}}
projectionFields = ['key','sales','date','rank2']
searchRes = db_coll.find(queryArgs, projection = projectionFields).limit(100)
# Result: {'_id': 'B014I8SX4Y', 'date': '2017-08-01', 'key': '3D TVs', 'rank2': 4.0, 'sales': 0}
(5.1.2.6). Regular expression matching: $regex (SQL: like)
# Example 1: the field key contains the substring audio
# where key like "%audio%"
queryArgs = {'key':{'$regex':'.*audio.*'}}
# Result: {'_id': 'B01M19FGTZ', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 1}
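When the substring comes from user input or contains characters such as `.` or `(`, it helps to escape it before handing it to `$regex`. The helper below is a sketch under assumptions: `contains_query` is a hypothetical name, and the escaping relies on Python's `re.escape` producing a pattern the server's regex engine also accepts, which holds for ordinary punctuation.

```python
import re

def contains_query(field, text, ignore_case=False):
    """Build a filter that matches documents whose field contains text literally."""
    condition = {'$regex': re.escape(text)}
    if ignore_case:
        condition['$options'] = 'i'  # case-insensitive matching
    return {field: condition}

queryArgs = contains_query('key', 'hi-fi (2.1)')
pattern = queryArgs['key']['$regex']

# Check the pattern locally with Python's own regex engine
matches_literal = re.search(pattern, 'compact hi-fi (2.1) system') is not None
matches_lookalike = re.search(pattern, 'compact hi-fi (2x1) system') is not None
print(matches_literal, matches_lookalike)
```

A regex search is unanchored by default, so the leading and trailing `.*` in the example above are optional.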
(5.1.2.7). The array must contain given elements: $all
# linkNameLst is an array; require it to contain the element 'Electronics, Computers & Office'.
db.getCollection("2018-01-24").find({'linkNameLst': {'$all': ['Electronics, Computers & Office']}})

# linkNameLst is an array; require it to contain both 'Wearable Technology' and 'Electronics, Computers & Office'.
db.getCollection("2018-01-24").find({'linkNameLst': {'$all': ['Wearable Technology', 'Electronics, Computers & Office']}})
(5.1.2.8). Querying by array size
There are two approaches.
First approach: use $where (very flexible, but slower)
# priceLst is an array; the goal is to query len(priceLst) < 3
db.getCollection("20180306").find({$where: "this.priceLst.length < 3"})
For details on $where, see the official documentation: http://docs.mongodb.org/manual/reference/operator/query/where/.
Second approach: test whether the element at a given index of the array exists (more efficient)
For example, len(priceLst) < 3 means that priceLst[2] does not exist
# priceLst is an array; the goal is to query len(priceLst) < 3
db.getCollection("20180306").find({'priceLst.2': {$exists: 0}})
For example, len(priceLst) > 3 means that priceLst[3] exists
# priceLst is an array; the goal is to query len(priceLst) > 3
db.getCollection("20180306").find({'priceLst.3': {$exists: 1}})
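The index-existence trick translates directly into pymongo filter dictionaries. The two helpers below are a sketch with hypothetical names (`array_shorter_than`, `array_longer_than`); they only build the filter and assume `n` is at least 1.

```python
def array_shorter_than(field, n):
    """len(field) < n holds exactly when the element at index n-1 does not exist."""
    return {f'{field}.{n - 1}': {'$exists': False}}

def array_longer_than(field, n):
    """len(field) > n holds exactly when the element at index n exists."""
    return {f'{field}.{n}': {'$exists': True}}

short_filter = array_shorter_than('priceLst', 3)
long_filter = array_longer_than('priceLst', 3)
print(short_filter)
print(long_filter)
```

Either filter can be passed to `find`, for example `db_coll.find(array_shorter_than('priceLst', 3))`. Note that the "shorter than" filter also matches documents that have no `priceLst` field at all.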
(5.1.3) Processing the results: limit, sort, count
(5.1.3.1). Limiting the number of results: limit
# Example 1: sort by sales in descending order and take the first 100
# select top 100 _id,key,sales from galance20170801 where key = 'speakers' order by sales desc
queryArgs = {'key':'speakers'}
projectionFields = ['key','sales']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
topSearchRes = searchRes.sort('sales',pymongo.DESCENDING).limit(100)
(5.1.3.2). Sorting: sort
# Example 2: sales descending, rank ascending
# select _id,key,sales,rank from galance20170801 where key = 'speakers' order by sales desc,rank
queryArgs = {'key':'speakers'}
projectionFields = ['key','sales','rank']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# sortedSearchRes = searchRes.sort('sales',pymongo.DESCENDING) # a single field
sortedSearchRes = searchRes.sort([('sales', pymongo.DESCENDING),('rank', pymongo.ASCENDING)]) # several fields
# Result:
# {'_id': 'B000289DC6', 'key': 'speakers', 'rank': 3.0, 'sales': 120}
# {'_id': 'B001VRJ5D4', 'key': 'speakers', 'rank': 5.0, 'sales': 120}
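(5.1.3.3). Counting: count
# Example 3: count the matching records
# select count(*) from galance20170801 where key = 'speakers'
queryArgs = {'key':'speakers'}
# Cursor.count() was deprecated in PyMongo 3.7 and removed in 4.0; count_documents replaces it.
# On PyMongo older than 3.7, use db_coll.find(queryArgs).count() instead.
searchResNum = db_coll.count_documents(queryArgs)
# Result:
# 106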
5.2. Adding records

5.2.1. Inserting a single record

# Example 1: specify _id; a duplicate _id raises an exception
ID = 'firstRecord'
insertDate = '2017-08-28'
count = 10
insert_record = {'_id':ID, 'endDate': insertDate, 'count': count}
insert_res = db_coll.insert_one(insert_record)
print(f"insert_id={insert_res.inserted_id}: {insert_record}")
# Result: insert_id=firstRecord: {'_id': 'firstRecord', 'endDate': '2017-08-28', 'count': 10}

# Example 2: leave out _id and let it be generated automatically
insertDate = '2017-10-10'
count = 20
insert_record = {'endDate': insertDate, 'count': count}
insert_res = db_coll.insert_one(insert_record)
print(f"insert_id={insert_res.inserted_id}: {insert_record}")
# Result: insert_id=59ad356d51ad3e2314c0d3b2: {'endDate': '2017-10-10', 'count': 20, '_id': ObjectId('59ad356d51ad3e2314c0d3b2')}
5.2.2. Bulk insert

# More efficient, but if you specify _id values they must not repeat
# ordered = True: stop at the first error and raise an exception
# ordered = False: continue past errors and raise an exception after every insert has been attempted
insertRecords = [{'i':i, 'date':'2017-10-10'} for i in range(10)]
insertBulk = db_coll.insert_many(insertRecords, ordered = True)
print(f"insert_ids={insertBulk.inserted_ids}")
# Result: insert_ids=[ObjectId('59ad3ba851ad3e1104a4de6d'), ObjectId('59ad3ba851ad3e1104a4de6e'), ObjectId('59ad3ba851ad3e1104a4de6f'), ObjectId('59ad3ba851ad3e1104a4de70'), ObjectId('59ad3ba851ad3e1104a4de71'), ObjectId('59ad3ba851ad3e1104a4de72'), ObjectId('59ad3ba851ad3e1104a4de73'), ObjectId('59ad3ba851ad3e1104a4de74'), ObjectId('59ad3ba851ad3e1104a4de75'), ObjectId('59ad3ba851ad3e1104a4de76')]
5.3. Modifying records

# Update the record selected by _id. If no record matches, insert it instead (upsert = True)
# item is assumed to be a dict-like record that already carries an _id
updateFilter = {'_id': item['_id']}
updateRes = db_coll.update_one(filter = updateFilter,
                               update = {'$set': dict(item)},
                               upsert = True)
print(f"updateRes: matched = {updateRes.matched_count}, modified = {updateRes.modified_count}")

# Update some fields of the records selected by a filter: i is an existing field, isUpdated is a new one
filterArgs = {'date':'2017-10-10'}
updateArgs = {'$set':{'isUpdated':True, 'i':100}}
updateRes = db_coll.update_many(filter = filterArgs, update = updateArgs)
print(f"updateRes: matched_count={updateRes.matched_count}, "
      f"modified_count={updateRes.modified_count} upserted_id={updateRes.upserted_id}")
# Result: updateRes: matched_count=8, modified_count=8 upserted_id=None

5.4. Deleting records

5.4.1. Deleting one record

# Example 1: the filter uses the same syntax as a query
queryArgs = {'endDate':'2017-08-28'}
delRecord = db_coll.delete_one(queryArgs)
print(f"delRecord={delRecord.deleted_count}")
# Result: delRecord=1

5.4.2. Bulk delete

# Example 2: the filter uses the same syntax as a query
queryArgs = {'i':{'$gt':5, '$lt':8}}
# db_coll.delete_many({}) # deletes every document in the collection
delRecord = db_coll.delete_many(queryArgs)
print(f"delRecord={delRecord.deleted_count}")
# Result: delRecord=2
6. Writing database documents to a CSV file

6.1. Standard code

Reading a CSV file
import csv

with open("phoneCount.csv", "r", newline='') as csvfile:
    reader = csv.reader(csvfile)
    # No readlines needed here; iterate over the reader directly
    for line in reader:
        print(f"line = {line}, typeOfLine = {type(line)}, lenOfLine = {len(line)}")
# Output:
# line = ['850', 'rest', '43', 'NN'], typeOfLine = <class 'list'>, lenOfLine = 4
# line = ['9865', 'min', '1', 'CD'], typeOfLine = <class 'list'>, lenOfLine = 4
Writing a CSV file
# Standard template for exporting every record in a collection
import pymongo
import csv

# Initialize the database connection
mongo_url = "127.0.0.1:27017"
DATABASE = "databaseName"
TABLE = "tableName"

client = pymongo.MongoClient(mongo_url)
db_des = client[DATABASE]
db_des_table = db_des[TABLE]

# Write the data to a CSV file
# Exporting directly from MongoBooster shifts values into the wrong columns as soon as some records lack a field

# newline='' prevents blank rows in the output; it applies to Python 3 only
with open(f"{DATABASE}_{TABLE}.csv", "w", newline='') as csvfileWriter:
    writer = csv.writer(csvfileWriter)
    # First row: the column (field) names
    fieldList = [
        "_id",
        "itemType",
        "field_1",
        "field_2",
        "field_3",
    ]
    writer.writerow(fieldList)

    allRecordRes = db_des_table.find()
    # Write the data rows
    for record in allRecordRes:
        print(f"record = {record}")
        recordValueLst = []
        for field in fieldList:
            # Fill missing fields with a placeholder so the columns stay aligned
            if field not in record:
                recordValueLst.append("None")
            else:
                recordValueLst.append(record[field])
        try:
            writer.writerow(recordValueLst)
        except Exception as e:
            print(f"write csv exception. e = {e}")
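The per-field loop in the export template can also be expressed with `csv.DictWriter`, which fills in missing fields and drops unlisted ones by itself. This is an alternative sketch, not the original author's method: the `records` list is made-up stand-in data for the cursor returned by `find()`, and an in-memory buffer replaces the output file.

```python
import csv
import io

fieldList = ['_id', 'itemType', 'field_1']

# Stand-in for the cursor returned by find(); the second record lacks a field
records = [
    {'_id': 'a1', 'itemType': 'tv', 'field_1': 10, 'extra': 'ignored'},
    {'_id': 'a2', 'itemType': 'radio'},
]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldList,
    restval='None',          # value written when a record lacks a field
    extrasaction='ignore',   # silently skip keys that are not in fieldList
)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with a file opened as in the template above, `open(path, "w", newline='')`.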
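6.2. Possible problems and their solutions

6.2.1. Encoding problems when writing a CSV file

Reference: "Python UnicodeEncodeError: 'gbk' codec can't encode character" and how to fix it:
http://www.jb51.net/article/64816.htm

Key point: the encoding of the target file is the culprit. On Windows a newly opened file defaults to gbk, so the Python interpreter tries to encode the text we fetched from the network with gbk. That text has already been decoded to Unicode, and any character gbk cannot represent triggers the error above. The fix is to change the encoding of the target file.
Solution:
# The most recommended approach is to specify the encoding when opening the file:
# On Windows the default encoding for writing a CSV file is 'gbk', while most data fetched from the web is 'utf-8', so some characters may be incompatible. For example: write csv exception. e = 'gbk' codec can't encode character '\xae' in position 80: illegal multibyte sequence
with open(f"{DATABASE}_{TABLE}.csv", "w", newline='', encoding='utf-8') as csvfileWriter:
    writer = csv.writer(csvfileWriter)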
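The encoding failure and its fix can be reproduced without a database. The sketch below uses a made-up row containing '\xae' (the registered-trademark sign from the error message) and writes it to a temporary file; choosing 'utf-8-sig' rather than plain 'utf-8' is a suggestion of this sketch, made because the byte-order mark helps Excel detect UTF-8.

```python
import csv
import os
import tempfile

row = ['B01EYCLJ04', 'Sony\xae speaker']  # '\xae' is the registered-trademark sign

# Encoding with gbk fails for this character, as in the error message
try:
    row[1].encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Writing with an explicit UTF-8 encoding succeeds
path = os.path.join(tempfile.mkdtemp(), 'export.csv')
with open(path, 'w', newline='', encoding='utf-8-sig') as csvfile:
    csv.writer(csvfile).writerow(row)

# Reading back with the same encoding returns the original row
with open(path, 'r', newline='', encoding='utf-8-sig') as csvfile:
    rows_read = list(csv.reader(csvfile))

print(gbk_ok, rows_read)
```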
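6.2.2. Blank rows when writing a CSV file (every other row is empty)

Python 2.x
For a description and solution, see: https://www.cnblogs.com/China-YangGISboy/p/7339118.html
# The problem turns out to depend on how the file is opened: open it with 'wb' and the blank rows disappear.
# In other words, in Python 2.x CSV files should be read and written in binary mode.
with open('result.csv','wb') as cf:
    writer = csv.writer(cf)
    writer.writerow(['shader','file'])
    for key , value in result.items():
        writer.writerow([key,value])
The real reason Python 2.x needs 'wb' mode:
When writing a CSV file in Python 2.x, the file must be created with the 'b' flag, i.e. open('result.csv','wb'), otherwise every other row comes out blank. The cause: writerow already ends each row with '\r\n' (0x0D 0x0A), and a file opened in text mode on Windows translates every '\n' (0x0A) into '\r\n' on top of that. The row therefore ends in 0x0D 0x0D 0x0A, which Windows programs show as two line breaks. With the 'b' flag the file is written in binary mode and no extra 0x0D is added.

In Python 2.x there are also many implicit conversions between str and bytes, so even though CSV is a text format, writing it through a binary file works.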
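Python 3
In Python 3, str and bytes are clearly separated and nothing is converted implicitly. CSV is a text format and the csv module does not write to binary files, so do not open the file in binary mode, and do not convert the data to bytes.
For a description and solution, see: https://segmentfault.com/q/1010000006841656?_ea=1148776
# The solution is simply to set newline to an empty string
with open('result.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
To summarize: the blank rows come from newline translation in text mode on Windows, and the fix depends on the Python version. Python 2.x requires 'wb'; Python 3.x requires 'w' together with the newline parameter.
Further reading, on converting between bytes and str in Python 3: http://www.jb51.net/article/105064.htm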
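The blank-row mechanism can be seen by looking at the exact characters the csv writer produces. The sketch below uses an in-memory buffer; the `replace` call only simulates what Windows text mode does to '\n' when `newline=''` is left out, which is an illustration rather than a real file write.

```python
import csv
import io

# newline='' means no newline translation, just like open(..., newline='')
buffer = io.StringIO(newline='')
csv.writer(buffer).writerow(['shader', 'file'])
written = buffer.getvalue()

# Simulate Windows text mode without newline='': every '\n' becomes '\r\n'
translated = written.replace('\n', '\r\n')

print(repr(written))
print(repr(translated))
```

The doubled carriage return in `translated` is what spreadsheet programs and editors on Windows display as an empty row after each record.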
---------------------
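Author: Kosmoo
Source: CSDN
Original: https://blog.csdn.net/zwq912318834/article/details/77689568
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.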