MongoDB Notes

MongoDB data types

null {"x":null}
Boolean {"x":true}, {"x":false}
Number, the MongoDB shell uses 64-bit floating point by default, e.g. {"x":2.32} and {"x":2}; for integer types use {"x":NumberInt(2)} or {"x":NumberLong(2)}
String, MongoDB strings are UTF-8 encoded, {"x":"hello world"}
Date, {"x":new Date()}
Regular expression, MongoDB accepts the same regular expressions as JavaScript, {"x":/itbilu/i}
Array, arrays are used the same way as in JavaScript, {"x":["hello","world"]}
Embedded document, {"x":{"y":"Hello"}}
_id and ObjectId(), every MongoDB document contains an _id; if you do not specify one, MongoDB generates an ObjectId automatically
Code, {"x":function aa(){}}
Binary data
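
To make the point about 64-bit floating point numbers concrete, here is a small Python sketch. Python's float uses the same IEEE 754 double format, so it shows which integers survive a round trip through a double and which would need an integer type such as NumberLong. The helper name survives_as_double is made up for this illustration, and the snippet says nothing about MongoDB beyond the number format.

```python
# Python's float is an IEEE 754 64-bit double, the same format the
# MongoDB shell uses for plain numeric literals such as {"x": 2}.
LARGEST_EXACT = 2 ** 53  # every integer up to this magnitude is exact as a double


def survives_as_double(n: int) -> bool:
    """Return True if integer n keeps its exact value when stored as a double."""
    return int(float(n)) == n


small = 2
big = LARGEST_EXACT + 1  # 9007199254740993, needs a real 64-bit integer type

print(survives_as_double(small))  # True
print(survives_as_double(big))    # False
```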

Queries

// List everything
db.getCollection('article').find({})

// Equality condition
db.getCollection('article').find({name : 'name'})

// AND condition
db.getCollection('article').find({name:'name', age:18});

// OR condition
db.getCollection('article').find({$or:[{title:/release/}, {title:/Faq/}]}, {title:1})

// IN condition
db.getCollection('article_756').find({author:{$in:['david', 'Bens', 'xxh']}})

// LIKE condition (regex); note that the regex literal is not quoted
db.getCollection('article').find({name : /ThisName/})
// LIKE, case-insensitive
db.getCollection('article').find({name : /ThisName/i})
db.getCollection('article').find({_id : /^756.*/})

// Return only the specified fields
db.getCollection('article').find({}, {name: 1, rank: 1})

// Exclude the specified fields
db.getCollection('article').find({}, {name: 0, rank: 0})

// Sort, 1: ASC, -1: DESC
db.getCollection('article').find({}).sort({updatedAt: -1})

// Paging with limit and skip
db.getCollection('article').find({}).limit(20).skip(20000)

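Note: when sorting, if MongoDB cannot use an index on the sort field, the combined size of all the records being sorted must not exceed 32 MB (MongoDB 4.4 and later raise this in-memory limit to 100 MB).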

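Implementing pagination:

1. Use .skip(m).limit(n). With this approach MongoDB walks past m records of the current result set and then returns n records, even when the field used by sort is indexed. Paging through a large data set this way is therefore very inefficient.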

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

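In an actual test, fetching 100 records from a record set of 50K records with an average record size of 1.2 KB, the time taken for different skip values was:
0: 0.018s
10: 0.119s
100: 0.648s
1K: 0.573s
10K: 1.44s
100K: 0.939s
1M: 2.17s
5M: 4.93s
10M: 7.52s
20M: about 14s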

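2. Use _id or any uniquely indexed field and page with > or <. This keeps the query fast at any page position, but it has two limitations: 1) it places requirements on the field, and 2) pages can only be walked sequentially, so you cannot jump to an arbitrary page number.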

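from pymongo import MongoClient

# Assumed connection details; adjust the URI and database name to your setup.
db = MongoClient('mongodb://127.0.0.1:27017')['demodb']

def idlimit(page_size, last_id=None):
    """Return one page of documents plus the _id to pass in for the next page."""
    # The first page has no lower bound; later pages start after last_id.
    query = {} if last_id is None else {'_id': {'$gt': last_id}}

    # Sort explicitly by _id so that the page order is well defined.
    cursor = db['students'].find(query).sort('_id', 1).limit(page_size)
    data = list(cursor)

    if not data:
        # No documents left
        return None, None

    # Documents are sorted by _id, so the last one has the largest _id.
    last_id = data[-1]['_id']

    # Return data and last_id
    return data, last_id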
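
To illustrate why the two paging approaches behave so differently, here is a rough in-memory sketch in Python. It treats a sorted list as a stand-in for an index on _id; page_by_skip and page_by_range are hypothetical helper names, and the model only counts entries visited. It is not how MongoDB is implemented internally, just a picture of "walk m entries" versus "seek to a key".

```python
import bisect


def page_by_skip(sorted_ids, skip, limit):
    """Emulate skip/limit: walk from the start, discarding `skip` entries first."""
    page, visited = [], 0
    for position, _id in enumerate(sorted_ids):
        visited += 1
        if position < skip:
            continue
        page.append(_id)
        if len(page) == limit:
            break
    return page, visited


def page_by_range(sorted_ids, last_id, limit):
    """Emulate {'_id': {'$gt': last_id}}: seek straight to the start position."""
    start = 0 if last_id is None else bisect.bisect_right(sorted_ids, last_id)
    return sorted_ids[start:start + limit]


ids = list(range(0, 2000, 2))        # stand-in for an index on _id
page, visited = page_by_skip(ids, 500, 3)
print(page, visited)                 # [1000, 1002, 1004] 503
print(page_by_range(ids, 998, 3))    # [1000, 1002, 1004]
```

Both calls return the same page, but the skip variant touches 503 entries while the range variant uses a binary search to jump directly to the first entry after last_id.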

Finding records that do not contain a given field

db.getCollection('article_9').find({'content':{'$exists':false}})

Modifying values

Modify a value in a specific record

db.article.update({_id:309},{$set:{'lastPage':1}})

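Replacing the value of the _id field. If only the value changes, this can be done in the original collection (save the new document, then remove the old one):

// Note: _id here is of type int32, so the right-hand side of this assignment is problematic; the result ends up converted to int
db.getCollection('article').find({}).forEach( function(u) {
    var old = u._id;
    u._id = u.boardId+'.'+u._id;
    db.getCollection('article').save(u);
    db.getCollection('article').remove({_id: old});
})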

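If both the type and the value change, it cannot be done directly in the original collection; create a new collection to handle it.

Note: this fails when executed in Robo3T and must be run from the command-line shell.

// Rename the collection
db.getCollection('article').renameCollection('article_old')

// Fill the new records into the new collection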
db.getCollection('article_old').find({}).forEach( function(u) {
    var newId = u.boardId.toString() +'.'+ u._id.toString();
    u._id = newId;
    u.parentId = u.boardId.toString() +'.'+ u.parentId.toString();
    db.getCollection('article').save(u);
})

Renaming and removing fields

Format

db.collection.update(
   <query>,
   <update>,
   {
     upsert: <boolean>,
     multi: <boolean>,
     writeConcern: <document>
   }
)
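// query : the selection criteria for the update, like what follows WHERE in an SQL UPDATE.
// update : the update document and update operators (such as $, $inc ...), like what follows SET in an SQL UPDATE.
// upsert : optional. If no record matches, whether to insert the update as a new document; true inserts, the default is false, which does not insert.
// multi : optional. The MongoDB default is false, which updates only the first record found; if true, all records matching the query are updated.
// writeConcern : optional. The write concern, which controls the level at which errors are raised.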
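
As a way to see what the upsert and multi options change, here is a toy Python model that applies a $set-style update to a list of dicts. It is only a sketch under strong assumptions: the function name update and the sample documents are invented, only equality matching is modelled, and a real upsert would also generate an _id.

```python
def update(docs, query, set_fields, upsert=False, multi=False):
    """Toy model of update(<query>, {$set: ...}, {upsert, multi}) on a list of dicts.

    Only equality matching and $set-style assignment are modelled.
    Returns the number of documents modified or inserted.
    """
    matched = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    if not matched:
        if upsert:
            # Nothing matched: insert a new document built from query + update.
            docs.append({**query, **set_fields})
            return 1
        return 0
    # multi=False touches only the first match, multi=True touches all of them.
    targets = matched if multi else matched[:1]
    for d in targets:
        d.update(set_fields)
    return len(targets)


articles = [
    {"_id": 1, "author": "david", "lastPage": 0},
    {"_id": 2, "author": "david", "lastPage": 0},
]
print(update(articles, {"author": "david"}, {"lastPage": 1}))              # 1
print(update(articles, {"author": "david"}, {"lastPage": 2}, multi=True))  # 2
print(update(articles, {"author": "xxh"}, {"lastPage": 1}, upsert=True))   # 1
print(len(articles))                                                       # 3
```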

Rename a field

db.getCollection('article').update({},{$rename:{"COMMPP":'COMP_NAME'}},false,true)

Remove a field

// Remove the zhLatin field from records where from equals hengduan and zhLatin is null
db.getCollection('species').update({"from":"hengduan","zhLatin":null},{$unset: {'zhLatin':''}},false, true)

Deleting

Drop a db

db.dropDatabase()

Drop a collection

db.getCollection('section_to_board').drop()

Remove records; the _id value can be an ObjectId, string, int, etc.

db.getCollection('article').remove({_id: ObjectId("adfasdfadsf")})
db.getCollection('article').remove({board: 'name'})

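MongoDB does not release disk space after a collection is deleted. To avoid moving large amounts of data around after records are deleted, the space of the original records is not freed but only marked as deleted, so that it can be reused later. Releasing this space requires the repair command db.repairDatabase(). If MongoDB crashes during the repair and cannot be restarted, run ./mongod --repair --dbpath=/data/mongo/ to repair it, with dbpath pointing at the data file directory of the database to be repaired. The repair may take a long time.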

Indexes

Before MongoDB 3.0.0, indexes were created with db.collection.ensureIndex(); later versions use db.collection.createIndex().

// Create a compound unique index, built in the background without blocking
db.collection.ensureIndex( {"id":1,"name":1}, {background:1,unique:1} )
// Create an index
db.collection.createIndex( { orderDate: 1 } )
// Specify an index name; if none is given, MongoDB generates one by concatenating the indexed field names and sort orders
db.collection.createIndex( { category: 1 }, { name: "category_fr" } )
// Create a compound index
db.collection.createIndex( { orderDate: 1, category: 1 }, { name: "date_category_fr", collation: { locale: "fr", strength: 2 } } )

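// List the indexes of a collection
db.collection.getIndexes()
// Show the total index size of a collection
db.collection.totalIndexSize()
// Drop all indexes of a collection
db.collection.dropIndexes()
// Drop a specific index of a collection
db.collection.dropIndex("index_name")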
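
The rule for generated index names (field names joined with their sort orders) is easy to sketch in Python. default_index_name is a hypothetical helper written for this note; it only reproduces the naming rule for simple ascending and descending keys and does not talk to a server.

```python
def default_index_name(keys):
    """Join each field name and its sort order with underscores."""
    return "_".join(f"{field}_{order}" for field, order in keys)


print(default_index_name([("orderDate", 1)]))                   # orderDate_1
print(default_index_name([("orderDate", 1), ("category", 1)]))  # orderDate_1_category_1
print(default_index_name([("updatedAt", -1)]))                  # updatedAt_-1
```

Such a generated name is what db.collection.dropIndex() expects when no explicit name was given at creation time.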

Statistics

Mainly count, distinct and group

// Count records
db.getCollection('article').find({name:'name', age:18}).count()

// Distinct
// Format: db.collectionName.distinct(field, query, options)
// Distinct values of flag across all records; flag must be quoted
db.getCollection('article').distinct('flag')
// Distinct with a condition: the distinct values of flag where author is gre
db.getCollection('article').distinct('flag', {author: 'gre'})

// group is essentially a MapReduce-style form of statistics
// Run statistics over data with the following structure
{
 "_id" : ObjectId("552a333f05c2b62c01cff50e"),
 "_class" : "com.mongo.model.Orders",
 "onumber" : "004",
 "date" : ISODate("2014-01-05T16:03:00Z"),
 "cname" : "zcy",
 "item" : {
   "quantity" : 5,
   "price" : 4.0,
   "pnumber" : "p002"
  }
}

// Group records by date and pnumber and accumulate quantity in reduce; the output contains the key fields and the fields of out
db.orders.group({
    key: { date:1,'item.pnumber':1 },
    initial: {"total":0},
    reduce: function Reduce(doc, out) {
        out.total+=doc.item.quantity
    }
})

// Group records by date, accumulate quantity and amount in reduce, then compute the average unit price in finalize
db.orders.group({
    key: {date:1},
    initial: {"total":0,"money":0},
    reduce: function Reduce(doc, out) {
        out.total+=doc.item.quantity;
        out.money+=doc.item.quantity*doc.item.price;
    },
    finalize : function Finalize(out) {
        out.avg=out.money/out.total;
        return out;
    }
});

Note: the group command cannot run on sharded collections, and its result set must not exceed 16 MB.
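
To show how key, initial, reduce and finalize fit together, here is a toy in-memory model of group in Python. The function group, the callbacks and two of the three sample orders are invented for illustration (dates are simplified to strings); it mirrors the flow of the shell examples rather than MongoDB's actual implementation.

```python
import copy


def group(docs, key, initial, reduce, finalize=None):
    """Toy in-memory model of group: one accumulator per distinct key value."""
    buckets = {}
    for doc in docs:
        k = tuple(key(doc))
        if k not in buckets:
            # Each group starts from its own copy of `initial`.
            buckets[k] = copy.deepcopy(initial)
        reduce(doc, buckets[k])
    if finalize is not None:
        for out in buckets.values():
            finalize(out)
    return buckets


orders = [
    {"date": "2014-01-05", "item": {"quantity": 5, "price": 4.0, "pnumber": "p002"}},
    {"date": "2014-01-05", "item": {"quantity": 3, "price": 2.0, "pnumber": "p001"}},
    {"date": "2014-01-06", "item": {"quantity": 2, "price": 4.0, "pnumber": "p002"}},
]


def reduce_order(doc, out):
    out["total"] += doc["item"]["quantity"]
    out["money"] += doc["item"]["quantity"] * doc["item"]["price"]


def add_average(out):
    out["avg"] = out["money"] / out["total"]


result = group(orders, key=lambda d: (d["date"],),
               initial={"total": 0, "money": 0},
               reduce=reduce_order, finalize=add_average)
print(result[("2014-01-05",)])  # {'total': 8, 'money': 26.0, 'avg': 3.25}
```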

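Aggregation

For the same statistics, aggregate performs better than group.

Meaning of the keywords (note the keywords that start with $, and the $-prefixed field names):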

$sum	Computes the sum.	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg	Computes the average.	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min	Gets the minimum of the corresponding values across the grouped documents.	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$max	Gets the maximum of the corresponding values across the grouped documents.	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
$push	Appends the value to an array in the result document.	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet	Appends the value to an array in the result document, without creating duplicates.	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first	Gets the value from the first document, according to the order of the source documents.	db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
$last	Gets the value from the last document, according to the order of the source documents.	db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])

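// Count the number of articles per author

db.getCollection('article').aggregate([{$group : {_id : "$author", num_articles : {$sum : 1}}}])

// Roughly equivalent to the following group call (whose finalize additionally computes an avg field)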

db.getCollection('article_756').group({
    key: {author:1},
    initial: {"total":0},
    reduce: function Reduce(doc, out) {
        out.total += 1;
    },
    finalize : function Finalize(out) {
        out.avg = out.total / 100
        return out;
    }
});

// Statistics with a condition

db.getCollection('article').aggregate([
    {$match: { author: 'Milton' }},
    {$group: { _id: "$boardId", total: { $sum: 1 } } },
    {$sort: { total: -1 } }
])
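
The three stages of this pipeline can be mimicked on an in-memory list, which may help when reasoning about what each stage contributes. The sketch below is plain Python with invented sample documents; it mirrors $match, $group with {$sum: 1}, and $sort, and makes no claim about how the server executes the pipeline.

```python
from collections import Counter

articles = [
    {"author": "Milton", "boardId": 756},
    {"author": "Milton", "boardId": 756},
    {"author": "Milton", "boardId": 9},
    {"author": "david", "boardId": 756},
]

# $match: keep only the documents of one author
matched = [a for a in articles if a["author"] == "Milton"]

# $group with {$sum: 1}: count documents per boardId
totals = Counter(a["boardId"] for a in matched)

# $sort {total: -1}: order the groups by descending count
result = [{"_id": board, "total": n} for board, n in totals.most_common()]
print(result)  # [{'_id': 756, 'total': 2}, {'_id': 9, 'total': 1}]
```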

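Backup and restore

To back up data, use the mongodump command; the smallest unit that can be specified is a collection. Command-line example:

mongodump -h 127.0.0.1:27017 -d demodb -c article_1 -o ./
# -h server IP and port
# -d db
# -c collection; if omitted, all collections are exported
# -o directory for the exported files; by default a subdirectory named after the db is added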

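On export, a directory with the same name as the db is added, and each collection is stored in its own files: one bson file and one metadata.json file per collection.

With MongoDB 4, the following options are also available:

--gzip Output compressed files, with a .gz suffix appended to the original file name, which saves compressing them yourself after the export
--dumpDbUsersAndRoles Must be used together with -d; also exports users and roles. If no db is specified, mongodump automatically exports all user and role data
--excludeCollection string Exclude the specified collection; to exclude several, repeat this option
--excludeCollectionsWithPrefix string Exclude collections whose names start with the specified prefix; to exclude several, repeat this option
--numParallelCollections int, -j int Number of collections to export in parallel, default 4
--viewsAsCollections Export views as collections, so that they become collections after restore; if not specified, only the view's metadata is exported and the view is recreated on restore

For example: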

mongodump -h 127.0.0.1:27017 --gzip -d demodb -c article -o ./
mongodump -h 127.0.0.1:27017 -d demodb -o ./ --gzip --excludeCollection=col1 --excludeCollection=col2 --excludeCollection=col3
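
When the list of collections to exclude comes from code rather than being typed by hand, the repeated --excludeCollection option can be assembled programmatically. The Python sketch below only builds the argument list from the options shown above; mongodump_command is a hypothetical helper, and nothing is executed.

```python
import shlex


def mongodump_command(host, db, out_dir, exclude=(), gzip=True):
    """Build a mongodump command line using only the options shown above."""
    args = ["mongodump", "-h", host, "-d", db, "-o", out_dir]
    if gzip:
        args.append("--gzip")
    # --excludeCollection must be repeated once per collection.
    for name in exclude:
        args.append(f"--excludeCollection={name}")
    return args


cmd = mongodump_command("127.0.0.1:27017", "demodb", "./", exclude=["col1", "col2", "col3"])
print(shlex.join(cmd))
# mongodump -h 127.0.0.1:27017 -d demodb -o ./ --gzip --excludeCollection=col1 --excludeCollection=col2 --excludeCollection=col3
```

The returned list can be handed to subprocess.run(cmd) as is, which avoids shell quoting problems with unusual collection names.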

Dumping a chosen set of collections in one go cannot be done directly with a single mongodump command line; it has to go through a shell script:

#!/bin/bash
db=demodb
var=$1
collection_list=${var//,/ }
host=127.0.0.1
port=27017
out_dir="./"

for collection in $collection_list; do
    echo "$collection"
    mongodump -h "$host:$port" -c "$collection" -d "$db" -o "${out_dir}" --gzip
done


# Usage: separate multiple collections with commas and leave no spaces in between, for example
./dump.sh c1,c2,c3,c4,c_5,c_6

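To restore data, use the mongorestore command. Command-line example:

# Restore a backup exported with --gzip; the -c option cannot be used to specify a collection
mongorestore -h 127.0.0.1:27017 -d demodb --objcheck --stopOnError --gzip folder/

If collections need to be included or excluded during restore, use the --nsInclude and --nsExclude options: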

mongorestore --nsInclude 'transactions.*' --nsExclude 'transactions.*_dev' dump/

Quickly copying collections between dbs

mongodump --archive --db src_db -h 127.0.0.1:27017 --excludeCollection board --excludeCollection section --excludeCollection section_to_board --excludeCollection user | mongorestore --archive -j1 -h 127.0.0.1:27017 --nsInclude 'src_db.col_*' --nsFrom 'src_db.col_$A$' --nsTo 'tgt_db.col_$A$'

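# Since nsInclude is already given in the second half, the --excludeCollection options can be left out of the first half. In that case only a count is performed on each collection in src_db and no actual transfer takes place, so the effect on execution time is small.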
mongodump --archive --db src_db -h 127.0.0.1:27017 | mongorestore --archive -j1 -h 127.0.0.1:27017 --nsInclude 'src_db.col_*' --nsFrom 'src_db.col_$A$' --nsTo 'tgt_db.col_$A$'
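
The $A$ placeholder in --nsFrom and --nsTo captures part of the source namespace and reuses it in the target. The Python sketch below is an approximate emulation of that renaming rule, written to make the mapping easier to predict; rename_namespace is a made-up helper, it assumes each placeholder name appears only once in the source pattern, and it is not mongorestore's own matching code.

```python
import re


def rename_namespace(ns, ns_from, ns_to):
    """Map ns through a $NAME$-style pattern pair, or return None if it does not match."""
    # Split on $NAME$ placeholders; odd-indexed parts are the variable names.
    parts = re.split(r"\$(\w+)\$", ns_from)
    regex = "".join(
        f"(?P<{part}>.*)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, ns)
    if match is None:
        return None
    # Substitute the captured values into the target pattern.
    return re.sub(r"\$(\w+)\$", lambda m: match.group(m.group(1)), ns_to)


print(rename_namespace("src_db.col_305", "src_db.col_$A$", "tgt_db.col_$A$"))  # tgt_db.col_305
print(rename_namespace("src_db.col_305", "src_db.col_$A$", "tgt_db.col_all"))  # tgt_db.col_all
print(rename_namespace("src_db.board", "src_db.col_$A$", "tgt_db.col_$A$"))    # None
```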

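To merge several collections into a single one, run the statement below once per collection, where col_305 is the collection name to replace each time. Using a wildcard reports a conflict; if you know how to import several at once, please leave a comment.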

mongodump --archive --db src_db -h 127.0.0.1:27017 | mongorestore --archive -j1 -h 127.0.0.1:27017 --nsInclude 'src_db.col_305' --nsFrom 'src_db.col_$A$' --nsTo 'tgt_db.col_all'

This is a script for batch merging:

#!/bin/bash

if [ -z "$1" ]; then
  echo "Usage: $0 [file_name]"
  exit 2
else
    # Each line of the input file is the suffix of one collection name
    while read -r line
    do
        echo "$line"
        mongodump --archive -d src_db -c "col_${line}" -j1 -h 127.0.0.1:27017 | mongorestore --archive -j1 -h 127.0.0.1:27017 --nsInclude "src_db.col_${line}" --nsFrom 'src_db.col_$A$' --nsTo 'tgt_db.col_all'
    done < "$1"
fi