MongoDB Study Notes (3): Aggregation

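In real-world database applications, we often need aggregation operations to process data, for example to compute statistics or reshape results.

This article shows how to use aggregation in MongoDB.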

1. Aggregation Operators and Pipeline Stages

The basic syntax for using aggregation operators and pipeline stages is:

db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

Aggregation operators (accumulators) process data, for example by summing or averaging values, and return the computed result. Common ones are:

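Operator	Description
$sum	Computes the sum
$avg	Computes the average
$min	Returns the minimum value
$max	Returns the maximum value
$first	Returns the value from the first document in each group
$last	Returns the value from the last document in each group
$push	Appends a value to an array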
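
As a rough intuition for what these accumulators compute, here is a plain JavaScript sketch (not MongoDB API) applied to the values of a single group. The `salaries` array and the `accumulators` object are made up for illustration. Inside `$group`, `$first` and `$last` depend on the order in which documents arrive, so they are mostly meaningful after a `$sort`.

```javascript
// Hypothetical values of one field, collected from the documents of a single group,
// in the order the documents reached the $group stage.
const salaries = [10000, 15000, 12000];

// Plain-JavaScript analogues of the accumulators (illustration only, not MongoDB code).
const total = salaries.reduce((acc, v) => acc + v, 0);
const accumulators = {
    sum: total,                           // like $sum
    avg: total / salaries.length,         // like $avg
    min: Math.min(...salaries),           // like $min
    max: Math.max(...salaries),           // like $max
    first: salaries[0],                   // like $first
    last: salaries[salaries.length - 1],  // like $last
    push: [...salaries]                   // like $push
};

console.log(accumulators);
```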

Pipeline stages pass the output of one stage to the next stage for further processing. Common ones are:

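Operator	Description
$group	Groups documents
$project	Reshapes documents: can rename, add, or remove fields
$match	Filters out documents that do not match the condition
$sort	Sorts documents
$limit	Passes along only the first N documents
$skip	Skips the first N documents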
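
The idea that each stage consumes the previous stage's output can be sketched in plain JavaScript. Everything here (`match`, `sortBy`, `skip`, `limit`, `runPipeline`, and the sample documents) is a hypothetical stand-in written for illustration, not part of MongoDB.

```javascript
// Each "stage" is a function from an array of documents to a new array of documents.
// These helpers are hypothetical stand-ins for $match, $sort, $skip and $limit.
const match = (predicate) => (docs) => docs.filter(predicate);
const sortBy = (field, direction) => (docs) =>
    [...docs].sort((a, b) => direction * (a[field] - b[field]));
const skip = (n) => (docs) => docs.slice(n);
const limit = (n) => (docs) => docs.slice(0, n);

// Feed the output of each stage into the next one, in order.
const runPipeline = (docs, stages) =>
    stages.reduce((current, stage) => stage(current), docs);

// Made-up sample documents.
const docs = [
    { name: 'a', score: 70 },
    { name: 'b', score: 95 },
    { name: 'c', score: 40 },
    { name: 'd', score: 85 }
];

const result = runPipeline(docs, [
    match((doc) => doc.score >= 60), // keeps a, b, d
    sortBy('score', -1),             // b (95), d (85), a (70)
    skip(1),                         // d, a
    limit(1)                         // d
]);

console.log(result);
```

Because stages run in order, swapping two of them (for example `limit` before `sortBy`) generally changes the result.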

Let's try it out. First, prepare some test data:

> use university
> db.teacher.insert([
    {
        'tid': '19001',
        'name': 'Alice',
        'age': 32,
        'department': 'Computer',
        'salary': 10000
    },
    {
        'tid': '19002',
        'name': 'Bob',
        'age': 48,
        'department': 'Computer',
        'salary': 15000
    },
    {
        'tid': '19003',
        'name': 'Alice',
        'age': 42,
        'department': 'Software',
        'salary': 12000
    },
    {
        'tid': '19004',
        'name': 'Christy',
        'age': 38,
        'department': 'Software',
        'salary': 14000
    },
    {
        'tid': '19005',
        'name': 'Daniel',
        'age': 28,
        'department': 'Architecture',
        'salary': 8000
    }
])

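Compute the total salary of all teachers

db.teacher.aggregate([
    {
        $group: {
            _id: null, // no grouping key: all documents fall into a single group
            total_salary: { $sum: '$salary' } // sum the values of the salary field
        }
    },
    {
        $project: {
            _id: 0, // exclude the _id field
            total_salary: 1 // include the total_salary field
        }
    }
])

// Result
// { "total_salary" : 59000 }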

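Count the teachers whose salary is greater than 10000

db.teacher.aggregate([
    {
        $match: {
            salary: { $gt: 10000 } // keep only documents whose salary is greater than 10000
        }
    },
    {
        $group: {
            _id: null, // no grouping key: all documents fall into a single group
            total_teacher: { $sum: 1 } // add 1 for each document, i.e. count the documents
        }
    },
    {
        $project: {
            _id: 0, // exclude the _id field
            total_teacher: 1 // include the total_teacher field
        }
    }
])

// Result
// { "total_teacher" : 3 }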

Compute the average salary of the teachers in each department, sorted by average salary in ascending order

db.teacher.aggregate([
    {
        $group: {
            _id: '$department', // group by the value of the department field
            avg_salary: { $avg: '$salary' } // average the values of the salary field
        }
    },
    {
        $project: {
            _id: 0, // exclude the _id field
            dept_name: '$_id', // add a dept_name field that takes the value of _id
            avg_salary: 1 // include the avg_salary field
        }
    },
    {
        $sort: {
            avg_salary: 1 // sort by avg_salary in ascending order
        }
    }
])

// Result
// { "avg_salary" : 8000, "dept_name" : "Architecture" }
// { "avg_salary" : 12500, "dept_name" : "Computer" }
// { "avg_salary" : 13000, "dept_name" : "Software" }
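
To make the grouping step concrete, the following plain JavaScript sketch reproduces the same computation in memory, without a MongoDB server. It is only an illustration of what group, project and sort do to the data; the variable names (`teachers`, `groups`, `averages`) are assumptions, and only the `department` and `salary` values are copied from the test data.

```javascript
// Department and salary values copied from the test data above.
const teachers = [
    { department: 'Computer', salary: 10000 },
    { department: 'Computer', salary: 15000 },
    { department: 'Software', salary: 12000 },
    { department: 'Software', salary: 14000 },
    { department: 'Architecture', salary: 8000 }
];

// Like $group: bucket documents by department, keeping a running sum and count.
const groups = new Map();
for (const { department, salary } of teachers) {
    const g = groups.get(department) || { sum: 0, count: 0 };
    g.sum += salary;
    g.count += 1;
    groups.set(department, g);
}

// Like $project and $sort: reshape each group, then order by average, ascending.
const averages = [...groups.entries()]
    .map(([dept_name, g]) => ({ avg_salary: g.sum / g.count, dept_name }))
    .sort((a, b) => a.avg_salary - b.avg_salary);

console.log(averages);
```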

Output the IDs of the three highest-paid teachers

db.teacher.aggregate([
    {
        $sort: {
            salary: -1 // sort by salary in descending order
        }
    },
    {
        $limit: 3 // keep only the first 3 documents
    },
    {
        $project: {
            _id: 0, // exclude the _id field
            tid: 1 // include the tid field
        }
    }
])

// Result
// { "tid" : "19002" }
// { "tid" : "19004" }
// { "tid" : "19003" }

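2. Map-Reduce

Besides aggregation operators and pipeline stages, MongoDB offers another, more flexible kind of aggregation: map-reduce.

Map-reduce is a computation model that breaks a large job into pieces that are processed independently (map) and then merges the partial results into a final result (reduce).

Its basic syntax is: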

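db.COLLECTION_NAME.mapReduce(
    function() { emit(key, value) }, // map function: emits key-value pairs, which become the input of reduce
    function(key, values) { return reducedValue }, // reduce function: combines values into a single value for key
    {
        query: <document>, // filter: only matching documents are passed to the map function
        sort: <document>, // sorts the documents before the map function is called
        limit: <number>, // caps the number of documents passed to the map function
        finalize: <function>, // post-processes each result before it is stored
        out: <string or document> // where to store the results: a collection name, or { inline: 1 } to return them directly
    }
)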
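
The flow from map, through grouping by key, to reduce can be sketched in plain JavaScript as follows. `mapReduceSketch` is a hypothetical in-memory driver written only for illustration; unlike MongoDB, where the map function reads the current document from `this` and calls a global `emit`, this sketch passes the document and `emit` as explicit arguments.

```javascript
// Hypothetical driver that mimics the map -> group by key -> reduce flow in memory.
function mapReduceSketch(docs, mapFn, reduceFn) {
    const emitted = new Map(); // key -> array of emitted values

    // Map phase: mapFn calls emit(key, value) for each document.
    const emit = (key, value) => {
        if (!emitted.has(key)) emitted.set(key, []);
        emitted.get(key).push(value);
    };
    docs.forEach((doc) => mapFn(doc, emit));

    // Reduce phase: collapse each key's values into a single value.
    // A key with only one value skips reduce, as it does in MongoDB.
    const results = [];
    for (const [key, values] of emitted) {
        const value = values.length > 1 ? reduceFn(key, values) : values[0];
        results.push({ _id: key, value });
    }
    return results;
}

// Made-up input: count words by their length.
const words = [{ w: 'map' }, { w: 'reduce' }, { w: 'emit' }, { w: 'key' }];
const counts = mapReduceSketch(
    words,
    (doc, emit) => emit(doc.w.length, 1),
    (key, values) => values.reduce((a, b) => a + b, 0)
);

console.log(counts);
```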

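Here is an example.

Find the departments whose teachers older than 30 have an average salary above 10000, without outputting the salary figures. Note that the code below does not test the 10000 threshold itself; with this test data, every department that passes the age filter already averages above it (12500 and 13000).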

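db.teacher.mapReduce(
    // Step 2: run the map function. Its job is to call emit, which supplies the input of reduce.
    // The first argument to emit is the grouping key; the second is the value to aggregate.
    // Here department is the key, and the salary values emitted for the same key are collected into the array values.
    // Each resulting (key, values) pair is then passed to the reduce function.
    function() { emit(this.department, this.salary) },
    // Step 3: run the reduce function, which turns (key, values) into (key, value).
    // It receives (key, values) from the map phase and returns a single value for that key,
    // and the resulting (key, value) pair is passed on.
    // Here it returns the average of values. Note that MongoDB may call reduce more than once for
    // the same key, and an average of averages is not always the overall average; it is safe here
    // because each key has only two values.
    function(key, values) { return Array.avg(values) },
    {
        // Step 1: apply query first, so that only matching documents are sent to the map function.
        query: { age: { $gt: 30 } },
        // Step 4: run the finalize function before the result is stored in the out collection.
        // It receives (key, value) from the reduce phase and returns the final value.
        // Here it hides the average salary by setting value to null.
        finalize: function(key, value) {
            return null
        },
        // Step 5: store the final result in the total_teacher collection.
        out: 'total_teacher'
    }
)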

The mapReduce call returns the following summary:

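{
	"result" : "total_teacher", // name of the collection that stores the result
	"timeMillis" : 276, // elapsed time in milliseconds
	"counts" : {
		"input" : 4, // number of documents passed to the map function after filtering
		"emit" : 4, // number of emit calls made by the map function
		"reduce" : 2, // number of times the reduce function was called
		"output" : 2 // number of documents in the result collection
	},
	"ok" : 1
}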

Then check the results:

> show collections
// teacher
// total_teacher
> db.total_teacher.find()
// { "_id" : "Computer", "value" : null }
// { "_id" : "Software", "value" : null }
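
For comparison, the same grouping can also be expressed as an aggregation pipeline, which is worth knowing because mapReduce is deprecated as of MongoDB 5.0. The sketch below is an assumption about how one might write it, not the author's code: it only builds the stages as a plain JavaScript array named `pipeline` (a made-up name) so it can be inspected without a server, and the call to `db.teacher.aggregate` is left as a comment. Unlike the mapReduce version, it would omit the `value` field rather than set it to null, and it does not write to a collection.

```javascript
// Pipeline roughly equivalent to the mapReduce call above (a sketch under the stated assumptions).
const pipeline = [
    // Same filter as the mapReduce `query` option.
    { $match: { age: { $gt: 30 } } },
    // Group by department; the average is computed here but dropped in the next stage.
    { $group: { _id: '$department', avg_salary: { $avg: '$salary' } } },
    // Mirror the finalize step, which hid the salary figure: keep only _id.
    { $project: { _id: 1 } }
];

// In the mongo shell, with the teacher collection from above:
// db.teacher.aggregate(pipeline)

console.log(JSON.stringify(pipeline, null, 2));
```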
