【Hive】HiveQL实战之数据抽样Sample

当数据量很大时,需要查找一个数据的子集用于加快数据的分析,这种技术就是抽样技术。Hive中,数据抽样分为以下三种:
  • 随机抽样;
  • 桶表抽样;
  • 块抽样;
1 随机抽样

1)语法结构

使用Rand()和LIMIT关键字得到抽样数据,Distribute和Sort关键字确保数据在mappers和reducers之间高效的随机分布,也可以使用order by rand()实现,但是性能不好。

语法:
SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
2)示例
0: jdbc:hive2://localhost:10000/hive> select * from ana
. . . . . . . . . . . . . . . . . . > distribute by rand() sort by rand() limit 3;
+-----------+-------------+-------------+
| ana.name  | ana.depart  | ana.salary  |
+-----------+-------------+-------------+
| Mike      | 1001        | 6400        |
| Will      | 1000        | 4000        |
| Richard   | 1002        | 8000        |
+-----------+-------------+-------------+
3 rows selected (123.999 seconds)
2 桶表抽样

1)语法结构

桶表抽样是桶表已优化的特殊的抽样方法,colname指定列在哪里取样数据,当在整个行取样时,可以使用Rand()函数,如果抽样列是Clustered By列,Tablesample语句会更高效。

语法:
SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF <total
number of buckets> ON [colname|RAND()]) table_alias;
2)示例

3 块抽样

1)语法结构

块抽样允许Hive随机从数据中挑选N行、数据量的百分比或者数据的N字节大小。该种抽样的粒度是HDFS的块大小。

语法:
SELECT *
FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

-- ByteLengthLiteral
-- (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
2)示例

示例一:按行抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana  tablesample(4 rows) a;
+----------+
|   name   |
+----------+
| Lucy     |
| Michael  |
| Steven   |
| Will     |
+----------+
4 rows selected (0.29 seconds)
示例二:按数据大小的百分比抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana  tablesample(10 percent) a;
+----------+
|   name   |
+----------+
| Lucy     |
| Michael  |
+----------+
2 rows selected (0.345 seconds)
示例三:按数据大小抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana  tablesample(2M) a;
+----------+
|   name   |
+----------+
| Lucy     |
| Michael  |
| Steven   |
| Will     |
| Will     |
| Jess     |
| Lily     |
| Mike     |
| Richard  |
| Wei      |
| Yun      |
+----------+
11 rows selected (0.264 seconds)



原文地址:https://www.cnblogs.com/alen-liu-sz/p/12975621.html