postgresql 表的 tablesample

os:centos 7.4
postgresql:10.4

postgresql 的数据抽样是从 9.5版本开始提供，主要是解决从大量数据中随机获取一些数据的效率。

##生成数据

$ psql
psql (10.4)
Type "help" for help.

postgres=# create table test01(id integer, val char(1000)); 
CREATE TABLE
postgres=# insert into test01 values(generate_series(1,500000),repeat( chr(int4(random()*26)+65),1000));

查看普通方法的执行计划

postgres=# 	iming
Timing is on.
postgres=# explain analyze verbose select * from test01 order by random() limit 2;
                                                              QUERY PLAN                                
                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=44583.98..44583.98 rows=2 width=1040) (actual time=464.091..464.092 rows=2 loops=1)
   Output: id, val, (random())
   ->  Sort  (cost=44583.98..45833.97 rows=499999 width=1040) (actual time=464.090..464.090 rows=2 loops=1)
         Output: id, val, (random())
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 31kB
         ->  Seq Scan on public.test01  (cost=0.00..39583.99 rows=499999 width=1040) (actual time=9.696..340.137 rows=500000 loops=1)
               Output: id, val, random()
 Planning time: 0.139 ms
 Execution time: 464.113 ms
(10 rows)

Time: 613.564 ms

可以看到执行计划是走的全表扫描。

select *
from tablename
[ TABLESAMPLE sampling_method ( argument [, …] ) [ REPEATABLE ( seed ) ] ]

TABLESAMPLE sampling_method ( argument [, …] ) [ REPEATABLE ( seed ) ]
table_name之后的 TABLESAMPLE子句表示应该用指定的 sampling_method 来检索表中行的子集。这种采样优先于任何其他过滤器（例如 WHERE子句）。
标准 PostgreSQL发布包括两种采样方法：BERNOULLI和SYSTEM，其他采样方法可以通过扩展安装在数据库中。

BERNOULLI以及SYSTEM采样方法都接受一个参数，它表示要采样的表的分数，表示为一个 0 到 100 之间的百分数。
这个参数可以是任意的实数值表达式（其他的采样方法可能接受更多或者不同的参数）。这两种方法都返回一个随机选取的该表采样，其中包含了指定百分数的表行。

BERNOULLI方法扫描整个表并且用指定的几率选择或者忽略行。
SYSTEM方法会做块层的采样，每个块都有指定的机会能被选中，被选中块中的所有行都会被返回。

在指定较小的采样百分数时，SYSTEM 方法要比BERNOULLI方法快很多，但是前者可能由于聚簇效应返回随机性较差的表采样。

可选的REPEATABLE子句指定一个用于产生采样方法中随机数的种子数或表达式。种子值可以是任何非空浮点值。如果查询时表没有被更改，指定相同种子和argument值的两个查询将会选择该表相同的采样。
但是不同的种子值通常将会产生不同的采样。如果没有给出REPEATABLE，则会基于一个系统产生的种子为每一个查询选择一个新的随机采样。注意有些扩展采样方法不接受REPEATABLE，并且将总是为每一次使用产生新的采样。

##tablesample system
system抽样方式为随机抽取表的数据块上的数据，抽样级别为数据块级别。被选中的所有行都会被返回。

postgres=# explain analyze verbose select * from test01 tablesample system(0.01);
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Sample Scan on public.test01  (cost=0.00..12.50 rows=50 width=1032) (actual time=0.056..0.175 rows=45 loops=1)
   Output: id, val
   Sampling: system ('0.01'::real)
 Planning time: 0.027 ms
 Execution time: 0.189 ms
(5 rows)

Time: 0.493 ms

##tablesample bernoulli
system抽样方式为随机抽取表的数据行数据，抽样级别为数据行级别。比 system 具有更好的随机性，但是性能要差很多。

postgres=# explain analyze verbose select * from test01 tablesample bernoulli(0.01);
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sample Scan on public.test01  (cost=0.00..33334.50 rows=50 width=1032) (actual time=1.688..80.150 rows=47 loops=1)
   Output: id, val
   Sampling: bernoulli ('0.01'::real)
 Planning time: 0.029 ms
 Execution time: 80.173 ms
(5 rows)

Time: 80.461 ms

可以看出 bernoulli 比 system 的性能要低很多，耗时大概为160倍。

参考：
http://postgres.cn/docs/10/sql-select.html