Fast data loading with pg_bulkload


Source code: https://github.com/ossc-db/pg_bulkload

Introduction

  PostgreSQL's built-in COPY command is a convenient data loading tool: it loads data from files into tables and unloads table data back to files. pg_bulkload is a high-speed data loading tool for PostgreSQL, and its biggest advantage over COPY is speed: its direct mode skips the shared buffers and WAL buffers and writes table files directly. Because a load that fails partway through can leave those files inconsistent, pg_bulkload also includes a recovery function to repair them after a failed import.
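If a direct-mode load crashes partway, recovery is run with the PostgreSQL server stopped, pointing at the data directory (a minimal sketch based on the help output shown later; the data directory path is inferred from the badfile paths in the log below):

pg_bulkload -r -D /home/thunisoft/abdata/7.0/abase1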

Architecture

 pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The earliest versions of pg_bulkload did nothing but load data; version 3.1 added data filtering.

(Figure: pg_bulkload architecture — reader and writer modules)
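Filtering is configured through the FILTER parameter: a SQL function that receives each parsed input row and returns a record shaped like the target table. A minimal sketch against the test table used later in this post (the function name and the transformation are illustrative, not from the original post):

-- hypothetical filter: upper-case the info column on its way into test
create function test_filter(int, text, timestamp) returns record
    as $$ select $1, upper($2), $3 $$
    language sql;
-- then load with: pg_bulkload ... -o "FILTER=test_filter"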

Installation

unzip pg_bulkload-master.zip
cd pg_bulkload-master
make
make install
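The build uses PGXS, so make resolves the target PostgreSQL installation through whichever pg_config is first in PATH; check this before building (the path below is illustrative):

export PATH=/opt/pgsql/bin:$PATH    # assumption: adjust to your PostgreSQL install
which pg_config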

After installation, create the extension in the database:

create extension pg_bulkload;
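A quick catalog query confirms it is installed:

select extname, extversion from pg_extension where extname = 'pg_bulkload';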

pg_bulkload options

[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload --help
pg_bulkload is a bulk data loading tool for PostgreSQL

Usage:
  Dataload: pg_bulkload [dataload options] control_file_path
  Recovery: pg_bulkload -r [-D DATADIR]

Dataload options:
  -i, --input=INPUT             INPUT path or function
  -O, --output=OUTPUT           OUTPUT path or table
  -l, --logfile=LOGFILE         LOGFILE path
  -P, --parse-badfile=*         PARSE_BADFILE path
  -u, --duplicate-badfile=*     DUPLICATE_BADFILE path
  -o, --option="key=val"        additional option

Recovery options:
  -r, --recovery                execute recovery
  -D, --pgdata=DATADIR          database directory

Connection options:
  -d, --dbname=DBNAME           database to connect
  -h, --host=HOSTNAME           database server host or socket directory
  -p, --port=PORT               database server port
  -U, --username=USERNAME       user name to connect as
  -w, --no-password             never prompt for password
  -W, --password                force password prompt

Generic options:
  -e, --echo                    echo queries
  -E, --elevel=LEVEL            set output message level
  --help                        show this help, then exit
  --version                     output version information, then exit

Read the website for details. http://github.com/ossc-db/pg_bulkload
Report bugs to http://github.com/ossc-db/pg_bulkload/issues.

The individual keys accepted by -o are not documented in the help output; the test.log written by the import below shows which parameters are available.
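The same key=val pairs can also be collected into a control file and passed as the positional control_file_path argument instead of repeating -o (a sketch using keys that appear in the log below; the file name is illustrative):

# test.ctl -- run as: pg_bulkload test.ctl -d db_sqlfx -p 6543
INPUT = /home/thunisoft/test.csv
OUTPUT = public.test
TYPE = CSV
WRITER = PARALLEL
TRUNCATE = YES
LOGFILE = /home/thunisoft/test.log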

Using pg_bulkload

db_sqlfx=# create table test(id int,info text ,crt_time timestamp(0));
CREATE TABLE
db_sqlfx=# insert into test select generate_series(1,5000000),md5(random()::text),clock_timestamp();
INSERT 0 5000000
db_sqlfx=# create index i_test_id on test(id);
CREATE INDEX
db_sqlfx=# copy test to '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000

-- load directly, without dropping the index
[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=PARALLEL"  -o "TRUNCATE=YES"
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
	0 Rows skipped.
	5000000 Rows successfully loaded.
	0 Rows not loaded due to parse errors.
	0 Rows not loaded due to duplicate errors.
	0 Rows replaced with new rows.


Note that the COPY export must specify with(format 'csv'); otherwise the pg_bulkload import (TYPE=CSV) reports parse errors.

pg_bulkload timing

21.03 sec

[thunisoft@sdbserver1 pg_bulkload-master]$ cat /home/thunisoft/test.log

pg_bulkload 3.1.18 on 2021-09-22 18:13:26.168808+08

INPUT = /home/thunisoft/test.csv
PARSE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.prs.csv
LOGFILE = /home/thunisoft/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = """
ESCAPE = """
NULL = 
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES

  0 Rows skipped.
  5000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2021-09-22 18:13:26.168808+08
Run ended on 2021-09-22 18:13:47.19649+08

CPU 0.66s/8.85u sec elapsed 21.03 sec
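Note that the log records WRITER = DIRECT with MULTI_PROCESS = YES even though the command line specified WRITER=PARALLEL: the parallel writer is implemented as the direct writer running in multi-process mode, so the effective parameters are reported this way.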

COPY import timing

-- export the data
db_sqlfx=#  copy test to '/home/thunisoft/test.sql';
COPY 5000000

-- clear the table
db_sqlfx=# truncate table test;
TRUNCATE TABLE
Time: 95.247 ms

-- import
db_sqlfx=# copy test from '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000
Time: 38811.062 ms (00:38.811)

With the index in place, pg_bulkload (21.0 s) is noticeably faster than COPY (38.8 s).

How do pg_bulkload and COPY compare when no index exists?

-- drop the index
drop index i_test_id;
-- copy import: 24 s
db_sqlfx=# copy test from '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000
Time: 24227.334 ms (00:24.227)
-- pg_bulkload import: about 10 s
-- recreate the index: 10 s
db_sqlfx=# CREATE INDEX i_test_id ON public.test USING btree (id);
CREATE INDEX
Time: 10552.171 ms (00:10.552)

unlogged table

-- keep the index, clear the data, then flip the table and index to unlogged
-- directly in the catalog
update pg_class set relpersistence='u' where relname='test';
update pg_class set relpersistence='u' where relname='i_test_id';
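Flipping relpersistence by hand is a quick hack for benchmarking; on a real system the supported way (PostgreSQL 9.5+) is:

alter table test set unlogged;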

--pg_bulkload 17s
Run began on 2021-09-23 16:22:52.192337+08
Run ended on 2021-09-23 16:23:09.726125+08
CPU 0.63s/7.43u sec elapsed 17.53 sec

-- copy: 27 s
db_sqlfx=# copy test from '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000
Time: 27020.258 ms (00:27.020)

COPY gains a lot once WAL logging is skipped; pg_bulkload gains little, because its direct mode already bypasses shared_buffers and writes data files directly.

Appendix: a batch loading script:

-bash-4.1$ cat load.sh
#!/bin/sh

# $1: data file name (a gzipped CSV named after the target table)

file=$1

if [ ! -f "$file" ]
then
    echo "File does not exist"
    exit 1
fi

echo "-----------------------------------------------------------------"

# table name = the file name up to the first dot
tbname=$( echo "$file" | cut -d . -f1 )
echo "Table name is : $tbname"

zcat "$file" | pg_bulkload -i stdin -O public.$tbname -l $tbname.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw

echo "load complete"
echo "-----------------------------------------------------------------"
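Usage sketch, assuming a gzipped CSV whose name up to the first dot matches the target table:

./load.sh test.csv.gz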

Summary

1. With an index present, pg_bulkload is roughly twice as fast as COPY; without an index it is still clearly faster.

2. COPY improves on an unlogged table, but still trails pg_bulkload.

3. pg_bulkload's advantage is clear: it bypasses the shared buffers and WAL buffers and writes table files directly. Like an unlogged table, this skips WAL entirely, so after a direct-mode load any streaming-replication standby must be rebuilt, and base backups must be taken again.

4. Import performance remains very good even for very large tables.

Original article (Chinese): https://www.cnblogs.com/zhangfx01/p/15587578.html