Setting up PostgreSQL 12 synchronous streaming replication: synchronous mode not taking effect, and a recovered old primary failing to sync with the new primary

In PostgreSQL 12 the streaming replication settings no longer live in recovery.conf, but the parameters themselves are largely unchanged. With the holidays here and no mood for real work, I decided to set it up and try it out.

Official documentation:

https://www.postgresql.org/docs/12/runtime-config-replication.html

Let's begin:

1) Download the source package:

https://www.postgresql.org/ftp/source/v12.1/

2) Extract and install

tar xzvf postgresql-12.1.tar.gz
cd postgresql-12.1/
./configure --prefix=/opt/pg12 --without-zlib
su root -c 'mkdir -p /opt/pg12 && chown -R postgres:postgres /opt/pg12'
make && make install

3) Create directories and environment variables. For this experiment both instances run on the same machine with two data directories: data1 as the primary, data2 as the standby.

cd /opt/pg12
mkdir data1
mkdir data2

vim ~/pg12.env 
source ~/pg12.env 
[postgres@localhost data1]$ cat ~/pg12.env 
export PGHOME=/opt/pg12/
export PATH=$PGHOME/bin:$PATH
export PGDATA=$PGHOME/data1
export PGPORT=54121

4) Initialize the primary database (PGDATA already points at data1):

initdb
cd data1

vim postgresql.conf
Edit the following:
port = 54121 
wal_level = replica
synchronous_commit = on
max_wal_senders = 10
wal_keep_segments = 1024
synchronous_standby_names = 'standby_node'
hot_standby = on
hot_standby_feedback = on
logging_collector = on

Start the primary:
pg_ctl start

cd ../data2
pg_basebackup -R -X stream -Fp -D ./ -h localhost -p 54121
vim postgresql.conf
Edit the following:
recovery_target_timeline = 'latest'
primary_conninfo = 'application_name=standby_node host=localhost port=54121 user=postgres password=postgres'                                   
promote_trigger_file = '/opt/pg12/data2/promote_trigger_file'
port=54122

Start the standby:
pg_ctl -D ./ start

 

5) Check the replication status — it turns out to be asynchronous, and application_name did not take effect:

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 19728
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | ::1
client_hostname  | 
client_port      | 37651
backend_start    | 2020-01-21 19:55:14.881115-08
backend_xmin     | 488
state            | streaming
sent_lsn         | 0/30175C0
write_lsn        | 0/30175C0
flush_lsn        | 0/30175C0
replay_lsn       | 0/30175C0
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async
reply_time       | 2020-01-21 20:01:38.445309-08

Digging further, I found that postgresql.auto.conf contains an auto-generated primary_conninfo entry (written by pg_basebackup -R) with no application_name in it, and postgresql.auto.conf takes precedence over postgresql.conf. Add the node name there:

[postgres@localhost data2]$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'application_name=standby_node user=postgres passfile=''/home/postgres/.pgpass'' host=localhost port=54121 sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any'
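The root cause is that settings are applied in file order and the last value seen for a key wins; postgresql.auto.conf is read after postgresql.conf, so its primary_conninfo silently overrides the hand-edited one. A minimal Python sketch of that last-wins resolution (the parser is deliberately simplified and hypothetical, not PostgreSQL's real one):

```python
# Simplified model of duplicate-setting resolution: configuration sources
# are read in order and the LAST value seen for a key wins.
# postgresql.auto.conf is read after postgresql.conf, which is why editing
# primary_conninfo only in postgresql.conf had no visible effect here.

def parse_conf(text):
    """Parse 'key = value' lines, ignoring comments and blanks (simplified)."""
    settings = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line or '=' not in line:
            continue
        key, value = line.split('=', 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings

def effective_settings(*conf_texts):
    """Later files override earlier ones (last-wins), as the server does."""
    merged = {}
    for text in conf_texts:
        merged.update(parse_conf(text))
    return merged

postgresql_conf = "primary_conninfo = 'application_name=standby_node host=localhost port=54121'"
auto_conf = "primary_conninfo = 'host=localhost port=54121'"  # no application_name!

# The auto.conf value wins, so application_name is lost:
print(effective_settings(postgresql_conf, auto_conf)["primary_conninfo"])
```

This is also why the fix is to add application_name inside postgresql.auto.conf itself (or use ALTER SYSTEM), rather than editing postgresql.conf again.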

6) Reloading on the standby is not enough; in PostgreSQL 12 a change to primary_conninfo only takes effect after restarting the standby:

pg_ctl -D ./ restart

7) Check replication on the primary again; synchronous replication is now in effect:

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 20151
usesysid         | 10
usename          | postgres
application_name | standby_node
client_addr      | ::1
client_hostname  | 
client_port      | 37657
backend_start    | 2020-01-21 20:03:52.765047-08
backend_xmin     | 488
state            | streaming
sent_lsn         | 0/3017670
write_lsn        | 0/3017670
flush_lsn        | 0/3017670
replay_lsn       | 0/3017670
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | sync
reply_time       | 2020-01-21 20:04:02.884156-08
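The sent_lsn/write_lsn/flush_lsn/replay_lsn columns above are pg_lsn values of the form hi/lo, two hex numbers that together form a 64-bit WAL byte position. As a side note (not part of the setup itself), this is how the byte lag between two of them can be computed:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '0/30175C0' to a 64-bit byte position.
    The part before the slash is the high 32 bits, the part after the low 32."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed on the standby."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)

# Values from the pg_stat_replication output above: standby fully caught up.
print(lag_bytes('0/30175C0', '0/30175C0'))  # 0
```

On the server side the same subtraction is available directly as `sent_lsn - replay_lsn` (pg_lsn subtraction) in SQL.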

  

8) Change synchronous_standby_names on the primary to the ANY (quorum) form; for this parameter a reload is sufficient:

vim postgresql.conf
synchronous_standby_names = 'any 1 (standby_node,node1)'

pg_ctl reload

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 20663
usesysid         | 10
usename          | postgres
application_name | standby_node
client_addr      | ::1
client_hostname  | 
client_port      | 37659
backend_start    | 2020-01-21 20:24:48.714207-08
backend_xmin     | 490
state            | streaming
sent_lsn         | 0/3022AD0
write_lsn        | 0/3022AD0
flush_lsn        | 0/3022AD0
replay_lsn       | 0/3022AD0
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | quorum
reply_time       | 2020-01-21 22:20:01.980318-08

  

Notes (from the documentation):

sync_priority
    Priority of this standby server for being chosen as the synchronous standby in priority-based synchronous replication. This has no effect in quorum-based synchronous replication.

sync_state
    Synchronous state of this standby server. Possible values are:
    async: This standby server is asynchronous.
    potential: This standby server is now asynchronous, but can potentially become synchronous if one of the current synchronous ones fails.
    sync: This standby server is synchronous.
    quorum: This standby server is considered as a candidate for quorum standbys.
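With 'any 1 (standby_node,node1)', a commit waits until any one of the listed standbys has confirmed it, which is why sync_state shows quorum rather than sync. A hedged Python sketch of that quorum decision (simplified; PostgreSQL's real matching also handles case folding, quoting, and '*'):

```python
def quorum_satisfied(spec_names, num_required, confirmed):
    """ANY num (name, ...): the commit may proceed once at least num of the
    listed standbys have confirmed the commit record (quorum-based sync)."""
    return sum(1 for name in confirmed if name in spec_names) >= num_required

# synchronous_standby_names = 'any 1 (standby_node,node1)'
names = ['standby_node', 'node1']
print(quorum_satisfied(names, 1, ['standby_node']))  # True: one member confirmed
print(quorum_satisfied(names, 1, ['other_node']))    # False: no listed member yet
```

In the priority form ('FIRST n (...)' or a bare list), list position assigns sync_priority instead, and the n highest-priority connected standbys are the synchronous ones.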

 

On the standby, check the WAL receive status:

postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 23261
status                | streaming
receive_start_lsn     | 0/3000000
receive_start_tli     | 1
received_lsn          | 0/3024890
received_tli          | 1
last_msg_send_time    | 2020-01-21 22:32:16.786496-08
last_msg_receipt_time | 2020-01-21 22:32:16.786575-08
latest_end_lsn        | 0/3024890
latest_end_time       | 2020-01-21 22:29:16.556672-08
slot_name             | 
sender_host           | localhost
sender_port           | 54121
conninfo              | user=postgres passfile=/home/postgres/.pgpass dbname=replication host=localhost port=54121 application_name=standby_node fallback_application_name=walreceiver sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any

  

9) Replicate through a replication slot:

Create the slot on the primary:

postgres=# select pg_create_physical_replication_slot('node1');
 pg_create_physical_replication_slot 
-------------------------------------
 (node1,)
(1 row)

postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 node1     |        | physical  |        |          | f         | f      |            |      |              |             | 
(1 row)

  

Configure the standby to replicate through slot node1:

echo "primary_slot_name = 'node1' " >> postgresql.conf

pg_ctl -D ./ restart

postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 23468
status                | streaming
receive_start_lsn     | 0/3000000
receive_start_tli     | 1
received_lsn          | 0/6D6F5D0
received_tli          | 1
last_msg_send_time    | 2020-01-21 22:38:27.882638-08
last_msg_receipt_time | 2020-01-21 22:38:27.882767-08
latest_end_lsn        | 0/6D6F5D0
latest_end_time       | 2020-01-21 22:38:27.882638-08
slot_name             | node1
sender_host           | localhost
sender_port           | 54121
conninfo              | user=postgres passfile=/home/postgres/.pgpass dbname=replication host=localhost port=54121 application_name=standby_node fallback_application_name=walreceiver sslmode=disable sslcompression=0 gssencmode=disable target_session_attrs=any

  

The pg_replication_slots view on the primary now shows the slot as active:

postgres=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+----------
slot_name           | node1
plugin              | 
slot_type           | physical
datoid              | 
database            | 
temporary           | f
active              | t
active_pid          | 23469
xmin                | 493
catalog_xmin        | 
restart_lsn         | 0/6D6F6B8
confirmed_flush_lsn | 

  

Replication slots were analyzed in an earlier post (with synchronous replication a slot is not strictly necessary; it is asynchronous replication that needs a slot to make the primary retain WAL files), so I won't repeat that here.

10) PG 12 adds the SQL function pg_promote() for promoting a standby, so there are now three ways to promote:

a. pg_ctl promote

b. create the promote_trigger_file

c. call the pg_promote() function

11) When a stopped former primary rejoins the cluster as a standby, note that the standby_mode parameter is gone; instead the server checks whether a standby.signal file exists in the data directory. The documentation explains:

19.5.4. Archive Recovery
This section describes the settings that apply only for the duration of the recovery. They must be reset for any subsequent recovery you wish to perform.

“Recovery” covers using the server as a standby or for executing a targeted recovery. Typically, standby mode would be used to provide high availability and/or read scalability, 
whereas a targeted recovery is used to recover from data loss. To start the server in standby mode, create a file called standby.signal in the data directory. The server will enter recovery and will not stop recovery when the end of archived WAL is reached,
but will keep trying to continue recovery by connecting to the sending server as specified by the primary_conninfo setting and/or by fetching new WAL segments using restore_command. For this mode,
the parameters from this section and Section 19.6.3 are of interest. Parameters from Section 19.5.5 will also be applied but are typically not useful in this mode. To start the server in targeted recovery mode, create a file called recovery.signal in the data directory. If both standby.signal and recovery.signal files are created, standby mode takes precedence.
Targeted recovery mode ends when the archived WAL is fully replayed, or when recovery_target is reached. In this mode, the parameters from both this section and Section 19.5.5 will be used.
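The decision rule quoted above is small enough to sketch in Python (an illustrative model of the documented behavior, not PostgreSQL's actual startup code):

```python
def recovery_mode(signal_files):
    """Decide the startup mode from the signal files present in the data
    directory, per the rule quoted above: standby.signal takes precedence
    over recovery.signal; with neither, the server starts normally."""
    if 'standby.signal' in signal_files:
        return 'standby'
    if 'recovery.signal' in signal_files:
        return 'targeted-recovery'
    return 'normal'

# e.g. recovery_mode(os.listdir('/opt/pg12/data2'))
print(recovery_mode(['standby.signal', 'recovery.signal']))  # standby wins
```

So to turn the recovered old primary into a standby, touch standby.signal in its data directory (and make sure any leftover recovery.signal does not matter, since standby mode wins anyway).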

  

12) Recovery mode has changed as well: create a recovery.signal file in the data directory. The recovery targets are now unified under these parameters:

recovery_target_time

recovery_target_xid

recovery_target_name

recovery_target_lsn

recovery_target
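The documentation states that at most one of these recovery_target* parameters may be set; if several are specified, the server raises an error at startup. A small hedged sketch of that validation rule (illustrative only, not the server's code):

```python
RECOVERY_TARGETS = ('recovery_target', 'recovery_target_lsn',
                    'recovery_target_name', 'recovery_target_time',
                    'recovery_target_xid')

def check_recovery_target(settings):
    """Return the single recovery target in use, or None if none is set.
    Raise if more than one is specified, mirroring the documented rule."""
    used = [k for k in RECOVERY_TARGETS if k in settings]
    if len(used) > 1:
        raise ValueError('multiple recovery targets specified: ' + ', '.join(used))
    return used[0] if used else None

print(check_recovery_target({'recovery_target_time': "'2020-01-21 20:00:00'"}))
```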

Original post (Chinese): https://www.cnblogs.com/kuang17/p/12228349.html