On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin <ludovicvp@xxxxxxxxx> wrote: > > > On Sun, May 28, 2017 at 6:54 AM, Michael Paquier <michael.paquier@xxxxxxxxx> wrote: >> >> On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin >> <ludovicvp@xxxxxxxxx> wrote: >> > Say, with 9.6.2, a hot_standby fails to connect to a replication slot: >> > FATAL: could not start WAL streaming: ERROR: replication slot "test3" >> > does not exist >> > or >> > FATAL: could not connect to the primary server: FATAL: the database >> > system is starting up >> > >> > Is there a way to reduce the time it takes until the next attempt? I >> > assumed, wrongly I think, that this would be wal_retrieve_retry_interval, >> > but it seems that it won't make a difference. I tried setting it to 3s, but >> > it seems to take 15s still. Here are two log samples: >> >> Could you double-check your configuration? If I set >> wal_retrieve_retry_interval to 1s on a standby, I am able to see a >> connection attempt from a WAL receiver happening with this interval of >> time in the case of repetitive failures. > > > It really is set at 3s on all servers (master and standbies) earlier in the "deployment" process at the same time "listen_addresses", "hot_standby", and others are set. This doesn't seem to happen every time I run tests. I increased logging to DEBUG1. This is what I got when the problem occurred again. Note that the empty line is there in the log file. > > < 2017-05-28 09:29:36.127 CEST > LOG: database system was shut down in recovery at 2017-05-28 09:29:03 CEST > < 2017-05-28 09:29:36.127 CEST > LOG: entering standby mode > < 2017-05-28 09:29:36.127 CEST > DEBUG: checkpoint record is at 0/80002B8 > < 2017-05-28 09:29:36.128 CEST > DEBUG: redo record is at 0/80002B8; shutdown TRUE > < 2017-05-28 09:29:36.128 CEST > DEBUG: next transaction ID: 0:2535; next OID: 18660 > < 2017-05-28 09:29:36.128 CEST > DEBUG: next MultiXactId: 1; next MultiXactOffset: 0 > < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest unfrozen transaction ID: 1750, in database 1 > < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest MultiXactId: 1, in database 1 > < 2017-05-28 09:29:36.128 CEST > DEBUG: commit timestamp Xid oldest/newest: 0/0 > < 2017-05-28 09:29:36.128 CEST > DEBUG: transaction ID wrap limit is 2147485397, limited by database with OID 1 > < 2017-05-28 09:29:36.128 CEST > DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1 > < 2017-05-28 09:29:36.128 CEST > DEBUG: starting up replication slots > < 2017-05-28 09:29:36.128 CEST > DEBUG: resetting unlogged relations: cleanup 1 init 0 > < 2017-05-28 09:29:36.129 CEST > DEBUG: initializing for hot standby > < 2017-05-28 09:29:36.129 CEST > DEBUG: recovery snapshots are now enabled > < 2017-05-28 09:29:36.129 CEST > LOG: consistent recovery state reached at 0/8000328 > < 2017-05-28 09:29:36.129 CEST > LOG: invalid record length at 0/8000328: wanted 24, got 0 > < 2017-05-28 09:29:36.129 CEST > LOG: database system is ready to accept read only connections > < 2017-05-28 09:29:36.135 CEST > FATAL: could not connect to the primary server: FATAL: the database system is starting up > > < 2017-05-28 09:29:36.135 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 > < 2017-05-28 09:29:51.153 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 > < 2017-05-28 09:29:51.158 CEST > LOG: fetching timeline history file for timeline 4 from primary server > < 2017-05-28 09:29:51.160 CEST > LOG: started streaming WAL from primary at 0/8000000 on timeline 3 > < 2017-05-28 09:29:51.160 CEST > LOG: replication terminated by primary server > < 2017-05-28 09:29:51.160 CEST > DETAIL: End of WAL reached on timeline 3 at 0/8000328. > < 2017-05-28 09:29:51.161 CEST > DEBUG: walreceiver ended streaming and awaits new instructions > < 2017-05-28 09:29:51.161 CEST > LOG: new target timeline is 4 > < 2017-05-28 09:29:51.161 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 > < 2017-05-28 09:29:51.161 CEST > LOG: restarted WAL streaming at 0/8000000 on timeline 4 > < 2017-05-28 09:29:51.191 CEST > LOG: redo starts at 0/8000328 And this is the log with log_min_messages DEBUG5: < 2017-05-28 14:48:10.108 CEST > LOG: invalid record length at 0/8000398: wanted 24, got 0 < 2017-05-28 14:48:10.108 CEST > DEBUG: switched WAL source from archive to stream after failure < 2017-05-28 14:48:10.110 CEST > DEBUG: checkpointer updated shared memory configuration values < 2017-05-28 14:48:10.110 CEST > LOG: database system is ready to accept read only connections < 2017-05-28 14:48:10.112 CEST > DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver" < 2017-05-28 14:48:10.112 CEST > DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so" < 2017-05-28 14:48:10.126 CEST > FATAL: could not connect to the primary server: FATAL: the database system is starting up < 2017-05-28 14:48:10.126 CEST > DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make < 2017-05-28 14:48:10.126 CEST > DEBUG: shmem_exit(1): 4 on_shmem_exit callbacks to make < 2017-05-28 14:48:10.127 CEST > DEBUG: proc_exit(1): 2 callbacks to make < 2017-05-28 14:48:10.127 CEST > DEBUG: exit(1) < 2017-05-28 14:48:10.127 CEST > DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make < 2017-05-28 14:48:10.127 CEST > DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make < 2017-05-28 14:48:10.127 CEST > DEBUG: proc_exit(-1): 0 callbacks to make < 2017-05-28 14:48:10.127 CEST > DEBUG: switched WAL source from stream to archive after failure < 2017-05-28 14:48:10.127 CEST > DEBUG: invalid record length at 0/8000398: wanted 24, got 0 < 2017-05-28 14:48:10.127 CEST > DEBUG: switched WAL source from archive to stream after failure < 2017-05-28 14:48:10.130 CEST > DEBUG: reaping dead processes < 2017-05-28 14:48:11.062 CEST > DEBUG: forked new backend, pid=3184 socket=12 < 2017-05-28 14:48:11.063 CEST > DEBUG: postgres child[3184]: starting with ( < 2017-05-28 14:48:11.063 CEST > DEBUG: postgres < 2017-05-28 14:48:11.063 CEST > DEBUG: ) < 2017-05-28 14:48:11.063 CEST > DEBUG: InitPostgres < 2017-05-28 14:48:11.063 CEST > DEBUG: my backend ID is 2 < 2017-05-28 14:48:11.064 CEST > DEBUG: StartTransaction < 2017-05-28 14:48:11.064 CEST > DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:11.075 CEST > DEBUG: CommitTransaction < 2017-05-28 14:48:11.075 CEST > DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make < 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make < 2017-05-28 14:48:11.077 CEST > DEBUG: proc_exit(0): 3 callbacks to make < 2017-05-28 14:48:11.077 CEST > DEBUG: exit(0) < 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make < 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make < 2017-05-28 14:48:11.077 CEST > DEBUG: proc_exit(-1): 0 callbacks to make < 2017-05-28 14:48:11.081 CEST > DEBUG: reaping dead processes < 2017-05-28 14:48:11.081 CEST > DEBUG: server process (PID 3184) exited with exit code 0 < 2017-05-28 14:48:11.095 CEST > DEBUG: forked new backend, pid=3186 socket=12 < 2017-05-28 14:48:11.096 CEST > DEBUG: postgres child[3186]: starting with ( < 2017-05-28 14:48:11.096 CEST > DEBUG: postgres < 2017-05-28 14:48:11.096 CEST > DEBUG: ) ... < 2017-05-28 14:48:24.985 CEST > DEBUG: postgres child[3430]: starting with ( < 2017-05-28 14:48:24.985 CEST > DEBUG: postgres < 2017-05-28 14:48:24.985 CEST > DEBUG: ) < 2017-05-28 14:48:24.985 CEST > DEBUG: InitPostgres < 2017-05-28 14:48:24.985 CEST > DEBUG: my backend ID is 2 < 2017-05-28 14:48:24.986 CEST > DEBUG: StartTransaction < 2017-05-28 14:48:24.986 CEST > DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:24.989 CEST > DEBUG: CommitTransaction < 2017-05-28 14:48:24.989 CEST > DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:24.989 CEST > DEBUG: StartTransactionCommand < 2017-05-28 14:48:24.989 CEST > DEBUG: StartTransaction < 2017-05-28 14:48:24.989 CEST > DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:24.992 CEST > DEBUG: CommitTransactionCommand < 2017-05-28 14:48:24.992 CEST > DEBUG: CommitTransaction < 2017-05-28 14:48:24.992 CEST > DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: < 2017-05-28 14:48:24.993 CEST > DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make < 2017-05-28 14:48:24.993 CEST > DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make < 2017-05-28 14:48:24.994 CEST > DEBUG: proc_exit(0): 3 callbacks to make < 2017-05-28 14:48:24.994 CEST > DEBUG: exit(0) < 2017-05-28 14:48:24.994 CEST > DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make < 2017-05-28 14:48:24.994 CEST > DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make < 2017-05-28 14:48:24.994 CEST > DEBUG: proc_exit(-1): 0 callbacks to make < 2017-05-28 14:48:24.996 CEST > DEBUG: reaping dead processes < 2017-05-28 14:48:24.996 CEST > DEBUG: server process (PID 3430) exited with exit code 0 < 2017-05-28 14:48:25.144 CEST > DEBUG: switched WAL source from stream to archive after failure < 2017-05-28 14:48:25.144 CEST > DEBUG: invalid record length at 0/8000398: wanted 24, got 0 < 2017-05-28 14:48:25.144 CEST > DEBUG: switched WAL source from archive to stream after failure < 2017-05-28 14:48:25.147 CEST > DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver" < 2017-05-28 14:48:25.147 CEST > DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so" < 2017-05-28 14:48:25.161 CEST > LOG: fetching timeline history file for timeline 4 from primary server < 2017-05-28 14:48:25.166 CEST > LOG: started streaming WAL from primary at 0/8000000 on timeline 3 < 2017-05-28 14:48:25.166 CEST > DEBUG: sendtime 2017-05-28 14:48:25.16509+02 receipttime 2017-05-28 14:48:25.166539+02 replication apply delay (N/A) transfer latency 1 ms < 2017-05-28 14:48:25.166 CEST > LOG: replication terminated by primary server < 2017-05-28 14:48:25.166 CEST > DETAIL: End of WAL reached on timeline 3 at 0/8000398. < 2017-05-28 14:48:25.166 CEST > DEBUG: sending write 0/8000398 flush 0/8000398 apply 0/8000398 < 2017-05-28 14:48:25.167 CEST > DEBUG: walreceiver ended streaming and awaits new instructions < 2017-05-28 14:48:25.168 CEST > LOG: new target timeline is 4 < 2017-05-28 14:48:25.168 CEST > DEBUG: switched WAL source from stream to archive after failure < 2017-05-28 14:48:25.168 CEST > DEBUG: invalid record length at 0/8000398: wanted 24, got 0 < 2017-05-28 14:48:25.168 CEST > DEBUG: switched WAL source from archive to stream after failure < 2017-05-28 14:48:25.169 CEST > LOG: restarted WAL streaming at 0/8000000 on timeline 4 < 2017-05-28 14:48:25.170 CEST > DEBUG: sendtime 2017-05-28 14:48:25.168651+02 receipttime 2017-05-28 14:48:25.170063+02 replication apply delay (N/A) transfer latency 1 ms < 2017-05-28 14:48:25.170 CEST > DEBUG: creating and filling new WAL file < 2017-05-28 14:48:25.225 CEST > DEBUG: done creating and filling new WAL file < 2017-05-28 14:48:25.225 CEST > DEBUG: sending write 0/80006C8 flush 0/8000398 apply 0/8000398 < 2017-05-28 14:48:25.226 CEST > DEBUG: sending write 0/80006C8 flush 0/80006C8 apply 0/8000398 < 2017-05-28 14:48:25.227 CEST > DEBUG: sending hot standby feedback xmin 2535 epoch 0 -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general