Hi,
I had a cascade serverA->serverB->serverC->serverD of Postgres 10.14 servers connected with streaming replication.
There was no archive shipping set up, but there was an empty directory /data/pg_archive/10/dedupe_shard1_10/ mentioned in config for it on each of the servers.
When I promoted serverB, serverC crashed:
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: replication terminated by primary server
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-2] 2021-06-16 14:45:43.717 UTC [43934] {-} DETAIL: End of WAL reached on timeline 1 at 190B0/36000000.
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [8-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: fetching timeline history file for timeline 2 from primary server
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[33222]: [9-1] 2021-06-16 14:45:43.736 UTC [33222] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [9-1] 2021-06-16 14:45:43.746 UTC [43934] {-} LOG: restarted WAL streaming at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [10-1] 2021-06-16 14:45:43.746 UTC [43934] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [7-1] 2021-06-16 14:45:43.764 UTC [93365] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [8-1] 2021-06-16 14:45:43.764 UTC [93365] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [7-1] 2021-06-16 14:45:48.771 UTC [93421] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [10-1] 2021-06-16 14:45:48.792 UTC [33222] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [8-1] 2021-06-16 14:45:48.793 UTC [93421] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [11-1] 2021-06-16 14:45:48.803 UTC [33222] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
After I started it it seems to have caught up:
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100187]: [8-1] 2021-06-16 14:59:09.826 UTC [100187] {-} LOG: consistent recovery state reached at 190B0/35FFFFB8
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100172]: [5-1] 2021-06-16 14:59:09.827 UTC [100172] {-} LOG: database system is ready to accept read only connections
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[100187]: [9-1] 2021-06-16 14:59:10.281 UTC [100187] {-} LOG: invalid record length at 190B0/36000898: wanted 24, got 0
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101331]: [6-1] 2021-06-16 14:59:10.292 UTC [101331] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101332]: [6-1] 2021-06-16 14:59:10.332 UTC [101332] {[unknown]-[local]} [unknown]@[unknown] LOG: incomplete startup packet
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-1] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] ERROR: requested starting point 190B0/36000000 on timeline 1 is not in this server's history
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-2] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [6-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: could not receive data from client: Connection reset by peer
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [7-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: unexpected EOF on standby connection
But now serverD crashed in a similar fashion:
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [6-1] 2021-06-16 14:59:10.500 UTC [29127] {-} LOG: fetching timeline history file for timeline 2 from primary server
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-1] 2021-06-16 14:59:10.509 UTC [29127] {-} FATAL: could not start WAL streaming: ERROR: requested starting point 190B0/36000000 on timeline 1 is
not in this server's history
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-2] #011DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [9-1] 2021-06-16 14:59:10.524 UTC [32792] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [6-1] 2021-06-16 14:59:10.556 UTC [29138] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [10-1] 2021-06-16 14:59:10.588 UTC [32792] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [7-1] 2021-06-16 14:59:10.588 UTC [29138] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [11-1] 2021-06-16 14:59:10.598 UTC [32792] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29143]: [6-1] 2021-06-16 14:59:10.616 UTC [29143] {-} LOG: started streaming WAL from primary at 190B0/35000000 on timeline 2
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [6-1] 2021-06-16 14:59:11.505 UTC [32790] {-} LOG: startup process (PID 32792) was terminated by signal 6: Aborted
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [7-1] 2021-06-16 14:59:11.506 UTC [32790] {-} LOG: terminating any other active server processes
I had a cascade serverA->serverB->serverC->serverD of Postgres 10.14 servers connected with streaming replication.
There was no archive shipping set up, but there was an empty directory /data/pg_archive/10/dedupe_shard1_10/ mentioned in config for it on each of the servers.
When I promoted serverB, serverC crashed:
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: replication terminated by primary server
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-2] 2021-06-16 14:45:43.717 UTC [43934] {-} DETAIL: End of WAL reached on timeline 1 at 190B0/36000000.
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [8-1] 2021-06-16 14:45:43.717 UTC [43934] {-} LOG: fetching timeline history file for timeline 2 from primary server
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[33222]: [9-1] 2021-06-16 14:45:43.736 UTC [33222] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [9-1] 2021-06-16 14:45:43.746 UTC [43934] {-} LOG: restarted WAL streaming at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [10-1] 2021-06-16 14:45:43.746 UTC [43934] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [7-1] 2021-06-16 14:45:43.764 UTC [93365] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [8-1] 2021-06-16 14:45:43.764 UTC [93365] {-} FATAL: could not receive data from WAL stream: ERROR: requested starting point 190B0/36000000 is ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [7-1] 2021-06-16 14:45:48.771 UTC [93421] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [10-1] 2021-06-16 14:45:48.792 UTC [33222] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [8-1] 2021-06-16 14:45:48.793 UTC [93421] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [11-1] 2021-06-16 14:45:48.803 UTC [33222] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
After I started it it seems to have caught up:
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100187]: [8-1] 2021-06-16 14:59:09.826 UTC [100187] {-} LOG: consistent recovery state reached at 190B0/35FFFFB8
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100172]: [5-1] 2021-06-16 14:59:09.827 UTC [100172] {-} LOG: database system is ready to accept read only connections
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[100187]: [9-1] 2021-06-16 14:59:10.281 UTC [100187] {-} LOG: invalid record length at 190B0/36000898: wanted 24, got 0
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101331]: [6-1] 2021-06-16 14:59:10.292 UTC [101331] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101332]: [6-1] 2021-06-16 14:59:10.332 UTC [101332] {[unknown]-[local]} [unknown]@[unknown] LOG: incomplete startup packet
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-1] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] ERROR: requested starting point 190B0/36000000 on timeline 1 is not in this server's history
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-2] 2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)} replication@[unknown] DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [6-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: could not receive data from client: Connection reset by peer
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [7-1] 2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)} replication@[unknown] LOG: unexpected EOF on standby connection
But now serverD crashed in a similar fashion:
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [6-1] 2021-06-16 14:59:10.500 UTC [29127] {-} LOG: fetching timeline history file for timeline 2 from primary server
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-1] 2021-06-16 14:59:10.509 UTC [29127] {-} FATAL: could not start WAL streaming: ERROR: requested starting point 190B0/36000000 on timeline 1 is
not in this server's history
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-2] #011DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [9-1] 2021-06-16 14:59:10.524 UTC [32792] {-} LOG: new target timeline is 2
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [6-1] 2021-06-16 14:59:10.556 UTC [29138] {-} LOG: started streaming WAL from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [10-1] 2021-06-16 14:59:10.588 UTC [32792] {-} LOG: invalid contrecord length 1585 at 190B0/35FFFFB8
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [7-1] 2021-06-16 14:59:10.588 UTC [29138] {-} FATAL: terminating walreceiver process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history': No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [11-1] 2021-06-16 14:59:10.598 UTC [32792] {-} PANIC: could not open file "pg_wal/00000002000190B000000035": No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29143]: [6-1] 2021-06-16 14:59:10.616 UTC [29143] {-} LOG: started streaming WAL from primary at 190B0/35000000 on timeline 2
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [6-1] 2021-06-16 14:59:11.505 UTC [32790] {-} LOG: startup process (PID 32792) was terminated by signal 6: Aborted
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [7-1] 2021-06-16 14:59:11.506 UTC [32790] {-} LOG: terminating any other active server processes
and it came
back again too after restart.
Is it all expected? Is the data safe and consistent now or should I re-init serverC and serverD?
Best regards, Alex
Is it all expected? Is the data safe and consistent now or should I re-init serverC and serverD?
Best regards, Alex