12.3 replicas falling over during WAL redo

Ben Chobot <bench@xxxxxxxxxxxxxxx> · Sat, 1 Aug 2020 07:55:27 -0700

We have a few hundred postgres servers in AWS EC2, all of which do 
streaming replication to at least two replicas. As we've transitioned 
our fleet to from 9.5 to 12.3, we've noticed an alarming increase in the
 frequency of a streaming replica dying during replay. Postgres will log
 something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG:  restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL:  incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT:  WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:  startup process (PID 24076) exited with exit code 1

...or:

2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING:  will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC:  failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0

...or:

  2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted

Each time, a simple restart of the postgres service will bring the 
database back to a happy state and it will merrily replicate past the 
LSN where it had died before. This has never (yet) happened on a primary
 db, and (so far) always on only one one of the replicas the primary is 
replicating to, leading me to think there isn't anything actually wrong 
with my data. Still, this is no way to run a database. We never saw this
 problem in 9.5, but it happened 3 times just on Friday. We have taken 
the 
opportunity to enable checksuming with our move to 12, but this doesn't
 appear to be related to that, to my untrained eyes. 

A lot of suggestions I've heard to fix problems which sound like this 
involve reindexing, and while we haven't yet tried that, it doesn't seem
 likely to help. We upgraded from 9.5 to 12 using pglogical, so all our 
indices were created from 12.3 code. That said, we were running 
pg_repack in an automated fashion for a bit, which seemed to be causing 
issues in 12 that we haven't had time to track down and so have 
currently disabled. 

With the non-durability of the problem, I'm out of ideas of what to look
 for. Suggestions?