We have a few hundred postgres servers in AWS EC2, all of which do
streaming replication to at least two replicas. As we've transitioned
our fleet from 9.5 to 12.3, we've noticed an alarming increase in the
frequency of a streaming replica dying during replay. Postgres will log
something like:
...or:
...or:
2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted
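
To compare the failing records across replicas, the LSN and record type can be pulled out of CONTEXT lines like the ones above. What follows is only a minimal Python sketch, assuming the syslog format shown here; the function names are hypothetical and not part of any Postgres tooling.

```python
import re

# Matches the "WAL redo at" CONTEXT line format seen in the excerpt above.
# Assumption: resource manager and record type are separated by "/" and
# followed by ": <detail>", as in "Btree/INSERT_LEAF: off 48".
REDO_RE = re.compile(
    r"CONTEXT:\s+WAL redo at (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+) "
    r"for (?P<rmgr>[^/]+)/(?P<record>[^:]+): (?P<detail>.*)$"
)


def parse_redo_context(line):
    """Return (lsn, rmgr, record, detail) for a redo CONTEXT line, else None."""
    m = REDO_RE.search(line)
    if m is None:
        return None
    return m.group("lsn"), m.group("rmgr"), m.group("record"), m.group("detail")


def lsn_to_int(lsn):
    """Convert an 'X/Y' LSN into one integer (high 32 bits / low 32 bits)."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)
```

Converting the LSN to an integer makes it easy to sort failures or check whether two replicas died at the same position; the textual form is what a tool such as pg_waldump takes as a start position.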
Each time, a simple restart of the postgres service brings the database back to a happy state, and it merrily replicates past the LSN where it had died before. This has never (yet) happened on a primary db, and (so far) always on only one of the replicas the primary is replicating to, leading me to think there isn't anything actually wrong with my data. Still, this is no way to run a database. We never saw this problem in 9.5, but it happened 3 times just on Friday.

We took the opportunity to enable checksumming with our move to 12, but to my untrained eyes this doesn't appear to be related.

Many of the suggestions I've heard for problems that sound like this involve reindexing; we haven't tried that yet, and it doesn't seem likely to help. We upgraded from 9.5 to 12 using pglogical, so all our indices were created by 12.3 code. That said, we were running pg_repack in an automated fashion for a bit, which seemed to be causing issues in 12 that we haven't had time to track down, so we have currently disabled it.

Given that the problem doesn't persist across a restart, I'm out of ideas for what to look for. Suggestions?