At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <bench@xxxxxxxxxxxxxxx> wrote in
>
> Alvaro Herrera wrote on 8/1/20 9:35 AM:
> > On 2020-Aug-01, Ben Chobot wrote:
> >
> >> We have a few hundred postgres servers in AWS EC2, all of which do streaming
> >> replication to at least two replicas. As we've transitioned our fleet to
> >> from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
> >> streaming replica dying during replay. Postgres will log something like:
> >>
> >> |2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
> >> 2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
> >> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
> >> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1|
> >
> > I've never seen this one.
> >
> > Can you find out what the index is being modified by those LSNs -- is it
> > always the same index? Can you have a look at nearby WAL records that
> > touch the same page of the same index in each case?
> >
> > One possibility is that the storage forgot a previous write.
>
> I'd be happy to, if you tell me how. :)
>
> We're using xfs for our postgres filesystem, on ubuntu bionic. Of course it's
> always possible there's something wrong in the filesystem or the EBS layer,
> but that is one thing we have not changed in the migration from 9.5 to 12.3.

All of the cited log lines seem to suggest a relation with deleted btree page items.
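
On the "how" part: the LSN in the CONTEXT line (BCC/CB7AF8B0) tells you which WAL segment file holds the failing record. The following is only a rough sketch of that mapping, not something taken from your setup. It assumes the default 16MB wal_segment_size and timeline 1; both are assumptions, and the function name is made up for illustration.

```python
def wal_segment_name(lsn: str, timeline: int = 1,
                     seg_size: int = 16 * 1024 * 1024) -> str:
    """Map an LSN such as 'BCC/CB7AF8B0' to its WAL segment file name.

    timeline=1 and seg_size=16MB are assumptions; use the values of
    the cluster in question.
    """
    # An LSN is printed as two hex numbers: high 32 bits / low 32 bits.
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    pos = (hi << 32) | lo

    # Segment number, then split it into the two 8-digit name fields.
    seg_no = pos // seg_size
    segs_per_id = 0x100000000 // seg_size
    return "%08X%08X%08X" % (timeline,
                             seg_no // segs_per_id,
                             seg_no % segs_per_id)


print(wal_segment_name("BCC/CB7AF8B0"))  # 0000000100000BCC000000CB
```

With that segment at hand, pg_waldump can be started at or shortly before the LSN to list the nearby records and the relation and block they touch.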
As one possibility, my guess is that this can happen if the pages were flushed out during a vacuum after the last checkpoint and full-page writes did not restore the page to its state before the index-item deletion happened (that is, if full_page_writes was set to off). (If that turns out to be the cause, I'm not sure why it didn't happen on 9.5.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center