Thank you so very much, Tom. Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances tonight (the newest version RDS offers is 10.3, but that’s a start).

> On Jun 5, 2018, at 8:07 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>
> Jonathan Marks <jonathanaverymarks@xxxxxxxxx> writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
>
> I'd suggest updating to 10.4 ... see below.
>
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR: canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was terminated by signal 11: Segmentation fault
>
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly. There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
>
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR: unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656
>
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4. AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well. But an update would be prudent to
> prevent it from happening again.
>
> regards, tom lane
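
For reference, here is a minimal sketch of the recovery steps Tom describes, assuming the TOAST relation named in the log (pg_toast_26656) and a hypothetical parent table called "documents":

    -- The number in pg_toast_NNNNN is the OID of the owning table,
    -- so casting it to regclass reveals the parent table's name:
    SELECT 26656::regclass AS parent_table;

    -- Equivalent catalog lookup, matching on the TOAST relation itself:
    SELECT c.oid::regclass AS parent_table
    FROM pg_class c
    WHERE c.reltoastrelid = 'pg_toast.pg_toast_26656'::regclass;

    -- Vacuum the parent table reported above to clear the stale TOAST
    -- pointers ("documents" is a placeholder; substitute the table name
    -- returned by the query):
    VACUUM documents;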