Thank you so very much, Tom. Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances tonight (the newest version RDS offers is 10.3, but that’s a start).

> On Jun 5, 2018, at 8:07 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>
> Jonathan Marks <jonathanaverymarks@xxxxxxxxx> writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
>
> I'd suggest updating to 10.4 ... see below.
>
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR: canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was terminated by signal 11: Segmentation fault
>
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly. There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
>
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR: unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656
>
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4. AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well. But an update would be prudent to
> prevent it from happening again.
>
> regards, tom lane
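
For reference, here is a minimal sketch of the recovery steps Tom describes, assuming the TOAST relation named in the log (pg_toast_26656) and a hypothetical parent table called "documents":

    -- The number in pg_toast_NNNNN is the OID of the owning table,
    -- so casting it to regclass reveals the parent table's name:
    SELECT 26656::regclass AS parent_table;

    -- Equivalent catalog lookup, matching on the TOAST relation itself:
    SELECT c.oid::regclass AS parent_table
    FROM pg_class c
    WHERE c.reltoastrelid = 'pg_toast.pg_toast_26656'::regclass;

    -- Vacuum the parent table reported above to clear the stale TOAST
    -- pointers ("documents" is a placeholder; substitute the table name
    -- returned by the query):
    VACUUM documents;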