Hello —

We had two issues today (once this morning and once a few minutes ago) with our primary database (RDS running PostgreSQL 10.1, 32 cores, 240 GB RAM, 5 TB total disk, 20k PIOPS) where the database suddenly crashed and went into recovery mode. The first time, we restarted the server after about 5 minutes in an attempt to get the system live; the second time, we let it stay in recovery mode until it recovered on its own (about 10 minutes). The system was not under high load in either case. Both times the server crashed, we saw this in the logs:

    2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR: canceling statement due to statement timeout
    2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
    2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was terminated by signal 11: Segmentation fault

After the first crash, we started getting errors like:

    2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR: unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656

We were able to identify the 15 corrupted rows and the exact fields that are being TOASTed, and we're following Josh Berkus' post here: http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html. We have tried to overwrite the bad fields in those rows with UPDATE, and to remove the rows with DELETE, but every time we do we get:

    ERROR: tuple concurrently updated

We intend to reindex the TOAST table this evening, then try the delete again, and then run pg_repack. However, while that may clear up the TOAST corruption, we don't believe the corruption is the root cause of this issue. We could in theory restore from one of our backups, but that would mean data loss for our clients and would not necessarily resolve the underlying problem.

We're worried that this is a Postgres bug, perhaps due to parallelization — we would appreciate any guidance people can give. Thank you!
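P.S. In case it helps anyone reproduce this, below is a rough sketch of the kind of scan that can locate rows with corrupt TOAST data. It is illustrative only: "our_table", "id", and "payload" are placeholder names, not our real schema. It forces detoasting of the suspect column row by row and reports the ids that raise the chunk error:

    DO $$
    DECLARE
      r RECORD;
    BEGIN
      FOR r IN SELECT id FROM our_table LOOP
        BEGIN
          -- length() on the cast forces the TOASTed column to be fully
          -- detoasted; a row with bad chunks raises the
          -- "unexpected chunk number" error here.
          PERFORM length(payload::text) FROM our_table WHERE id = r.id;
        EXCEPTION WHEN OTHERS THEN
          RAISE NOTICE 'corrupt row id=%: %', r.id, SQLERRM;
        END;
      END LOOP;
    END;
    $$;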
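For completeness, here is tonight's remediation plan expressed as SQL. The TOAST table name is taken directly from our error message; "our_table" and the id list are placeholders:

    -- Rebuild the corrupted TOAST table's index (name from the error message):
    REINDEX TABLE pg_toast.pg_toast_26656;

    -- Then retry removing the bad rows (placeholder table name and ids):
    DELETE FROM our_table WHERE id IN (/* the 15 corrupted ids */);

After that we plan to run pg_repack against the affected table to rewrite it.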
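Finally, since we suspect parallel query: our (unverified) assumption is that parallelism can be disabled as a stopgap while this is diagnosed, along these lines ("OURDB" is the database name as it appears in our logs):

    -- Stop the planner from launching parallel workers.
    SET max_parallel_workers_per_gather = 0;  -- current session only
    ALTER DATABASE "OURDB" SET max_parallel_workers_per_gather = 0;  -- new sessions on this database

(On RDS, the cluster-wide equivalent would go through the DB parameter group rather than postgresql.conf.)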