Hello,
I'm troubleshooting an issue where, about once a week, a database appears to lock up, and then the PostgreSQL process crashes and recovers. When this happens, a few slow queries are logged, but I can't see any pattern in which queries are executing at the time of the crash, and the logged queries don't look like ones that would consume a lot of resources.
It seems like something in the backend is locking up, causing queries to slow down or fail before Postgres crashes.
Here's an example of the log output leading up to and following one of these crashes:
2023-03-30 13:03:21.943 UTC [4155] LOG: duration: 14232.602 ms statement: START TRANSACTION;
2023-03-30 13:03:25.947 UTC [8899] LOG: duration: 17269.756 ms statement: BEGIN
2023-03-30 13:03:28.805 UTC [8874] LOG: duration: 19987.241 ms statement: BEGIN
2023-03-30 13:03:32.068 UTC [8326] LOG: duration: 21541.082 ms statement: BEGIN
2023-03-30 13:04:12.164 UTC [1] LOG: checkpointer process (PID 23) was terminated by signal 9: Killed
2023-03-30 13:04:12.457 UTC [444] LOG: duration: 58248.342 ms parse <unnamed>: INSERT INTO simple_table (id, value) VALUES ($1, $2)
ON CONFLICT(id) DO UPDATE SET value = $2
2023-03-30 13:04:18.256 UTC [4155] LOG: duration: 42389.362 ms statement: COMMIT;
2023-03-30 13:04:23.720 UTC [1] LOG: terminating any other active server processes
2023-03-30 13:04:28.444 UTC [8916] FATAL: the database system is in recovery mode
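One way to narrow down when the stall began is to subtract each logged duration from its log timestamp, which gives the approximate time the statement started, and to list any "terminated by signal" lines alongside. A minimal Python sketch, assuming the log line prefix shown in the excerpt above; the function name `summarize` and the 10-second threshold are hypothetical choices for illustration, not part of PostgreSQL:

```python
import re
from datetime import datetime, timedelta

# Assumed prefix, as in the excerpt: "<timestamp> UTC [<pid>] <LEVEL>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC\s+"
    r"\[(?P<pid>\d+)\]\s+(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)
DURATION_RE = re.compile(r"^duration: (?P<ms>\d+(?:\.\d+)?) ms")
SIGNAL_RE = re.compile(
    r"^(?P<proc>.+?) \(PID (?P<pid>\d+)\) was terminated by signal (?P<sig>\d+)"
)


def summarize(lines, slow_ms=10000.0):
    """Return (slow, killed) extracted from an iterable of log lines.

    slow:   list of (approx_start, logged_at, pid, duration_ms)
    killed: list of (logged_at, process_name, pid, signal_number)
    """
    slow, killed = [], []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            # Continuation lines (e.g. the second line of a multi-line
            # statement) carry no prefix and are skipped.
            continue
        logged_at = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        msg = m["msg"]

        d = DURATION_RE.match(msg)
        if d and float(d["ms"]) >= slow_ms:
            duration_ms = float(d["ms"])
            # The duration is logged when the statement finishes, so the
            # start time is roughly the log timestamp minus the duration.
            started = logged_at - timedelta(milliseconds=duration_ms)
            slow.append((started, logged_at, int(m["pid"]), duration_ms))

        s = SIGNAL_RE.match(msg)
        if s:
            killed.append((logged_at, s["proc"], int(s["pid"]), int(s["sig"])))

    # Sorting by approximate start time shows when the slowdown began.
    slow.sort(key=lambda row: row[0])
    return slow, killed
```

Called as `slow, killed = summarize(open("postgresql.log"))` (the file name is a placeholder), the earliest entry in `slow` approximates when statements first began to stall, which can then be compared against the time of the signal 9 line and against system-level logs for the same window.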
Where should I look to find the root cause?
Thanks!