On 2/15/21 1:15 PM, Joe Abbate wrote:
Hello,
We've been experiencing PG server process crashes about every other week
on a mostly read-only website (except for a single insert/update on page
access). Typical log entries look like
LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed
LOG: terminating any other active server processes

Have you looked at the system logs to see if the OOM killer is involved?

Other than the checkpointer, the server process that was terminated was
either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
SELECT.
The database is always recovered within a second and everything else
appears to resume normally. We're not certain about what triggers this,
but in several instances the web logs show an external bot issuing
multiple HEAD requests on what is logically a single page. The web
server logs show "broken pipe" and EOF errors, and PG logs sometimes
show a number of "incomplete startup packet" messages before the
termination message.
This started roughly when the site was migrated to Go, whose web
"processes" run as "goroutines", scheduled by Go's runtime (previously
the site used Python and Gunicorn to serve the pages, which probably
isolated the PG processes from a barrage of nearly simultaneous requests).
As I understand it, the PG server processes doing a SELECT are spawned
as children of the Go process, so presumably if a "goroutine" dies, the
associated PG process would die too, but I'm not sure I grasp why that
would cause a recovery/restart. I also don't understand where the
checkpointer process fits in the picture (and what would cause it to die).
For the record, this is on PG 11.9 running on Debian.
TIA,
Joe
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx