Re: checkpointer and other server processes crashing

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Mon, 15 Feb 2021 13:29:28 -0800

On 2/15/21 1:15 PM, Joe Abbate wrote:
Hello,

We've been experiencing PG server process crashes about every other week
on a mostly read-only website (except for a single insert/update on page
access).  Typical log entries look like

LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
LOG:  terminating any other active server processes

Have you looked at the system logs to see if the OOM killer is involved?
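
If it helps, here is a minimal sketch for scanning kernel log output for
OOM-killer reports. It assumes the kernel logs a line containing
"Killed process <pid> (<command>)", which is the wording I would expect on
recent Linux kernels, but the exact text varies between versions. The
function name and the regular expression are illustrative only, not part
of any PostgreSQL or system tool.

```python
import re

# Assumed format of the kernel's OOM-killer report line, for example:
#   "Out of memory: Killed process 11200 (postgres) total-vm:..."
# The wording differs between kernel versions, so adjust the pattern
# if your logs look different.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, command) pairs for each OOM kill found in lines."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

You could feed it the output of "dmesg" or "journalctl -k", or an open
kernel log file (the location depends on how syslog is set up on the
Debian host), and then compare the reported PIDs against the PID in the
PG "terminated by signal 9" message.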

Other than the checkpointer, the server process that was terminated was 
either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific 
SELECT.

The database is always recovered within a second and everything else 
appears to resume normally.  We're not certain about what triggers this, 
but in several instances the web logs show an external bot issuing 
multiple HEAD requests on what is logically a single page.  The web 
server logs show "broken pipe" and EOF errors, and the PG logs sometimes
show a number of "incomplete startup packet" messages before the
termination message.

This started roughly when the site was migrated to Go, whose web 
"processes" run as "goroutines", scheduled by Go's runtime (previously 
the site used Python and Gunicorn to serve the pages, which probably 
isolated the PG processes from a barrage of nearly simultaneous requests).

As I understand it, the PG server processes doing a SELECT are spawned 
as children of the Go process, so presumably if a "goroutine" dies, the 
associated PG process would die too, but I'm not sure I grasp why that 
would cause a recovery/restart.  I also don't understand where the 
checkpointer process fits in the picture (and what would cause it to die).

For the record, this is on PG 11.9 running on Debian.

TIA,

Joe

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx