Joe Abbate <jma@xxxxxxxxxxxxxxxxx> writes:

> Hello,
>
> We've been experiencing PG server process crashes about every other week
> on a mostly read only website (except for a single insert/update on page
> access). Typical log entries look like
>
> LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG: terminating any other active server processes
>
> Other than the checkpointer, the server process that was terminated was
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
> SELECT.
>
> The database is always recovered within a second and everything else
> appears to resume normally. We're not certain about what triggers this,
> but in several instances the web logs show an external bot issuing
> multiple HEAD requests on what is logically a single page. The web
> server logs show "broken pipe" and EOF errors, and PG logs sometimes
> shows a number of "incomplete startup packet" messages before the
> termination message.
>
> This started roughly when the site was migrated to Go, whose web
> "processes" run as "goroutines", scheduled by Go's runtime (previously
> the site used Python and Gunicorn to serve the pages, which probably
> isolated the PG processes from a barrage of nearly simultaneous requests).
>
> As I understand it, the PG server processes doing a SELECT are spawned
> as children of the Go process, so presumably if a "goroutine" dies, the
> associated PG process would die too, but I'm not sure I grasp why that
> would cause a recovery/restart. I also don't understand where the
> checkpointer process fits in the picture (and what would cause it to die).
>

A signal 9 typically means something is explicitly killing processes. I
would check your system logs in case something is killing processes due
to running out of some resource (like memory). If it is a fairly recent
Debian system, journalctl might be useful for checking.

--
Tim Cross
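
As a rough sketch of the log check suggested above: the script below scans
kernel messages for lines that look like the Linux OOM killer terminating a
process. It assumes a systemd machine where `journalctl -k` is available, and
it assumes the kernel writes something like "Killed process <pid> (<name>)";
the exact wording differs between kernel versions, so the pattern may need
adjusting. The names `find_oom_kills` and `kernel_log_lines` are made up for
this example.

```python
import re
import subprocess

# Assumed message shape (varies by kernel version), for illustration only:
#   "... Out of memory: Killed process <pid> (<name>) total-vm:..."
# Older kernels print "Killed process <pid> (<name>) ..." on its own line.
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) pairs for lines that look like OOM kills."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def kernel_log_lines():
    """Read kernel messages from the systemd journal (needs journalctl)."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for pid, name in find_oom_kills(kernel_log_lines()):
        print(f"OOM killer terminated PID {pid} ({name})")
```

If a reported PID matches the one in the PostgreSQL "terminated by signal 9"
log line, memory exhaustion is the likely cause. Reading the kernel journal
may require root or membership in a group such as systemd-journal, depending
on how the system is set up.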