Responding to several points below.
Tom Lane wrote:
> David's report isn't too clear: did the syslogger process actually run
> out of memory and exit of its own volition after an ENOMEM, or did it get
> killed by the dreaded OOM killer? In either case, it's unclear whether
> it was really using an excessive amount of memory. We have not heard
> reports suggesting a memory leak in the syslogger, but maybe there is
> one under unusual circumstances?
I'm running on illumos, where there is no OOM killer.
I expect what happened is that the syslogger process attempted to allocate
memory, failed because the system was low on memory, and explicitly exited.
That's consistent with the process having exited, the absence of a core file,
and the "FATAL" "out of memory" entries in the log I posted.
Of course, that doesn't mean the syslogger was using much memory. It means only
that it attempted to allocate memory, of whatever size, at a point where it
could not handle failure.
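For what it's worth, this is the kind of failure path I'm assuming (a rough
sketch of my own, not the actual syslogger code): the allocation fails, the
process logs "FATAL: out of memory", and it exits voluntarily, which is why
there's no core file.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Rough sketch, not PostgreSQL source: on allocation failure the
     * process reports "out of memory" and exits of its own volition --
     * no signal is raised and no core file is produced.
     */
    static void *
    alloc_or_exit(size_t size)
    {
        void   *ptr = malloc(size);

        if (ptr == NULL)
        {
            /* stand-in for ereport(FATAL, ... "out of memory" ...) */
            fprintf(stderr, "FATAL:  out of memory\n");
            exit(1);
        }
        return ptr;
    }

    int
    main(void)
    {
        /* A deliberately huge request to force the failure path. */
        char   *p = alloc_or_exit(SIZE_MAX / 2);

        p[0] = 'x';
        return 0;
    }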
Andres Freund wrote:
> I'm somewhat inclined to say that one has to live with this if the
> system is so resource constrainted that processes barely using memory
> get killed.
Again, there's no OOM killer here, so the process has the opportunity to respond
differently. I can understand the database not functioning well when the system
is out of memory, but this half-functioning, half-deadlocked state is quite
disruptive. While not ideal, it would have been significantly better for my use
case if the syslogger had raised SIGABRT instead: our process manager would have
terminated all of the processes associated with the database and restarted it.
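For illustration only (a hypothetical variant of the sketch above, not a
proposal for how the syslogger should behave), the difference amounts to
calling abort() rather than exiting on the failure path, so the termination is
abnormal and visible to the process manager:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * Hypothetical variant, for illustration only: abort() on
         * allocation failure raises SIGABRT, so the process manager sees
         * an abnormal termination (and usually gets a core) rather than
         * a quiet exit.
         */
        if (malloc(SIZE_MAX / 2) == NULL)   /* deliberately huge request */
        {
            fprintf(stderr, "FATAL:  out of memory\n");
            abort();            /* SIGABRT instead of exit(1) */
        }
        return 0;
    }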
Tom Lane wrote:
> Hm, so that's another angle David didn't report on: is it possible that
> his workload could have resulted in a very large volume of incomplete
> in-progress log messages?
Yes. I mentioned in my Nov 6 mail that large log messages over a short period
appear to have been a major contributing factor:
> Here are all of the entries in the PostgreSQL log from 23:19:12 until the top of
> the next hour:
...
> There are also many log entries for some very long SQL queries. I'm sure that
> contributed to this problem by filling up the pipe. I was able to extract the
> contents of the pipe while the system was hung, and it was more of these giant
> query strings.
Andres Freund wrote:
> Note that there's plenty of cases where you could run into this even
> without being unable to fork new processes. You'd e.g. could also run
> into this while logging the exit of some other subprocess or such,
> there's enough ereports in postmaster.
This is a critical point. As far as I can tell, all that's necessary for this
deadlock to occur is:
- the syslogger is unable to make forward progress (e.g., because it segfaulted)
- enough other processes write to the pipe that it fills up before the
postmaster can restart the syslogger
- the postmaster attempts to write to the log as part of any of its signal
handling operations
It seems to me that the same thing could happen if, for example, a bug in the
syslogger caused it to segfault around the same time that the postmaster
attempted to start an autovacuum worker (given reasonably high log traffic).
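To make the failure mode concrete, here is a minimal, standalone sketch of the
mechanism (my own illustration, not PostgreSQL code). Nothing ever reads from
the pipe, standing in for a syslogger that has exited, so the writer makes
progress only until the kernel's pipe buffer fills and then blocks in write()
indefinitely:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int     pipefd[2];
        char    buf[PIPE_BUF];

        if (pipe(pipefd) != 0)
        {
            perror("pipe");
            return 1;
        }

        /* No reader ever drains pipefd[0]. */
        memset(buf, 'x', sizeof(buf));

        for (;;)
        {
            /*
             * Once the kernel pipe buffer is full, this blocks forever.
             * If the blocked writer is the postmaster inside one of its
             * signal handling paths, it can no longer reap children or
             * restart the syslogger.
             */
            ssize_t n = write(pipefd[1], buf, sizeof(buf));

            if (n < 0)
            {
                perror("write");
                return 1;
            }
            fprintf(stderr, "wrote %zd bytes\n", n);
        }
    }

The output stops once the kernel pipe buffer (typically a few tens of
kilobytes) fills, and the process then hangs in write() until something reads
from the other end of the pipe.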
Thanks,
Dave