Re: postmaster deadlock while logging after syslogger exited

Tom Lane <tgl@xxxxxxxxxxxxx> · Thu, 16 Nov 2017 21:39:49 -0500

Andres Freund <andres@xxxxxxxxxxx> writes:
> On 2017-11-06 15:35:03 -0500, Tom Lane wrote:
>> David Pacheco <dap@xxxxxxxxxx> writes:
>>> I ran into what appears to be a deadlock in the logging subsystem.  It
>>> looks like what happened was that the syslogger process exited because it
>>> ran out of memory.  But before the postmaster got a chance to handle the
>>> SIGCLD to restart it, it handled a SIGUSR1 to start an autovacuum worker.
>>> That also failed, and the postmaster went to log a message about it, but
>>> it's blocked on the pipe that's normally connected to the syslogger,
>>> presumably because the pipe is full because the syslogger is gone and
>>> hasn't read from it.

>> Ugh.

> I'm somewhat inclined to say that one has to live with this if the
> system is so resource constrained that processes barely using memory
> get killed.

David's report isn't too clear: did the syslogger process actually run
out of memory and exit of its own volition after an ENOMEM, or did it get
killed by the dreaded OOM killer?  In either case, it's unclear whether
it was really using an excessive amount of memory.  We have not heard
reports suggesting a memory leak in the syslogger, but maybe there is
one under unusual circumstances?

I think you're probably right that the real cause here is the OOM
killer just randomly seizing on the syslogger as a victim process.
But since the syslogger disconnects from shared memory, it's not
very clear why it would score high on the OOM killer's metrics.
The whole thing is definitely odd.

> We could work around a situation like that if we made postmaster use a
> *different* pipe as stderr than the one we're handing to normal
> backends. If postmaster created a new pipe and closed the read end
> whenever forking a syslogger, we should get EPIPEs when writing after
> syslogger died and could fall back to proper stderr or such.

I think that's nonsense, unfortunately.  If the postmaster had its
own pipe, that would reduce the risk of this deadlock because only
the postmaster would be filling that pipe, not the postmaster and
all its other children --- but it wouldn't eliminate the risk.
I doubt the increase in reliability would be enough to justify the
extra complexity and cost.
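
For reference, the pipe semantics under discussion can be reproduced
outside PostgreSQL.  The following is only a Python sketch, not
postmaster code; the function names are made up for the example.  It
shows that a writer gets EPIPE only once every read end of the pipe
is closed.  If some read descriptor is still open but nobody is
reading, a full pipe makes a non-blocking write fail with EAGAIN, and
a blocking write would simply hang.

```python
import os


def fill_pipe(write_fd):
    """Write to the pipe until the kernel buffer is full; return bytes written."""
    os.set_blocking(write_fd, False)
    total = 0
    # Fill with large chunks first, then single bytes, so the buffer ends up
    # completely full regardless of how the platform rounds pipe capacity.
    for chunk in (b"x" * 4096, b"x"):
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                break
    return total


def write_result(write_fd, data=b"log line\n"):
    """Attempt one non-blocking write; report 'ok', 'would_block', or 'epipe'."""
    os.set_blocking(write_fd, False)
    try:
        os.write(write_fd, data)
        return "ok"
    except BlockingIOError:
        # Pipe is full and a read end is still open somewhere.
        return "would_block"
    except BrokenPipeError:
        # No read end is open any more.  Python ignores SIGPIPE by default,
        # so the failure surfaces as EPIPE rather than killing the process.
        return "epipe"


def demo():
    # Case 1: the read end stays open, but nothing ever reads from it.
    r1, w1 = os.pipe()
    filled = fill_pipe(w1)
    full_result = write_result(w1)
    os.close(r1)
    os.close(w1)

    # Case 2: every read end is closed before the write.
    r2, w2 = os.pipe()
    os.close(r2)
    closed_result = write_result(w2)
    os.close(w2)

    return filled, full_result, closed_result


if __name__ == "__main__":
    print(demo())
```

In case 1 a process doing an ordinary blocking write() would wait
indefinitely, which is the state the postmaster was reported to be in.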

What might be worth thinking about is allowing the syslogger process to
inherit the postmaster's OOM-kill-proofness setting, instead of dropping
down to the same vulnerability as the postmaster's other child processes.
That presumes that this was an otherwise-unjustified OOM kill, which
I'm not quite sure of ... but it does seem like a situation that could
arise from time to time.
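
A rough sketch of that idea, in Python rather than C and with entirely
hypothetical names (child_oom_adjust, protected_kinds): after fork, a
child normally rewrites its OOM score adjustment to the less-protected
child value, and the syslogger would simply skip that step.  On Linux
the file in question would be something like /proc/self/oom_score_adj
(what PG_OOM_ADJUST_FILE is normally pointed at); the demo below uses a
temporary file as a stand-in so it runs without any privileges.

```python
import os
import tempfile


def child_oom_adjust(kind, adjust_file, child_value="0",
                     protected_kinds=("syslogger",)):
    """Reset the OOM adjustment in a freshly forked child unless it is exempt.

    Returns True if the adjustment file was rewritten, False if the child
    keeps whatever setting it inherited from its parent.
    """
    if not adjust_file or kind in protected_kinds:
        return False
    with open(adjust_file, "w") as f:
        f.write(child_value)
    return True


def demo():
    # Stand-in for the per-process adjustment file; "-1000" plays the role
    # of the parent's OOM-kill-proof setting.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            f.write("-1000")

        syslogger_changed = child_oom_adjust("syslogger", path)
        with open(path) as f:
            after_syslogger = f.read()

        backend_changed = child_oom_adjust("backend", path)
        with open(path) as f:
            after_backend = f.read()
    finally:
        os.remove(path)
    return syslogger_changed, after_syslogger, backend_changed, after_backend


if __name__ == "__main__":
    print(demo())
```

Whether exempting the syslogger is wise depends on the open question
above, namely whether it could ever legitimately be the process that
is eating the memory.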

			regards, tom lane
