Re: postmaster deadlock while logging after syslogger exited

David Pacheco <dap@xxxxxxxxxx> · Mon, 4 Dec 2017 14:55:02 -0800

Thanks again for helping out.

On Mon, Dec 4, 2017 at 2:12 PM, Andres Freund <andres@xxxxxxxxxxx> wrote:
On 2017-12-04 13:57:52 -0800, David Pacheco wrote:

> On Mon, Dec 4, 2017 at 12:23 PM, Andres Freund <andres@xxxxxxxxxxx> wrote:
> > FWIW, I'd like to see a report of this around the time the issue
> > occurred before doing anything further here.
>
> This failure begins when this process exits, so the best you could get is
> memory in use immediately before it exited.  I obviously can't get that now
> for the one instance I saw weeks ago, but maybe PostgreSQL could log
> information about current memory usage when it's about to exit because of
> ENOMEM?

It already does so.

In that case, do you have the information you need in the log that I posted earlier in the thread?
(https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt)

What I was wondering about was the memory usage some time before it
dies. In particular while the workload with the long query strings is
running. ps output would be good, gdb'ing into the process and issuing
MemoryContextStats(TopMemoryContext) would be better.

Would it make sense for PostgreSQL to periodically sample the memory used by the current process, keep a small ring buffer of recent samples, and then log all of that when it exits due to ENOMEM?

One does not know that one is going to run into this problem before it happens, and it may not happen very often.  (I've only seen it once.)  The more PostgreSQL can keep the information needed to understand something like this after the fact, the better -- particularly since the overhead required to maintain this information should not be that substantial.
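To make the idea concrete, here is a rough sketch of the kind of fixed-size sample buffer I have in mind. It is only an illustration: none of these names (MemSampleRing, mem_ring_record, and so on) exist in PostgreSQL, the ring size is arbitrary, and the caller is assumed to supply the byte count from whatever accounting is available. The buffer is statically sized so that recording and dumping samples need no heap allocation of their own, which matters if the dump runs after an allocation has already failed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical ring size; a real value would need tuning. */
#define MEM_SAMPLE_SLOTS 8

/* One observation of process memory use (illustrative, not a PostgreSQL type). */
typedef struct MemSample
{
    time_t when;          /* when the sample was taken */
    size_t bytes_in_use;  /* memory in use, as reported by the caller */
} MemSample;

/* Fixed-size ring: the newest sample overwrites the oldest once full. */
typedef struct MemSampleRing
{
    MemSample slots[MEM_SAMPLE_SLOTS];
    size_t    next;   /* index of the slot to overwrite next */
    size_t    count;  /* number of valid samples, at most MEM_SAMPLE_SLOTS */
} MemSampleRing;

static void
mem_ring_init(MemSampleRing *ring)
{
    ring->next = 0;
    ring->count = 0;
}

/* Record one sample; never allocates. */
static void
mem_ring_record(MemSampleRing *ring, time_t when, size_t bytes)
{
    assert(ring != NULL);
    ring->slots[ring->next].when = when;
    ring->slots[ring->next].bytes_in_use = bytes;
    ring->next = (ring->next + 1) % MEM_SAMPLE_SLOTS;
    if (ring->count < MEM_SAMPLE_SLOTS)
        ring->count++;
}

/* Copy up to max samples into out, oldest first; returns the number copied. */
static size_t
mem_ring_snapshot(const MemSampleRing *ring, MemSample *out, size_t max)
{
    size_t start = (ring->next + MEM_SAMPLE_SLOTS - ring->count) % MEM_SAMPLE_SLOTS;
    size_t n = ring->count < max ? ring->count : max;

    for (size_t i = 0; i < n; i++)
        out[i] = ring->slots[(start + i) % MEM_SAMPLE_SLOTS];
    return n;
}

/*
 * Write all retained samples to fp, oldest first.  The scratch array lives
 * on the stack.  A real implementation would probably avoid stdio here,
 * since stdio may buffer internally.
 */
static void
mem_ring_dump(const MemSampleRing *ring, FILE *fp)
{
    MemSample buf[MEM_SAMPLE_SLOTS];
    size_t    n = mem_ring_snapshot(ring, buf, MEM_SAMPLE_SLOTS);

    for (size_t i = 0; i < n; i++)
        fprintf(fp, "mem sample %zu: t=%lld bytes=%zu\n",
                i, (long long) buf[i].when, buf[i].bytes_in_use);
}
```

A process would call mem_ring_record() on some timer and mem_ring_dump() from the path that handles the fatal out-of-memory error.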

> That way if anybody hits a similar condition in the future, the
> data will be available to answer your question.
>
> That said, I think the deadlock itself is pretty well explained by the data
> we have already.

Well, it doesn't really explain the root cause, and thus the extent of
the fixes required. If the root cause is the amount of memory used by
syslogger, we can remove the deadlock, but the experience is still going
to be bad. Obviously better, but still bad.

Fair enough.  But we only know about one problem for sure, which is the deadlock.  There may be a second problem of the syslogger using too much memory, but I don't think there's any evidence to point in that direction.  Once the whole system is out of memory (and it clearly was, based on the log entries), anything that tried to allocate would fail, and the log reflects that a number of different processes did fail to allocate memory.  I'd help investigate this question, but I have no more data about it, and I'm not sure when I will run into this again.

Thanks,
Dave