Hello Christoph,
Thanks for your answer and the suggestions, it already helped me out a lot!
On 2024-09-14 22:11, Christoph Moench-Tegeder wrote:
> Hi,
>
> ## Thomas Ziegler (thomas.ziegler@xxxxxxxxxxxxxxxx):
> There's a lot of information missing here. Let's start from the top.
>
>> I have had my database killed by the kernel oom-killer. After that I
>> turned off memory over-committing and that is where things got weird.
> What exactly did you set? When playing with vm.overcommit, did you
> understand "Committed Address Space" and the workings of the
> overcommit accounting? This is the document:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/mm/overcommit-accounting.rst
> Hint: when setting overcommit_memory=2 you might end up with way
> less available address space than you thought you would. Also keep
> an eye on /proc/meminfo - it's sometimes hard to estimate "just off
> the cuff" what's in memory and how it's mapped. (Also, anything
> else on that machine which might hog memory?).
I set overcommit_memory=2, but completely missed 'overcommit_ratio'.
That is most probably why the database was denied memory a lot sooner
than I expected.
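
If I read the kernel documentation right, with overcommit_memory=2 the
commit limit works out roughly like this (leaving huge pages aside; the
numbers below are only an illustration, not my actual machine):

    CommitLimit = Swap + PhysicalRAM * overcommit_ratio / 100
    e.g. 64 GiB RAM, no swap, default overcommit_ratio = 50:
    CommitLimit = 0 + 64 GiB * 50/100 = 32 GiB

So with the default ratio and little or no swap, only about half of the
RAM can actually be committed, which would explain running into the
limit much earlier than expected.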
> Finally, there's this:
>
>> 2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel worker" (PID 3808076) exited with exit code 1
>> terminate called after throwing an instance of 'std::bad_alloc'
>> what(): std::bad_alloc
>> 2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel worker" (PID 3808077) was terminated by signal 6: Aborted
>
> That "std::bad_alloc" sounds a lot like C++ and not like the C our
> database is written in. My first suspicion would be that you're using
> LLVM-JIT (unless you have other - maybe even your own - C++ extensions
> in the database?) and that in itself can use a good chunk of memory.
> And it looks like that exception bubbled up as a signal 6 (SIGABRT)
> which made the process terminate immediately without any cleanup,
> and after that the server has no other chance than to crash-restart.
Except for pgAudit, I don't have any extensions, so it is probably the
JIT. I had no idea there was a JIT, even though it should have been obvious.
Thanks for pointing this out!
Is the memory the JIT takes limited by 'work_mem' or will it just take
as much memory as it needs?
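
Either way, I will try running with the JIT disabled for a while to see
whether the bad_alloc goes away. If I read the docs right it can be
toggled per session or in postgresql.conf without a restart, something
like:

    SET jit = off;            -- per session, for a quick test
    jit = off                 -- or instance-wide in postgresql.conf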
> I recommend starting with understanding the actual memory limits
> as set by your configuration (personally I believe that memory
> overcommit is less evil than some people think). Have a close look
> at /proc/meminfo and if possible disable JIT and check if it changes
> anything. Also if possible try starting with only a few active
> connections and increase load carefully once a steady state (in
> terms of memory usage) has been reached.
Yes, understanding the memory limits is what I was trying to do.
I was questioning my understanding of the database, but it seems it was
Linux, or rather my lack of understanding there, that tripped me up.
Memory management and /proc/meminfo still manage to confuse me.
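
For now I will keep an eye on the commit accounting directly, since
that is (as far as I understand) what decides allocations with
overcommit_memory=2: Committed_AS has to stay below CommitLimit.

    grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo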
Again, thanks for your help!
Cheers,
Thomas
p.s.: To anybody who stumbles upon this in the future,
if you set `overcommit_memory=2`, don't forget `overcommit_ratio`.
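Something along these lines (the ratio of 80 is only an example, pick
one that fits your RAM and swap):

    sysctl vm.overcommit_memory vm.overcommit_ratio   # check current values
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80
    # persist the settings in /etc/sysctl.conf or /etc/sysctl.d/
    # so they survive a reboot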