Re: Out of memory error on automatic vacuum

Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> · Mon, 18 Nov 2019 13:25:28 +0100

On Mon, Nov 18, 2019 at 12:41:27PM +0100, Ekaterina Amez wrote:
Hi all,

This morning I was checking postgres servers logs, looking for errors 
(we've recently upgraded them and changed default config) and long 
running queries when I found one of the servers had really big logs 
since yesterday.  It was giving the error of this mail's subject: out 
of memory, failed on request of size XXX on automatic vacuum of table 
YYY. A quick search revealed me some postgresql-lists messages talking 
about work_mem and shared_buffers configuration options, some kernel 
config options too. Although all of them were messages from several 
years ago, I decided to cut my shared_buffers configured value and 
restart server: now it looks like error is gone. But I'd like to 
understand what's beyond the logged error (it's really long and refers 
to things about inner functionalities that I'm missing), how to detect 
what config options are possibly conflicting and, most important, I 
want to know if I've solved it right.

Unfortunately that's hard to say, without further data. The "out of
memory" errors simply mean we called malloc() and it returned NULL,
because the kernel was unable to provide the memory.

There probably were other processes using all the available RAM (the
limit depends on various config values, e.g. overcommit). What were
these processes doing we don't know :-(

For example, there might be multiple complex queries, allocating
several work_mem each, using quite a bit of memory. Or there might be a
runaway query doing HashAgg allocating much more memory than predicted.
Or maybe there was running a completely separate job (say, backup)
allocating a lot of memory or dirtying data in page cache.

There are countless options what might have happened. The memory context
stats are nice, but it's just a snapshot from one particular process,
and it does not seem very interesting (the total is just ~1MB, so
nothing extreme). We still don't know what else was running.

Lowering shared_buffers certainly does reduce the memory pressure in
general, i.e. there is 1GB of work for use by processes. It may be
sufficient, hard to guess.

I don't know if work_mem 64MB is too low, becuase it depends on what
queries you're running etc. But you probably don't wat to increase that,
as it allows processes to use more memory when executing queries, i.e.
it increases memory pressure and makes OOM more likely.

So you need to watch system monitoring, see how much memory is being
used (excluding page cache) and consider reducing work_mem and/or
max_connections if it's too close.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services