On 18/11/19 at 13:25, Tomas Vondra wrote:

> On Mon, Nov 18, 2019 at 12:41:27PM +0100, Ekaterina Amez wrote:
>> Hi all,
>>
>> This morning I was checking the Postgres servers' logs, looking for
>> errors (we've recently upgraded them and changed the default config)
>> and long-running queries, when I found that one of the servers had
>> really big logs since yesterday. It was giving the error in this
>> mail's subject: out of memory, failed on request of size XXX on
>> automatic vacuum of table YYY. A quick search turned up some
>> postgresql-lists messages talking about the work_mem and
>> shared_buffers configuration options, and some kernel config
>> options too. Although all of them were messages from several years
>> ago, I decided to cut my configured shared_buffers value and
>> restart the server: now it looks like the error is gone. But I'd
>> like to understand what's behind the logged error (it's really long
>> and refers to internals I'm not familiar with), how to detect which
>> config options may be conflicting and, most importantly, whether
>> I've fixed it the right way.
>
> Unfortunately that's hard to say without further data. The "out of
> memory" errors simply mean we called malloc() and it returned NULL,
> because the kernel was unable to provide the memory.
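
(For illustration only: the pattern behind that error message is
roughly the sketch below; the names are made up and this is not
PostgreSQL's actual allocation code.)

    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal sketch (hypothetical names): the error appears when a
     * memory request cannot be satisfied, i.e. malloc() returns NULL. */
    static void *alloc_or_report(size_t size)
    {
        void *ptr = malloc(size);

        if (ptr == NULL)
            fprintf(stderr,
                    "out of memory: failed on request of size %zu\n",
                    size);
        return ptr;
    }

    int main(void)
    {
        void *buf = alloc_or_report(1024 * 1024);   /* 1 MB request */

        free(buf);
        return 0;
    }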

This (kernel unable to provide memory) was because no more RAM was
available to allocate? Was it because the PG process did not have any
more memory assigned and ready to use? Or is it unknown, because it
depends on the situation in which the error is thrown?

> There probably were other processes using all the available RAM (the
> limit depends on various config values, e.g. overcommit). What these
> processes were doing, we don't know :-(
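
(As a side note: on Linux, the overcommit policy mentioned above can
be read from /proc. A rough sketch, assuming a Linux host, could be:)

    #include <stdio.h>

    /* Sketch, assuming Linux: print vm.overcommit_memory, which
     * influences when allocation requests start failing.
     * 0 = heuristic overcommit, 1 = always allow, 2 = strict. */
    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode;

        if (f == NULL || fscanf(f, "%d", &mode) != 1)
        {
            perror("overcommit_memory");
            return 1;
        }
        printf("vm.overcommit_memory = %d\n", mode);
        fclose(f);
        return 0;
    }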

> For example, there might be multiple complex queries, each allocating
> several times work_mem, using quite a bit of memory. Or there might
> be a runaway query doing a HashAgg that allocates much more memory
> than predicted.
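
(A purely hypothetical illustration of how that multiplies: with
work_mem = 64MB, a plan containing 4 sorts/hashes may use up to about
4 x 64MB = 256MB, and ten such queries running concurrently about
2.5GB, all on top of shared_buffers.)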

> Or maybe a completely separate job (say, a backup) was running,
> allocating a lot of memory or dirtying data in the page cache.

I've looked at cron and I've seen a scheduled process that finished a
bit before the errors began to be logged (a couple of minutes or so).
The errors began on Sunday morning, and this machine doesn't have much
workload on work days, and even less on weekends. I'll keep an eye on
this log and, if the problem appears again, I'll try to track database
activity and machine activity.

> There are countless options for what might have happened. The memory
> context stats are nice, but they are just a snapshot from one
> particular process, and it does not seem very interesting (the total
> is just ~1MB, so nothing extreme). We still don't know what else was
> running.

When you talk about ~1MB, are you getting this size from log lines
like this one?

    <index>: 1024 total in 1 blocks; 476 free (0 chunks); 548 used

> Lowering shared_buffers certainly does reduce the memory pressure in
> general, i.e. there is about 1GB more for use by processes. It may
> be sufficient, hard to guess.
>
> I don't know if work_mem 64MB is too low, because it depends on what
> queries you're running etc. But you probably don't want to increase
> it, as that allows processes to use more memory when executing
> queries, i.e. it increases memory pressure and makes OOM more likely.
>
> So you need to watch system monitoring, see how much memory is being
> used (excluding page cache), and consider reducing work_mem and/or
> max_connections if it's too close.
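
(To put a number on "memory used excluding the page cache", a rough
sketch, assuming Linux and the /proc/meminfo format, could be:)

    #include <stdio.h>

    /* Sketch, assuming Linux: approximate "memory used excluding page
     * cache" as MemTotal - MemFree - Buffers - Cached (values in kB).
     * This ignores slab, shared memory accounting details, etc. */
    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long total = 0, freemem = 0, buffers = 0, cached = 0;

        if (f == NULL)
        {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (sscanf(line, "MemTotal: %ld kB", &total) == 1) continue;
            if (sscanf(line, "MemFree: %ld kB", &freemem) == 1) continue;
            if (sscanf(line, "Buffers: %ld kB", &buffers) == 1) continue;
            sscanf(line, "Cached: %ld kB", &cached);
        }
        fclose(f);

        printf("used excluding page cache: %ld kB\n",
               total - freemem - buffers - cached);
        return 0;
    }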

I will, thanks for your suggestions.

> regards

Regards,
Ekaterina