Re: Memory Leak executing small queries without closing the connection - FreeBSD

On Sun, 13 Dec 2015 16:35:08 +0100
Gerhard Wiesinger <lists@xxxxxxxxxxxxx> wrote:

> Hello Bill,
> 
> Thank you for your response, comments inline:
> 
> On 13.12.2015 16:05, Bill Moran wrote:
> > On Sun, 13 Dec 2015 09:57:21 +0100
> > Gerhard Wiesinger <lists@xxxxxxxxxxxxx> wrote:
> >> some further details from the original FreeBSD 10.1 machine:
> >>
> >> Mem: 7814M Active, 20G Inact, 2982M Wired, 232M Cache, 1661M Buf, 30M Free
> >> Swap: 512M Total, 506M Used, 6620K Free, 98% Inuse
> >>
> >>     PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME WCPU COMMAND
> >> 77941 pgsql         5  20    0  7925M  7296M usem    2 352:34 6.98%
> >> postgres: username dbnamee 127.0.0.1(43367)  (postgres)
> > <snip>
> >
> > I see no evidence of an actual leak here. Each process is basically using
> > the 7G of shared_buffers you have allocated in the config (which is only
> > 7G _total_ for all processes, since it's shared memory)
> 
> OK, but why do we then get: kernel: swap_pager_getswapspace(4): failed?

You haven't provided enough information to isolate the cause yet. What's
in the Postgres log? Surely it will have logged something when its request
for RAM was denied, and that will be more informative than the OS's
generic message.
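
If you're not sure where the server is logging, a few SHOW commands from
psql will tell you (shown only as a pointer; the actual settings on your
system may differ):

  -- where is the server writing its log?
  SHOW logging_collector;   -- 'on' means PostgreSQL manages its own log files
  SHOW log_destination;     -- e.g. stderr, syslog
  SHOW data_directory;
  SHOW log_directory;       -- relative to data_directory unless absolute

A real allocation failure in a backend shows up there as an "out of memory"
ERROR together with a dump of the memory context statistics, which is far
more useful than the kernel's one-liner.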

> >> Out of memory:
> >> kernel: swap_pager_getswapspace(4): failed
> >> kernel: swap_pager_getswapspace(8): failed
> >> kernel: swap_pager_getswapspace(3): failed
> >>
> >> The main issue is IMHO (as far as I understand the FreeBSD memory
> >> system) that 20G are INACTIVE. When I subtract the shared memory, even
> >> ~13GB should still be available, but it remains allocated and inactive
> >> (INACTIVE). INACTIVE memory might be clean or dirty. Since we get into
> >> out-of-memory situations, it is likely that the memory is dirty
> >> (otherwise it would have been reused).
> > Not quite correct. Inactive memory is _always_ available for re-use.
> 
> Are you sure that's true?

Yes. Read The Design and Implementation of the FreeBSD Operating System
for the details.

> Monitoring inactive memory:
> cat vm_stat.sh
> #!/usr/bin/env bash
> 
> while [ 1 ]; do
>    date +%Y.%m.%d.%H.%M.%S
>    sysctl -a | grep vm.stats.vm.
>    sleep 1
> done
> 
> And even when we get out of memory with swap_pager_getswapspace, Inactive
> Memory (from the log file) stays around 20GB (doesn't go down or up):
> vm.stats.vm.v_inactive_count: 5193530 (* 4k pages is around 20GB)
> 
> Then we have 20GB inactive memory, but we still get out of memory with 
> kernel: swap_pager_getswapspace(4): failed. Any ideas why?

Theory: If the planner decides it needs to do 30 sort operations for a
query, it will try to allocate 27G of RAM, which exceeds what's available,
and therefore never gets allocated. So you get the "out of space" message,
but the actual memory usage doesn't change.
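(With the work_mem you show below, that's roughly 30 x 892MB, i.e. close to
27GB for a single query, on a box with only 32G of RAM.)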

> >> maintenance_work_mem = 512MB
> >> effective_cache_size = 10GB
> >> work_mem = 892MB
> > I expect that this value is the cause of the problem. The scenario you
> > describe below is sorting a large table on an unindexed column, meaning
> > it will have to use all that work_mem. I'd be interested to see the
> > output of:
> >
> > EXPLAIN ANALYZE SELECT * FROM t_random ORDER BY md5 LIMIT 100000;
> 
> That was only a test query; it has nothing to do with the production
> queries. Those are mostly SELECTs/INSERTs/UPDATEs on primary keys.

Then provide the _actual_ queries, their EXPLAIN ANALYZE output, the table
layouts, and the basic data distribution for the case that is actually
failing. If your test case is completely non-representative of what's
happening, you're not going to get useful answers.
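
Something along these lines, with your real table and query (the names here
are only placeholders), would be a useful starting point:

  -- hypothetical example; substitute the real table, column and value
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM your_table WHERE your_pk_column = 12345;

plus the table definition (\d your_table in psql) and a rough idea of the
row counts involved.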

> > But even without that information, I'd recommend you reduce work_mem
> > to about 16M or so.
> 
> Why so low? E.g. sorting for reporting or other long-running queries
> would then be done on disk instead of in memory.

Even a simple query can involve multiple sorts, and you're allowing
each sort to use up to 892M of RAM (which is _not_ shared). As noted
earlier, even a moderately complex query could exceed the available
RAM on the system. But since you don't provide the actual queries and
tables causing problems, I can only guess. And since you appear to
have already decided what the cause of the problem is, then crafted
completely irrelevant queries that you think prove your point, I'm
not sure there's anything I can do to help you.
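
That said, work_mem doesn't have to be one global compromise. Keep the
server-wide value small and raise it only in the sessions that run the big
reporting queries, e.g. (values illustrative only; ALTER SYSTEM needs
superuser and is available in 9.4):

  -- small server-wide default, picked up after a reload:
  ALTER SYSTEM SET work_mem = '16MB';
  SELECT pg_reload_conf();

  -- and only in the session (or transaction) running a heavy report:
  SET work_mem = '512MB';
  -- or inside a transaction: SET LOCAL work_mem = '512MB';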

> >> wal_buffers = 8MB
> >> checkpoint_segments = 16
> >> shared_buffers = 7080MB
> >> max_connections = 80
> >> autovacuum_max_workers = 3
> > [snip]
> >
> >>> We are running PostgreSQL 9.4.5 on FreeBSD 10.1 and have multiple
> >>> worker processes connected via persistent connections to PostgreSQL,
> >>> they perform just simple queries with SELECT on primary keys and
> >>> simple INSERTS/UPDATES.
> > That's not at all the test scenario you show below. The scenario below
> > is a large sort operation on a non-indexed column, which is vastly
> > different than a single-row fetch based on an index.
> 
> Yes, that non-indexed select was just for testing purposes.

Completely useless. I can always craft some sort of nonsense case that
will show that something that no sane person would ever do doesn't work.
Especially if I set configuration limits (such as work_mem) to
unreasonably high values.

> >>> Normally nearly all the workers are idle but
> >>> they still consume the maximum configured work mem on the PostgreSQL
> >>> server and the memory is also resident.
> > I see no evidence of that in your top output. Each process has a
> > reference to the 7G of shared_buffers you allocated, which is memory
> > shared by all processes, and is expected. I'm not as familiar with
> > Linux top, but the results appear to be the same.
> 
> Yes, that might be the case, but if it is n times the 7G of shared memory,
> then we have ~20GB of Inactive Memory available, so plenty of memory. And
> why are we getting: kernel: swap_pager_getswapspace(4): failed?

My advice:

First, set work_mem to 16M and see if the problem goes away. If you
still want to investigate exactly what was causing the problem after
that, then post the _actual_ scenario that's causing it. I find it
_extremely_ unlikely that it's being caused by SELECT * FROM table
WHERE indexed_column = ?, so you should gather more information on
which queries are actually failing to execute. If you've got a lot of
data and are doing even moderately complex JOINs, you could easily be
allocating many gigs of sort space for each query. With only 32G of
RAM, a few dozen connections could quickly exhaust that.
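
One way to capture them, if you aren't doing so already, is to have the
server log the statement behind each error, and optionally anything slow
(values illustrative only):

  ALTER SYSTEM SET log_min_error_statement = 'error';  -- log the statement with each ERROR (this is the default)
  ALTER SYSTEM SET log_min_duration_statement = '1s';  -- also log anything running longer than 1s
  SELECT pg_reload_conf();

The next "out of memory" failure will then land in the server log together
with the exact query that triggered it.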

-- 
Bill Moran

