On Monday, December 10, 2018, 7:45:07 PM EST, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Thomas Carroll <tomfecarroll@xxxxxxxxx> writes:
> On Monday, December 10, 2018, 5:50:22 PM EST, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> Also, as mentioned upthread, it'd be interesting to see if there's
>> a memory context dump showing up in your server log.
> There are many memory context dumps, and they are long (well over 100
> lines apiece). To me these are unfamiliar and hard to read, but I tried
> to pick one that referenced all the same tables used in the function.
> I put one such dump (188 lines) here:
> https://drive.google.com/file/d/1QI4nffdZByIehb_-GULOagI_dKpKElg_/view
Thanks. Here's the smoking gun:
CacheMemoryContext: 221241400 total in 38 blocks; 152192 free (90 chunks); 221089208 used
This looks like something is leaking memory directly in CacheMemoryContext
(not in any of its child contexts). Can you confirm that all the memory
context dumps in your logs are showing similarly very-large total space
values for CacheMemoryContext?
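A quick way to confirm that is to pull just those lines out of the server
log and compare the totals; assuming the dumps went to the stock logging
collector location under $PGDATA/log, something like:
grep -h "CacheMemoryContext:" $PGDATA/log/postgresql-*.log # one line per dump; compare the "total" figures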
This doesn't seem to be related to temp table usage, at least not
directly. (As a cross-check, I created and dropped 10000 temp tables in a
session, and the process's VM footprint didn't budge.) My guess is that
something else you are doing is tickling the leak.
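That kind of cross-check is easy to repeat: feed a single backend a long
stream of create/drop pairs and compare its footprint before and after,
roughly like this (the database and table names are just placeholders):
for i in $(seq 1 10000); do
    echo "CREATE TEMP TABLE leak_probe (x int); DROP TABLE leak_probe;"
done | psql -d postgres -U postgres > /dev/null
# compare the backend's /proc/<pid>/smaps totals before and after the loop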
You said you'd been able to reproduce this problem outside production.
Any chance you could boil that down to a sharable test case? I'd bet
long odds that it's not related to your data particularly, but rather
to your schema or something your function is doing, so you probably
could reproduce it with some dummy data (and, perhaps, not even very
much of that; repeating some DDL action many times is probably what
is making the leak grow to something noticeable).
If I had to guess at this point, I'd wonder if it's something to do
with partitioned tables. That code is still relatively wet behind
the ears, and memory leaks are the kind of bug I'd be expecting to
still be shaking out of it now. But you might not be using partitioned
tables at all. Anyway, anything you're doing with new-in-v10 features,
particularly schema-related features, would be a good bet for the
source of the problem.
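Whether partitioned tables are in the picture at all is a one-line check;
in v10 they carry relkind 'p' in pg_class (substitute your database name
for "yourdb"):
psql -d yourdb -U postgres -c "SELECT relname FROM pg_class WHERE relkind = 'p';" # no rows means no partitioned tables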
regards, tom lane
Thanks for your response, and sorry for the delay in getting back to this
over the holidays.
To refresh your memory: the issue I encountered is that a long-lived
connection that periodically issues function calls consumes more and more
memory, eventually causing memory exhaustion. Admittedly, it does take
weeks.
This email is about creating a shareable test case.
The test case is simple - so simple that I am surprised no one has
encountered this before, which makes me suspect the real problem is
something stupid I have done in trying to make it shareable and simple.
In creating this example I have not run the code until it crashed the
server; instead I used the measuring methodology shown below (monitoring
the process's smaps file).
I have put the needed files here:
https://drive.google.com/open?id=1ff074Qxn7Bx9Ud9GRaegDfuJn_Tf_NTi
Table of contents:
testbed.public.backup # database dump, not compressed; includes "create database"
pg_hba.conf # self-explanatory; allows local postgres to log in as trusted
postgresql.conf # self-explanatory; nothing weird
query_loop.bash # sends a query via the named pipe to Postgres once per second
To recreate the issue, load the "testbed" database into a Postgres server
configured so that psql does not need to send a password for user postgres.
I created the server with this:
initdb $PGDATA --locale="en_US.UTF-8" -E "UTF-8" -k
Then run as postgres:
cd /my/directory
mkfifo qpipe # named pipe conduit to persistent Postgres connection
cat > qpipe & # keep pipe from closing and terminating the PG connection
psql -h localhost -d testbed -U postgres < qpipe & # listener connection
./query_loop.bash # Send function call over and over once per second
# Observe output; should be 78 rows per second
# Switch to a new terminal session
# Identify the PID of the backend postgres process serving the psql
# connection (it is listed in pg_stat_activity)
# Monitor with this command, substituting the real PID for 1234
grep "^Private" /proc/1234/smaps | awk '{a+=$2}END{print a/1024.0}'
Give it a few minutes, and you will observe memory usage climbing steadily
(if my measurement technique is good). From last night to this morning I
observed memory consumption grow from 3.5 to 46.6 MB.
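To capture the growth curve rather than spot checks, the same one-liner can
be wrapped in a loop that appends a timestamped sample every minute
(mem_growth.log and the 1234 PID are placeholders):
# Record one timestamped memory sample per minute for later plotting.
while true; do
    echo "$(date +%s) $(grep '^Private' /proc/1234/smaps | awk '{a+=$2} END {print a/1024.0}')" >> mem_growth.log
    sleep 60
done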
Tested on Postgres version 10.5.
Again, thanks for any help.
Tom Carroll