Re: PostgreSQL 14.4 ERROR: out of memory issues

Aleš Zelený <zeleny.ales@xxxxxxxxx> · Thu, 21 Jul 2022 00:26:54 +0200

Hello,

po 18. 7. 2022 v 21:04 odesílatel Tom Lane <tgl@xxxxxxxxxxxxx> napsal:
=?UTF-8?B?QWxlxaEgWmVsZW7DvQ==?= <zeleny.ales@xxxxxxxxx> writes:

> postgres=# show shared_buffers ;

>  shared_buffers

> ----------------

>  10GB

> (1 row)

Oh!  In that case, there is absolutely nothing to see here.

This chunk:

> 00007fd9b0551000 10827040K rw-s- zero (deleted)

must be your shared-memory region, and it's accounting for just

about all of the process's memory space.  There is no reason

to suspect a leak.

I agree this 10 GB is expected to be accounted for shared buffers (just a size correlation) and therefore no memory leak here.

I think you've been fooled by the fact that "top" and some other

tools frequently don't report a shared-memory page as part of a

process's usage until that process touches that particular page.

Thus, the reported memory usage of a Postgres backend will grow

over time as it randomly happens to touch different buffers within

the shared buffer arena.  That does not constitute a leak, but

it does make such tools next door to useless for detecting

actual leaks :-(.  You can only believe there's a leak if the

reported usage doesn't level off after reaching the vicinity of

your shared memory size.

On the other hand, if we do have 10GB of shared buffers plus ~14MB reported by Top Memory Context (even if /proc/<pid>/status reports RssAnon memory when touched, the process VM peak is ~12 GB?
-bash-4.2$ cat 20220718_200230.status.17048.log
Name:	postmaster
Umask:	0077
State:	S (sleeping)
Tgid:	17048
Ngid:	0
Pid:	17048
PPid:	4476
TracerPid:	0
Uid:	26	26	26	26
Gid:	26	26	26	26
FDSize:	1024
Groups:	26 
VmPeak:	12477752 kB
VmSize:	12439740 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 9811560 kB
VmRSS:	 9811560 kB
RssAnon:	 1283964 kB
RssFile:	   15784 kB
RssShmem:	 8511812 kB
VmData:	 1282860 kB
VmStk:	     132 kB
VmExe:	    7816 kB
VmLib:	   85312 kB
VmPTE:	   23716 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	0/160197
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000400000
SigIgn:	0000000001701800
SigCgt:	00000001a0006287
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000001fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	ffffff
Cpus_allowed_list:	0-23
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	5273344
nonvoluntary_ctxt_switches:	13651

I'd expect VMPeak size close to 10GB  (suppose that all shared memory pages were touched) shared_buffers plus ~14MB of private memory, but
VMPeak: 12477752 [kB] - 10485760 shared buffers [kB] = 1991992 kB Which is a bit more than reported RssAnon:	 1283964 kB  memory for the process, but very close.

pmap taken in the same time:

-bash-4.2$ cat 20220718_200230.pmap.17048.log | grep -e anon -e deleted
0000000000db3000    200K rw---   [ anon ]
0000000002d15000    504K rw---   [ anon ]
0000000002d93000 934476K rw---   [ anon ]  <<<= this is suspected to be responsible for the OutOfMemory -> failing fork(), reinitialize the cluser
00007fd989776000 311304K rw---   [ anon ]  <<<= togehter with this one
00007fd9a8d75000  32772K rw---   [ anon ]
00007fd9acb65000     20K rw---   [ anon ]
00007fd9affc1000    372K rw---   [ anon ]
00007fd9b023d000   1028K rw---   [ anon ]
00007fd9b054b000     24K rw---   [ anon ]
00007fd9b0551000 10827040K rw-s- zero (deleted)   <<<= expected to be shared_buffers
00007fdc45f18000     16K rw---   [ anon ]
00007fdc4d50c000    184K rw---   [ anon ]
00007fdc4de16000      4K rw---   [ anon ]
00007fdc4e03c000      8K rw---   [ anon ]
00007fdc4fd1e000     84K rw---   [ anon ]
00007fdc4ff6f000      8K rw---   [ anon ]
00007fdc508d7000      8K rw---   [ anon ]
00007fdc50d59000      4K rw---   [ anon ]
00007fdc5139e000      8K rw---   [ anon ]
00007fdc520f3000     40K rw---   [ anon ]
00007fdc52902000     20K rw---   [ anon ]
00007fdc52ead000     16K rw---   [ anon ]
00007fdc532af000      4K rw---   [ anon ]
00007fdc53e2b000     16K rw---   [ anon ]
00007fdc544db000     16K rw---   [ anon ]
00007fdc54cc8000      8K rw---   [ anon ]
00007fdc54f1a000   1588K rw---   [ anon ]
00007fdc550dc000    104K rw---   [ anon ]
00007fdc550fc000      8K r-x--   [ anon ]
00007fdc550ff000      4K rw---   [ anon ]
00007fdc55102000      4K rw---   [ anon ]
00007fffd38f0000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]

The machine has 40GB of RAM. It is a dedicated DB server, only a single PG instance is running there. Since we again run out of memory from my last reply, now it looks  healthy:

-bash-4.2$ psql -c "select count(*) from pg_stat_activity"
 count 
-------
   336
(1 row)

$-bash-4.2$ sudo smem -u
User     Count     Swap      USS      PSS      RSS 
chrony       1        0      512      515      872 
dbus         1        0     1064     1074     1432 
postfix      2        0     2752     3122     5540 
polkitd      1        0     9152     9276     9816 
postgres-exporter     1        0    11912    11912    11916 
zelenya      9        0    34320    35550    43648 
sql_exporter     1        0    46332    46332    46340 
root        28        0   267920   274273   298388 
postgres   336        0  5904884 16399946 143527180 

-bash-4.2$ free -h
              total        used        free      shared  buff/cache   available
Mem:            39G        7,3G        1,2G         10G         30G         21G
Swap:            0B          0B          0B

The processes suspect to be responsible for memory exhaustion (using the RssAnon memory from /proc/<pid>/status files always belong to a single user:

-bash-4.2$ psql -c "select count(*) from pg_stat_activity where usename='app_feed'"
 count 
-------
    67
(1 row)

So the available 21GB (under normal conditions used by the kernel as file system page cache) is not enough if 20 of the processes allocate 1GB private memory -> OS CommitLimit is reached and OutOfMemory error happens.

So back to why you're getting these out-of-memory failures:

we still don't have much info about that.  I wonder whether

there is something watching the total usage reported by "top"

and taking action based on that entirely-misleading number.

The kernel itself should be aware that there's no extra memory

pressure from N backends all using the same shared memory

segment, but it can be hard for outside tools to know that.

There is no tool taking action when high memory usage is detected, OS kernel provides all the memory requested by processes until it happens, that a fork() fails (usually when autovacuum launcher attempts to spawn a new autovacuum worker as these are most frequently started new processes. But the failure comes from the failed fork() due to a lack of memory for new process creation.

2022-07-01 06:51:02 CEST [4476]: [26-1] user=,db=,host=,app= LOG:  could not fork autovacuum worker process: Cannot allocate memory
2022-07-01 06:51:03 CEST [4476]: [27-1] user=,db=,host=,app= LOG:  could not fork worker process: Cannot allocate memory
2022-07-01 06:51:03 CEST [4476]: [28-1] user=,db=,host=,app= LOG:  could not fork worker process: Cannot allocate memory
2022-07-01 06:51:03 CEST [18882]: [1332-1] user=app_feed,db=prematch,host=172.25.0.231,app=[unknown] ERROR:  could not resize shared memory segment "/PostgreSQL.1824635724" to 204064 bytes: No space left on device

So OS kernel was short of available memory.

Memory Commit limit on the system:
-bash-4.2$ sysctl -q vm.overcommit_memory
vm.overcommit_memory = 2
-bash-4.2$ sysctl -q vm.overcommit_ratio
vm.overcommit_ratio = 95

swap is disabled:
-bash-4.2$ swapon -s
-bash-4.2$ 

CommitLimit = RAM * overcommit_ratio / 100 + swap  ~ 37GB (ignoring total_huge_TLB as huge pages are not used) as can be seen in meminfo:

-bash-4.2$ cat /proc/meminfo 
MemTotal:       41033756 kB
MemFree:         1549608 kB
MemAvailable:   24782516 kB
Buffers:            8468 kB
Cached:         33667252 kB
SwapCached:            0 kB
Active:         24600744 kB
Inactive:       12524428 kB
Active(anon):   14131536 kB
Inactive(anon):   130116 kB
Active(file):   10469208 kB
Inactive(file): 12394312 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1876 kB
Writeback:            16 kB
AnonPages:       3449376 kB
Mapped:         10796800 kB
Shmem:          10811896 kB
Slab:             987832 kB
SReclaimable:     813916 kB
SUnreclaim:       173916 kB
KernelStack:       13648 kB
PageTables:       823236 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    38982068 kB
Committed_AS:   19261792 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      256724 kB
VmallocChunk:   34338043900 kB
Percpu:             5632 kB
HardwareCorrupted:     0 kB
AnonHugePages:     65536 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      110464 kB
DirectMap2M:     6180864 kB
DirectMap1G:    37748736 kB

From the node-exporter metric, I can see that the component that grows over time until the commit limit is reached and the cluster is reinitialized is "Active(anon)". Diving into /proc/<pid>/status I can see multiple processes of the app_feed DB user RssAnon memory growth over time and the sum of the increments matches the "Active(anon)" growth seen in the global view in /proc/meminfo.

This observed growth together with the anonymous memory segments (on top of shared buffers mentioned above) was the reason why after the upgrade I suspect an issue in PG14.4.

At this point I suspect that PG 14.3 vs 14.4 is a red herring,

and what you should be looking into is what else you updated

at about the same time, particularly in the line of container

management tools or the like (if you use any).

As mentioned, the machine is a dedicated DB server serving a single PG instance. No containers are in place, no cgroup setup (neither memory, CPU limits), clean CentOS7 + epel-release + pg packages (node_exporter and postgres_exporter for monitoring).

The only change on that machine was an upgrade from PG14.3 to PG14.4 (using PGDG RPM repo packages) and from that moment, the memory usage and out-of-memory issues started, the first time less than 24 hours after the upgrade.

I'm far from being sure it is a Postgres memory leak, my primary intention is to provide as much data as I could to allow helping me identify this issue's root cause. Wen, I'll be back from my vacation (2 weeks) I'd like to downgrade to 14.3 (and we have to resist using index concurrently operations) to see whether such memory allocations disappear or not.

Any hints or tips on what to do to analyze it better (what data from OS we have to collect over time... (ideally avoiding the downgrade exercise) are welcome.

Kind regards Ales Zeleny