On Tue, Jan 23, 2018 at 5:59 AM, Rambabu V <ram.wissen@xxxxxxxxx> wrote:
> > cat PostgreSQL-2018-01-23_060000.csv|grep FATAL
> $ free -mh
>              total       used       free     shared    buffers     cached
> Mem:           58G        58G       358M        16G       3.6M        41G
> -/+ buffers/cache:        16G        42G
> Swap:         9.5G       687M       8.9G
This does not seem like it should be a problem. Was this data collected near the time of the failure?
> work_mem = 256MB                      # min 64kB
> max_connections = 600
These look pretty high, especially in combination. Why do you need that number of connections? Could you use a connection pooler instead? Or do you just have an application bug (leaked connection handles) that needs to be fixed? Why do you need that amount of work_mem?
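To give a rough sense of why that combination is risky (this is only a back-of-envelope sketch based on the settings you posted): work_mem is a per-sort/per-hash limit, not a per-connection cap, so a single query can use several multiples of it, and 600 backends each doing just one 256MB sort would already want far more memory than the box has:

    -- back-of-envelope only: one 256MB sort per backend, all 600 at once
    SELECT 600 * 256 AS worst_case_mb;   -- ~150 GB, vs. 58 GB of RAM
    -- a complex plan can have several sort/hash nodes, each entitled to
    -- its own work_mem, so the real worst case is even higher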
> ps -ef|grep postgres|grep idle|wc -l
> 171
> ps -ef|grep postgres|wc -l
> 206
How close to the time of the problem was this recorded? How many of the idle connections are 'idle in transaction'?
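If you'd rather answer that from inside the database than from ps, something like this (assuming a PostgreSQL version new enough to have the state column in pg_stat_activity, i.e. 9.2 or later) breaks the sessions down by state:

    SELECT state, count(*)
      FROM pg_stat_activity
     GROUP BY state
     ORDER BY count(*) DESC;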
>    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 109063 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 39:55.61 postgres: test sss 10.20.2.228(55174) idle
>  24910 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 27:45.35 postgres: testl sss 10.20.2.228(55236) idle
> 115539 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 28:22.89 postgres: test sss 10.20.2.228(55184) idle
>   9816 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 40:19.57 postgres: test sss 10.20.2.228(55216) idle
How close to the time of the problem was this recorded? Nothing here seems to be a problem, because almost all the memory these backends have resident is shared memory (RES is nearly all SHR in the output above).
It looks like all your clients decide to run a memory-hungry query simultaneously, consume a lot of work_mem, and cause a problem. Then by the time you notice the problem and start collecting information, they are done and things are back to normal.
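One way to catch that in the act is to sample pg_stat_activity on a short interval around the time the problem usually hits. The snapshot table below is just an illustrative sketch (the table name is made up); run the INSERT every few seconds from cron or a shell loop and look at what was active when memory blew up:

    -- one-time setup: an empty copy of pg_stat_activity plus a timestamp
    CREATE TABLE activity_snapshot AS
        SELECT now() AS sampled_at, * FROM pg_stat_activity WHERE false;

    -- then run this repeatedly (e.g. every 5-10 seconds)
    INSERT INTO activity_snapshot
        SELECT now(), * FROM pg_stat_activity WHERE state <> 'idle';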
Cheers,
Jeff