Hello All,
We changed the settings to:
max_connections = 200
work_mem = 1GB
temp_buffers = 1GB
shared_buffers = 70GB
maintenance_work_mem = 2GB
effective_cache_size = 180GB
max_locks_per_transaction = 128
autovacuum_max_workers = 12
checkpoint_segments = 256
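For reference, the live values can be confirmed from any session through the standard pg_settings view (the query below simply mirrors the list above; note that shared_buffers, max_connections, max_locks_per_transaction and autovacuum_max_workers only take effect after a full restart):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_connections', 'work_mem', 'temp_buffers',
               'shared_buffers', 'maintenance_work_mem',
               'effective_cache_size', 'max_locks_per_transaction',
               'autovacuum_max_workers', 'checkpoint_segments');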
We then ran strace on one of the programs that was getting stuck and dying on its own. Here is the strace log (run from another machine):
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0LSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 77
write(2, "INFO: ---> did 2\n", 30INFO: ---> did 2
) = 30
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0MSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 78
write(2, "INFO: ---> did 14\n", 31INFO: ---> did 14
) = 31
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
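(As a server-side complement to the strace above, something like this standard pg_stat_activity query could show what the backend for the stuck program is doing at that moment -- columns as on 9.2+; picking out the right backend by user or client address is assumed:)

SELECT pid, usename, client_addr, state, query_start, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;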
On viewing the logs on the Postgres server, we found this error:
FATAL: connection to client lost
There have been some improvements: a few of the DB functions that were getting stuck now run fine, but the main ones are still failing.
-Dhruv
On Mon, Aug 25, 2014 at 1:48 PM, Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
Kevin,

This has been valuable information; thanks a lot for it. I was able to reduce max_connections down to 200, but I wanted to run the other parameters by you before I go ahead and make changes.

Here is the dilemma for me: in the morning we run memory-intensive processes that easily use 4-10GB of resident memory each for processing. I thought of using work_mem = 1GB, which should commit around 200GB of RAM. There are also other processes that use temporary tables holding around 200K records, so a temp_buffers setting of 1GB sounds good, but I am slightly worried it could delay those memory-intensive processes.

I am planning on getting shared_buffers down to 60GB from 80GB.

OS:
vm.dirty_background_ratio = 10

In regards to vm.dirty_background_ratio, what number do you think would be reasonable for this kind of scenario?

Thanks,
Dhruv

--

On Mon, Aug 25, 2014 at 12:53 PM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
> Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
>> Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
>>> Other informational details about configurations are:
>>> shared_buffers = 80GB
>>> temp_buffers = 2GB
>>> work_mem = 2GB
>>> maintenance_work_mem = 16GB
>> How much RAM is on the machine (or VM)?
> And RAM on server is 384GB RAM.
> Currently we have a max connection setting for 1000 connections.

The temp_buffers setting is a limit *per connection*, and once RAM
for a temporary table is allocated to a connection it is never
released; so even when all connections are idle they could be
setting aside 2TB of RAM for possible use for caching temporary
tables.
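Spelled out with the numbers quoted above (illustrative arithmetic only):

-- 1000 allowed connections, each eventually holding 2GB of temp_buffers:
SELECT 1000 * 2 AS temp_buffers_worst_case_gb;  -- 2000 GB, i.e. roughly 2 TB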
The work_mem setting is an attempt to limit RAM per node of an
executing query; one connection can create many allocations of the
size set for work_mem. Since not all queries have nodes that
require such allocations, and not all connections are necessarily
active at the same time, a general rule of thumb is to allow for
one work_mem allocation per connection allowed by max_connections.
So these settings can easily result in another 2TB of allocations,
beyond the temp_buffers mentioned above.
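The same rule of thumb in numbers (again just the settings you listed; real usage depends on plan shapes and concurrency):

-- one work_mem-sized allocation per allowed connection:
SELECT 1000 * 2 AS work_mem_rule_of_thumb_gb;  -- another ~2000 GB on top of temp_buffers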
A high shared_buffers setting can result in "write gluts" at the OS
level when a high percentage of that memory becomes dirty and is
dumped to the OS in a short time. This can result in the OS
appearing to "lock up" for as long as it takes it to flush all of
the dirty data to disk. I don't know what your system is capable
of, but most database servers I've seen fall between 20MB/second
and 200MB/second. That would correspond to apparent stalls lasting
between 6.8 minutes and 1.1 hours. By tweaking the bgwriter
settings and the OS dirty settings you can drastically reduce this,
but I don't think you've indicated having done that, so 80GB can be
expected to cause apparent lockups of those durations.
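Those stall figures are simply the dirty buffer pool divided by the flush rate (a rough sketch; your actual flush rate depends on the storage):

-- 80 GB of dirty shared_buffers flushed at 200 MB/s and at 20 MB/s:
SELECT round((80 * 1024 / 200) / 60.0, 1)  AS minutes_at_200mb_per_sec,  -- ~6.8 minutes
       round((80 * 1024 / 20) / 3600.0, 1) AS hours_at_20mb_per_sec;     -- ~1.1 hours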
A high shared_buffers setting makes you more vulnerable to long
stalls because of transparent huge page operations of the OS.
And overcommitting RAM by a factor of more than 10x is not going to be pretty in any event.
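That factor falls out of the totals above, counting only work_mem and temp_buffers against physical RAM:

-- ~2 TB of work_mem plus ~2 TB of temp_buffers vs. 384 GB of RAM:
SELECT round((2000 + 2000) / 384.0, 1) AS overcommit_factor;  -- a bit over 10x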
If I could not reduce max_connections, I would set work_mem to no
more than 100MB and temp_buffers to no more than 50MB. I would
drop maintenance_work_mem to 2GB. I would probably drastically
reduce shared_buffers and would most certainly make autovacuum and
bgwriter more aggressive than the default.
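For comparison, the same worst-case arithmetic with those values (still only a sketch, and still assuming 1000 allowed connections):

-- 1000 connections * (100 MB work_mem + 50 MB temp_buffers):
SELECT round(1000 * (100 + 50) / 1024.0) AS worst_case_gb;  -- ~146 GB, which at least fits in 384 GB of RAM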
If you make those changes and still see a problem, only then is it
worth looking at other possible causes.
Regards
Dhruv
404-551-2578