Hello All,
We changed the settings to:
max_connections = 200
work_mem = 1GB
temp_buffers = 1GB
shared_buffers = 70GB
maintenance_work_mem = 2GB
effective_cache_size = 180GB
max_locks_per_transaction = 128
autovacuum_max_workers = 12
checkpoint_segments = 256
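For reference, the live values can be confirmed from any session through the standard pg_settings view (the query below simply mirrors the list above; note that shared_buffers, max_connections, max_locks_per_transaction and autovacuum_max_workers only take effect after a full restart):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_connections', 'work_mem', 'temp_buffers',
               'shared_buffers', 'maintenance_work_mem',
               'effective_cache_size', 'max_locks_per_transaction',
               'autovacuum_max_workers', 'checkpoint_segments');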
We then ran strace on one of the programs that was getting stuck and dying on its own. Here is the strace log (run from another machine):
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0LSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 77
write(2, "INFO: ---> did 2\n", 30INFO: ---> did 2
) = 30
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0MSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 78
write(2, "INFO: ---> did 14\n", 31INFO: ---> did 14
) = 31
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
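(As a server-side complement to the strace above, something like this standard pg_stat_activity query could show what the backend for the stuck program is doing at that moment -- columns as on 9.2+; picking out the right backend by user or client address is assumed:)

SELECT pid, usename, client_addr, state, query_start, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;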
On viewing the logs on the Postgres server, we found this error:
FATAL: connection to client lost
There have been some improvements: a few of the DB functions that were getting stuck now run fine, but the main ones are still failing.
-Dhruv
On Mon, Aug 25, 2014 at 1:48 PM, Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
Kevin,

This has been valuable information; thanks a lot for it. I was able to reduce max_connections down to 200, but I wanted to run the other parameters by you before I go ahead and make changes.

Here is the dilemma for me: in the morning we run memory-intensive processes that easily use 4-10GB of resident memory each for processing. I thought of using work_mem = 1GB, which should commit around 200GB of RAM. There are also other processes that use temporary tables holding around 200K records, so a temp_buffers setting of 1GB sounds good, but I am slightly worried it could delay those memory-intensive processes.

I am planning on getting shared_buffers down to 60GB from 80GB.

OS:
vm.dirty_background_ratio = 10

In regards to vm.dirty_background_ratio, what number do you think would be reasonable for this kind of scenario?

Thanks,
Dhruv

--

On Mon, Aug 25, 2014 at 12:53 PM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
> Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
>> Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
>>> Other informational details about configurations are:
>>> shared_buffers = 80GB
>>> temp_buffers = 2GB
>>> work_mem = 2GB
>>> maintenance_work_mem = 16GB
>> How much RAM is on the machine (or VM)?
> And RAM on server is 384GB RAM.
> Currently we have a max connection setting for 1000 connections.

The temp_buffers setting is a limit *per connection*, and once RAM
for a temporary table is allocated to a connection it is never
released; so even when all connections are idle they could be
setting aside 2TB of RAM for possible use for caching temporary
tables.
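Spelled out with the numbers quoted above (illustrative arithmetic only):

-- 1000 allowed connections, each eventually holding 2GB of temp_buffers:
SELECT 1000 * 2 AS temp_buffers_worst_case_gb;  -- 2000 GB, i.e. roughly 2 TB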
The work_mem setting is an attempt to limit RAM per node of an
executing query; one connection can create many allocations of the
size set for work_mem. Since not all queries have nodes that
require such allocations, and not all connections are necessarily
active at the same time, a general rule of thumb is to allow for
one work_mem allocation per connection allowed by max_connections.
So these settings can easily result in another 2TB of allocations,
beyond the temp_buffers mentioned above.
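The same rule of thumb in numbers (again just the settings you listed; real usage depends on plan shapes and concurrency):

-- one work_mem-sized allocation per allowed connection:
SELECT 1000 * 2 AS work_mem_rule_of_thumb_gb;  -- another ~2000 GB on top of temp_buffers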
A high shared_buffers setting can result in "write gluts" at the OS
level when a high percentage of that memory becomes dirty and is
dumped to the OS in a short time. This can result in the OS
appearing to "lock up" for as long as it takes it to flush all of
the dirty data to disk. I don't know what your system is capable
of, but most database servers I've seen fall between 20MB/second
and 200MB/second. That would correspond to apparent stalls lasting
between 6.8 minutes and 1.1 hours. By tweaking the bgwriter
settings and the OS dirty settings you can drastically reduce this,
but I don't think you've indicated having done that, so 80GB can be
expected to cause apparent lockups of those durations.
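Those stall figures are simply the dirty buffer pool divided by the flush rate (a rough sketch; your actual flush rate depends on the storage):

-- 80 GB of dirty shared_buffers flushed at 200 MB/s and at 20 MB/s:
SELECT round((80 * 1024 / 200) / 60.0, 1)  AS minutes_at_200mb_per_sec,  -- ~6.8 minutes
       round((80 * 1024 / 20) / 3600.0, 1) AS hours_at_20mb_per_sec;     -- ~1.1 hours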
A high shared_buffers setting makes you more vulnerable to long
stalls because of transparent huge page operations of the OS.
And overcommitting RAM by a factor of more than 10x is not going to be pretty in any event.
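That factor falls out of the totals above, counting only work_mem and temp_buffers against physical RAM:

-- ~2 TB of work_mem plus ~2 TB of temp_buffers vs. 384 GB of RAM:
SELECT round((2000 + 2000) / 384.0, 1) AS overcommit_factor;  -- a bit over 10x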
If I could not reduce max_connections, I would set work_mem to no
more than 100MB and temp_buffers to no more than 50MB. I would
drop maintenance_work_mem to 2GB. I would probably drastically
reduce shared_buffers and would most certainly make autovacuum and
bgwriter more aggressive than the default.
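For comparison, the same worst-case arithmetic with those values (still only a sketch, and still assuming 1000 allowed connections):

-- 1000 connections * (100 MB work_mem + 50 MB temp_buffers):
SELECT round(1000 * (100 + 50) / 1024.0) AS worst_case_gb;  -- ~146 GB, which at least fits in 384 GB of RAM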
If you make those changes and still see a problem, only then is it
worth looking at other possible causes.
Regards
Dhruv
404-551-2578