> "Vaughn, Adam (IMS)" <VaughnA@xxxxxxxxxx> wrote: > > I made all of the changes you mentioned except for the > shared_buffers (which will require a downtime I have set for > tonight). I do have another question though, why did you pick 512 > MB for the new setting of shared_buffers? Everything I've ever > read says that 25% of available RAM is a conservative value for > shared_buffers. Well, in general 25% may be the best for overall *throughput*, but it can often lead to latency spikes, so it depends on what you care about. The curve of throughput against shared_buffers has gotten pretty close to horizontal by around 1GB in a lot of my tests. It doesn't, after all, reduce the size of your cache; it affects how much of the cache access requires an OS call versus staying within PostgreSQL functions. > Also, we had another one of these instances earlier today. During > the 23 minute commit a single CPU was at 98% and it looked like > all writes were backed up waiting for the commit to finalize. What other symptoms did you notice? Did the context switch rate jump? If I remember correctly you are going over the network to your persistent storage -- did you see anything anomalous on the network around that time? Greg Smith has really turned the tuning of this into a science, with incremental adjustments and some specific monitoring techniques. I strongly recommend you look for that on the list archives, in recordings or slides from his presentations on the topice, or his recent book. > During the time our writing never got above 25 MB/s (far less than > we can handle). At that rate it should have been able to write your 6GB shared buffers in about 4.1 minutes. What did the disk writes *average* during the incident? 6GB in 23 minutes would be 4.5MB per second. If your average falls anywhere near that number, you have evidence that the buffers were mostly dirty for some reason and the checkpoint glutted on writing them. > Is it possible that we're missing an index somewhere or there's > something else going on? A missing index would probably show high disk reads from sequential passing large tables. A context switch storm or network problems seem like the two most likely causes to me, but it would pay to monitor anything you can to help pin down the bottleneck next time this happens. -Kevin -- Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin