On Fri, Jun 13, 2014 at 1:02 PM, Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> wrote:
> hi Scott,
>
> On 13 June 2014 00:30, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
>> Just the db and not the server right? Have you tried killing any
>> individual queries that seem to be problem / long running queries?
>
> We have tried to kill individual queries before and it did not have
> any effect.

Try killing them in groups. Do your various servers, doing different
jobs, connect with different usernames? For instance backups use one
name, etc. If so, try killing the ones with a certain usename and see
if the problem resolves after that. Sorry for how hand-wavy that answer
is, but I'm not sitting on a console watching, so I'm not sure what
you're seeing. If the apps all connect as the same usename then just
try killing the oldest, longest-running queries first and see if you
can get it back on its feet that way.

>>> During this performance issue, we found the following symptoms.
>>> Running queries do not return.
>>
>> Even simple ones like "select 1;"?
>> Or ones that are something like "select * from smalltable where id=123;"?
>
> SELECTs against pg_stat_activity, pg_stat_user_tables do work.
>
> Haven't done a SELECT against the affected database during an outage
> for a while now, will add that to our script to do next time. I can't
> remember offhand either.

Worth a look. Just to see how bad it is.

>> How much memory does your machine have? How much of that is shared buf
>> then? If it's got 8G and 6G shared_buffers I'd lower shared_buffers,
>> etc.
>
> 256GB RAM - FWIW, we used to have shared_buffers set to 8G but found
> two mailing list posts that seemed to suggest lowering this value
> fixed similar issues at those sites. So we've lowered it to 6G.
> Still experienced two incidents since.

6G is still really high, especially if you write a lot. The more you
write, the LOWER you want shared_buffers, not higher. We have machines
with 1TB RAM, we mostly read, and we still only have something like
2GB shared_buffers.

>> What is your IO subsystem like? Is it a single 7200 RPM SATA drive, an
>> array of 16 15krpm disks under a caching raid controller? A 1TB
>> FusionIO card?
>
> A Dell PERC H700 RAID card, utilizing the CacheCade functionality (SSD
> read-cache) for PostgreSQL data only.
>
> 2x 146GB 15k SAS disks in RAID 1 for OS
> 2x 50GB SSD disks in RAID 1 for CacheCade
> 2x 146GB 15k SAS disks in RAID 1 for PostgreSQL transaction logs
> 10x 146GB 15k SAS disks in RAID 1+0 for PostgreSQL data

OK good, then it's not likely you're getting hammered by the IO
subsystem.

>> Just because linux SAYS there's no wait doesn't mean there isn't one.
>> Sometimes the io wait numbers are zero while some kernel daemon uses
>> up all the bandwidth and you really are waiting.
>
> OK - any ideas on how to further investigate/validate this?

No, it looks like IO isn't really the issue.

> Having said that, before an incident (again this morning), I can see
> high disk latency (23,000 ms) on /pg_data disks (sdc) (and the graph
> of temp_files is now empty during incidents).

OK, so IO is getting overwhelmed slowly, but it doesn't look like IO is
purely the issue. What I think is happening is that you're getting too
many heavy connections at once, slowly bringing down performance, and
then it just eventually overwhelms the box with too many connections
trying to do too much at once.
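When it does tip over, here's roughly what I mean by killing them in
groups (just a sketch; 'backup_user' is a placeholder usename, and it
assumes 9.2+ where pg_stat_activity has pid and state columns; older
versions call them procpid and current_query):

  -- kill everything connected under one usename
  SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
   WHERE usename = 'backup_user'
     AND pid <> pg_backend_pid();

  -- or take out the ten longest-running active queries first
  SELECT pg_terminate_backend(pid)
    FROM (SELECT pid
            FROM pg_stat_activity
           WHERE state = 'active'
             AND pid <> pg_backend_pid()
           ORDER BY query_start
           LIMIT 10) AS oldest;

pg_cancel_backend() is the gentler option if you'd rather cancel the
query but keep the connection around.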
> This coincides with a checkpoint:
>
> 2014-06-13 08:13:49 GMT [81383]: [224-1] LOG:  checkpoint complete:
> wrote 11065 buffers (1.4%); 0 transaction log file(s) added, 0
> removed, 56 recycled; write=789.974 s, sync=9.996 s, total=799.987 s;
> sync files=644, longest=2.055 s, average=0.015 s
>
> iostat -mx 1 2 output:
>
> Fri Jun 13 08:13:44 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           20.31    0.00    3.13    2.21    0.00   74.36
>
> Device:  rrqm/s   wrqm/s   r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   501.00  0.00   275.00   0.00   2.68     19.99      1.30   4.72   3.30  90.80
> sdc        0.00  9275.00  0.00 12141.00   0.00  83.52     14.09    142.56  11.75   0.08  99.70

I assume this is during a checkpoint and this is after:

> Fri Jun 13 08:13:52 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           19.10    0.00    2.99    0.04    0.00   77.87
>
> Device:  rrqm/s   wrqm/s   r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   226.00  0.00   264.00   0.00   1.40     10.85      0.01   0.05   0.05   1.40
> sdc        0.00   744.00  0.00   302.00   0.00   4.09     27.71      0.03   0.10   0.10   3.00
>
> Any thoughts on that?

Looks normal if that's during / after a checkpoint. Lowering
checkpoint_completion_target may result in less checkpoint churn but
more IO used by the bgwriter.

> Also, at the moment we're using ext4 as the FS for PostgreSQL. We
> were looking to switch to XFS as part of our upgrade to PostgreSQL
> 9.3. What are your thoughts on this?

I've had pretty good luck with ext4, and I doubt it's the problem here.
XFS is a good file system and worth looking into, but I don't expect it
to fix this issue.

>>> Once transactions stop returning, we see connections pile up.
>>> Eventually, we reach a max, and clients can no longer connect.
>>
>> You need to lower your max connections and institute connection
>> pooling yesterday. 500 connections and 256MB work_mem are a recipe for
>> disaster. While a machine with a db pool in front of it can survive
>> such scenarios, lack of a pool and high work_mem are killing your
>> machine.
>
> We have only 360 (max potential) connections coming to the database
> from our application servers (see below for pgbouncer configuration).
> And then a few more connections from our monitoring hosts. In fact,
> looking at the aggregate pgbouncer active server connections we're
> only peaking at 120 connections under normal load conditions. During
> an incident I can see it ramp up and each client using all of their
> available server connections, going up to a total of ~360 connections.
>
> Given the fact that we have 256GB RAM in our server, is your statement
> made about 500 connections @ 256MB work_mem still of a concern? Even
> say 400 connections at a work_mem size of 256MB? Why would we not
> want to use all of our RAM?

That's WAY too many connections. Assuming you've got 16 to 32 cores,
you should be limiting inbound total connections to 2x that, max. If
you benchmark your app against your db (you can use pgbench with custom
scripts to get something similar to your prod load) you should get a
graph that starts with low throughput at 1 connection, builds up
through somewhere around 1x to 2x your cores, and then drops off after
that. The shape of the tail of that curve tells you how your db will
behave in overload. If it drops off real fast, then limiting incoming
connections becomes more important. If it drops off real slow and
linearly, then it's not as big of a deal. Either way, 400 connections
is way too high. Limit your app servers to something smaller so you get
your inbound connections down to 1x to 2x cores.
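If it helps, a quick and dirty way to find that knee with pgbench (a
rough sketch; transactions.sql stands in for a custom script that looks
like your production workload, and the client counts and run length are
just placeholders):

  # run the same custom workload at increasing client counts
  for c in 1 2 4 8 16 32 64 128 256; do
    echo "clients: $c"
    pgbench -n -f transactions.sql -c $c -j $c -T 120 yourdb | grep tps
  done

Plot tps against client count and the peak, plus the shape of the tail
past it, should jump right out at you.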
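Then cap the pools so the aggregate can't climb past that number.
Something along these lines in each app server's pgbouncer.ini (the
numbers are made up; divide whatever total you settle on by however
many app servers you really have, which sounds like ~18 at 20
connections each):

  [databases]
  maindb = host=your-db-host dbname=maindb

  [pgbouncer]
  pool_mode = transaction
  ; ~2 per app server keeps the aggregate near 36 across 18 boxes
  default_pool_size = 2
  reserve_pool_size = 1
  reserve_pool_timeout = 3
  max_client_conn = 125

The extra app-side connections just queue up in the pooler instead of
piling onto the database all at once.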
Sometimes running a dedicated pgbouncer box (or pair, etc.) will allow
you to limit these connections without starving any one box. You can
also set up various pools for different jobs so each one is limited in
a different way. A read-only client can probably have 100 inbound
connections and not mess up the box if it isn't reading and sorting
megabytes of data, etc.

Lastly, if your dataset is bigger than RAM, you WILL be using all your
RAM even with small shared_buffers and work_mem, because the OS will be
using that spare memory to cache. The OS is VERY good at caching large
chunks of data, and letting it use that spare memory is usually a good
thing. OTOH, if you've got a 20G db and 256G RAM, then go for it: leave
enough spare memory to always have the db cached and you're good. But
if you're using pgbouncer wisely, and limiting incoming to say 40
connections instead of 400, then you can't overload your box with 256MB
work_mem because the max is 10 or 20GB or so. With 400 inbound
connections your sorts may be taking up spare RAM and forcing the OS to
evict its own caches etc., slowing things down even more since you now
have to go to the drives to get all your data, not the kernel fs cache.

> My calculation shows that we're only using:
>
> shared_buffers + (max_connections * work_mem) +
> (autovacuum_max_workers * maintenance_work_mem)
> = 8.5GB + (500*256)MB + (6*1)GB
> = 8.5GB + 128GB + 6GB
> = 142GB RAM.

That's still a LOT of memory. But you're probably not using it all.

>> I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
>> connections limited to something your db can handle.
>
> We already use pgbouncer on our application servers with these
> configuration settings:
>
> pool_mode = transaction
> default_pool_size = 20
> max_client_conn = 125
> query_wait_timeout = 2
> server_idle_timeout = 60

So you must have a lot of app servers to be seeing 400 inbound
connections. 20 or more? Each one allowing 20 connections is a LOT.

Let's say your peak throughput is at 32 connections. This is pretty
common with a 16-core machine with the number of drives you have. Let's
say that you can do 500 of your heaviest transactions per second there,
and that it goes down enough so that by the time you get to 64
connections, you're at 50% of that, or 250. Given that slope, by the
time you get to 512 connections you're handling 20 or 30 transactions
per second. Getting that number down to something closer to your best
throughput number is gonna make a huge difference. Keep in mind the
other queries on the app servers will just be in a wait state, but your
db server will still be clearing transactions at a much higher rate
than if it's being overloaded.

--
To understand recursion, one must first understand recursion.