On Fri, Jun 13, 2014 at 1:02 PM, Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> wrote:
> hi Scott,
>
> On 13 June 2014 00:30, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
>> Just the db and not the server right? Have you tried killing any
>> individual queries that seem to be problem / long running queries?
>
> We have tried to kill individual queries before and it did not have
> any effect.

Try killing them in groups. Do your various servers, doing different
jobs, connect with different usernames? For instance backups use one
name, etc. If so, try killing the ones with a certain usename and see
if the problem resolves after that. Sorry for how hand-wavy that answer
is, but I'm not sitting on a console watching, so I'm not sure what
you're seeing. If the apps all connect as the same usename then just
try killing the oldest, longest-running queries first and see if you
can get it back on its feet that way.

>>> During this performance issue, we found the following symptoms.
>>> Running queries do not return.
>>
>> Even simple ones like "select 1;"?
>> Or ones that are something like "select * from smalltable where id=123;"?
>
> SELECTs against pg_stat_activity, pg_stat_user_tables do work.
>
> Haven't done a SELECT against the affected database during an outage
> for a while now, will add that to our script to do next time. I can't
> remember offhand either.

Worth a look. Just to see how bad it is.

>> How much memory does your machine have? How much of that is shared buf
>> then? If it's got 8G and 6G shared_buffers I'd lower shared_buffers,
>> etc.
>
> 256GB RAM - FWIW, we used to have shared_buffers set to 8G but found
> two mailing list posts that seemed to suggest lowering this value
> fixed similar issues at those sites. So we've lowered it to 6G.
> Still experienced two incidents since.

6G is still really high, especially if you write a lot. The more you
write, the LOWER you want shared_buffers, not higher. We have machines
with 1TB RAM, we mostly read, and we still only have something like
2GB shared_buffers.

>> What is your IO subsystem like? Is it a single 7200 RPM SATA drive, an
>> array of 16 15krpm disks under a caching raid controller? A 1TB
>> FusionIO card?
>
> A Dell PERC H700 RAID card, utilizing the CacheCade functionality (SSD
> read-cache) for PostgreSQL data only.
>
> 2x 146GB 15k SAS disks in RAID 1 for OS
> 2x 50GB SSD disks in RAID 1 for CacheCade
> 2x 146GB 15k SAS disks in RAID 1 for PostgreSQL transaction logs
> 10x 146GB 15k SAS disks in RAID 1+0 for PostgreSQL data

OK good, then it's not likely you're getting hammered by the IO
subsystem.

>> Just because linux SAYS there's no wait doesn't mean there isn't one.
>> Sometimes the io wait numbers are zero while some kernel daemon uses
>> up all the bandwidth and you really are waiting.
>
> OK - any ideas on how to further investigate/validate this?

No, it looks like IO isn't really the issue.

> Having said that, before an incident (again this morning), I can see
> high disk latency (23,000 ms) on /pg_data disks (sdc) (and the graph
> of temp_files is now empty during incidents).

OK, so IO is getting overwhelmed slowly, but it doesn't look like IO is
purely the issue. What I think is happening is that you're getting too
many heavy connections at once, slowly bringing down performance, and
then it just eventually overwhelms the box with too many connections
trying to do too much at once.
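When it does tip over, here's roughly what I mean by killing them in
groups (just a sketch; 'backup_user' is a placeholder usename, and it
assumes 9.2+ where pg_stat_activity has pid and state columns; older
versions call them procpid and current_query):

  -- kill everything connected under one usename
  SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
   WHERE usename = 'backup_user'
     AND pid <> pg_backend_pid();

  -- or take out the ten longest-running active queries first
  SELECT pg_terminate_backend(pid)
    FROM (SELECT pid
            FROM pg_stat_activity
           WHERE state = 'active'
             AND pid <> pg_backend_pid()
           ORDER BY query_start
           LIMIT 10) AS oldest;

pg_cancel_backend() is the gentler option if you'd rather cancel the
query but keep the connection around.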
> This coincides with a checkpoint:
>
> 2014-06-13 08:13:49 GMT [81383]: [224-1] LOG:  checkpoint complete:
> wrote 11065 buffers (1.4%); 0 transaction log file(s) added, 0
> removed, 56 recycled; write=789.974 s, sync=9.996 s, total=799.987 s;
> sync files=644, longest=2.055 s, average=0.015 s
>
> iostat -mx 1 2 output:
>
> Fri Jun 13 08:13:44 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           20.31    0.00    3.13    2.21    0.00   74.36
>
> Device:  rrqm/s   wrqm/s   r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   501.00  0.00   275.00   0.00   2.68     19.99      1.30   4.72   3.30  90.80
> sdc        0.00  9275.00  0.00 12141.00   0.00  83.52     14.09    142.56  11.75   0.08  99.70

I assume this is during a checkpoint and this is after:

> Fri Jun 13 08:13:52 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           19.10    0.00    2.99    0.04    0.00   77.87
>
> Device:  rrqm/s   wrqm/s   r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   226.00  0.00   264.00   0.00   1.40     10.85      0.01   0.05   0.05   1.40
> sdc        0.00   744.00  0.00   302.00   0.00   4.09     27.71      0.03   0.10   0.10   3.00
>
> Any thoughts on that?

Looks normal if that's during / after a checkpoint. Lowering
checkpoint_completion_target may result in less checkpoint churn but
more IO used by the bgwriter.

> Also, at the moment we're using ext4 as the FS for PostgreSQL. We
> were looking to switch to XFS as part of our upgrade to PostgreSQL
> 9.3. What are your thoughts on this?

I've had pretty good luck with ext4, and I doubt it's the problem here.
XFS is a good file system and worth looking into, but I don't expect it
to fix this issue.

>>> Once transactions stop returning, we see connections pile up.
>>> Eventually, we reach a max, and clients can no longer connect.
>>
>> You need to lower your max connections and institute connection
>> pooling yesterday. 500 connections and 256MB work_mem are a recipe for
>> disaster. While a machine with a db pool in front of it can survive
>> such scenarios, lack of a pool and high work_mem are killing your
>> machine.
>
> We have only 360 (max potential) connections coming to the database
> from our application servers (see below for pgbouncer configuration).
> And then a few more connections from our monitoring hosts. In fact,
> looking at the aggregate pgbouncer active server connections we're
> only peaking at 120 connections under normal load conditions. During
> an incident I can see it ramp up and each client using all of their
> available server connections, going up to a total of ~360 connections.
>
> Given the fact that we have 256GB RAM in our server, is your statement
> made about 500 connections @ 256MB work_mem still of a concern? Even
> say 400 connections at a work_mem size of 256MB? Why would we not
> want to use all of our RAM?

That's WAY too many connections. Assuming you've got 16 to 32 cores,
you should be limiting inbound total connections to 2x that, max. If
you benchmark your app against your db (you can use pgbench with custom
scripts to get something similar to your prod load) you should get a
graph that starts with low throughput at 1 connection, builds up
through somewhere around 1x to 2x your cores, and then drops off after
that. The shape of the tail of that curve tells you how your db will
behave in overload. If it drops off real fast, then limiting incoming
connections becomes more important. If it drops off real slow and
linearly, then it's not as big of a deal. Either way, 400 connections
is way too high. Limit your app servers to something smaller so you get
your inbound connections down to 1x to 2x cores.
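If it helps, a quick and dirty way to find that knee with pgbench (a
rough sketch; transactions.sql stands in for a custom script that looks
like your production workload, and the client counts and run length are
just placeholders):

  # run the same custom workload at increasing client counts
  for c in 1 2 4 8 16 32 64 128 256; do
    echo "clients: $c"
    pgbench -n -f transactions.sql -c $c -j $c -T 120 yourdb | grep tps
  done

Plot tps against client count and the peak, plus the shape of the tail
past it, should jump right out at you.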
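Then cap the pools so the aggregate can't climb past that number.
Something along these lines in each app server's pgbouncer.ini (the
numbers are made up; divide whatever total you settle on by however
many app servers you really have, which sounds like ~18 at 20
connections each):

  [databases]
  maindb = host=your-db-host dbname=maindb

  [pgbouncer]
  pool_mode = transaction
  ; ~2 per app server keeps the aggregate near 36 across 18 boxes
  default_pool_size = 2
  reserve_pool_size = 1
  reserve_pool_timeout = 3
  max_client_conn = 125

The extra app-side connections just queue up in the pooler instead of
piling onto the database all at once.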
Sometimes running a dedicated pgbouncer box (or pair, etc.) will allow
you to limit these connections without starving any one box. You can
also set up various pools for different jobs so each one is limited in
a different way. A read-only client can probably have 100 inbound
connections and not mess up the box if it isn't reading and sorting
megabytes of data, etc.

Lastly, if your dataset is bigger than RAM, you WILL be using all your
RAM even with small shared_buffers and work_mem, because the OS will be
using that spare memory to cache. The OS is VERY good at caching large
chunks of data, and letting it use that spare memory is usually a good
thing. OTOH, if you've got a 20G db and 256G RAM, then go for it: leave
enough spare memory to always have the db cached and you're good. But
if you're using pgbouncer wisely, and limiting incoming to say 40
connections instead of 400, then you can't overload your box with 256MB
work_mem because the max is 10 or 20GB or so. With 400 inbound
connections your sorts may be taking up spare RAM and forcing the OS to
evict its own caches etc., slowing things down even more since you now
have to go to the drives to get all your data, not the kernel fs cache.

> My calculation shows that we're only using:
>
> shared_buffers + (max_connections * work_mem) +
> (autovacuum_max_workers * maintenance_work_mem)
> = 8.5GB + (500*256)MB + (6*1)GB
> = 8.5GB + 128GB + 6GB
> = 142GB RAM.

That's still a LOT of memory. But you're probably not using it all.

>> I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
>> connections limited to something your db can handle.
>
> We already use pgbouncer on our application servers with these
> configuration settings:
>
> pool_mode = transaction
> default_pool_size = 20
> max_client_conn = 125
> query_wait_timeout = 2
> server_idle_timeout = 60

So you must have a lot of app servers to be seeing 400 inbound
connections. 20 or more? Each one allowing 20 connections is a LOT.

Let's say your peak throughput is at 32 connections. This is pretty
common with a 16-core machine with the number of drives you have. Let's
say that you can do 500 of your heaviest transactions per second there,
and that it goes down enough so that by the time you get to 64
connections, you're at 50% of that, or 250. Given that slope, by the
time you get to 512 connections you're handling 20 or 30 transactions
per second. Getting that number down to something closer to your best
throughput number is gonna make a huge difference. Keep in mind the
other queries on the app servers will just be in a wait state, but your
db server will still be clearing transactions at a much higher rate
than if it's being overloaded.

--
To understand recursion, one must first understand recursion.