On Fri, 13 Jun 2014 20:02:01 +0100
Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> wrote:

> This coincides with a checkpoint:

There's a lot going on here (as always with a busy server), but I
suspected a checkpoint problem earlier, and this statement and your
additional information make me suspect it even harder.

In your earlier email with the .conf file, I seem to remember that you
had checkpoint_segments set to 256 and checkpoint_timeout set to 30m.
It's obvious from reading this thread that you've already put quite a
bit of effort into resolving this.  My question: have you tried
_lowering_ the checkpoint settings?

If we assume that the stall is related to checkpoints, and I'm
remembering the settings correctly, then PostgreSQL might have as much
as 4GB of WAL (256 segments x 16MB each) to grind through to complete
a checkpoint.  While not huge, if that's trying to complete at the
same time a lot of other work is going on, it could cause stalls.  If
you lower checkpoint_segments and checkpoint_timeout, it will cause
_more_ disk activity overall, but it will be spread out more.  Whether
or not this helps in your particular situation depends on whether your
incidents are caused by a spike in activity (in which case it might
help) or by the cumulative effect of a lot of activity (in which case
it will probably make the situation worse).

Another thing that I may be misremembering from your earlier email:
did you say that the load on the database was mostly write (or am I
misremembering that you said the OS graphs were showing mostly
write)?  The reason I'm asking is that we've seen problems like you
describe when trying to implement a high-volume queue in PostgreSQL:
the continuous INSERT/SELECT/DELETE grind on the single queue table
was just more than PostgreSQL could keep up with.  We moved that one
portion of the application to Redis and everything else just fell in
line.  I'm stretching a bit to suppose that you have a similar
problem, but it's another data point for you to consider.
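To make the "lower the checkpoint settings" idea concrete, a
spread-the-checkpoints-out configuration might look something like the
sketch below in postgresql.conf.  Treat the numbers as illustrative
starting points to experiment with, not values tuned to your workload;
checkpoint_completion_target and log_checkpoints haven't come up in
this thread, so take those as related knobs worth checking rather than
recommendations:

    checkpoint_timeout = 5min            # checkpoint far more often than every 30m
    checkpoint_segments = 32             # much less WAL per cycle than 256 segments (~4GB)
    checkpoint_completion_target = 0.9   # smear each checkpoint's writes across the interval
    log_checkpoints = on                 # so each checkpoint can be correlated with a stall

The tradeoff is the one described above: more frequent, smaller
checkpoints mean more total write activity, but each individual
checkpoint has far less to flush at once.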
> 2014-06-13 08:13:49 GMT [81383]: [224-1] LOG:  checkpoint complete:
> wrote 11065 buffers (1.4%); 0 transaction log file(s) added, 0
> removed, 56 recycled; write=789.974 s, sync=9.996 s, total=799.987 s;
> sync files=644, longest=2.055 s, average=0.015 s
>
> iostat -mx 1 2 output:
>
> Fri Jun 13 08:13:44 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           20.31    0.00    3.13    2.21    0.00   74.36
>
> Device:  rrqm/s   wrqm/s    r/s       w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   501.00   0.00    275.00   0.00   2.68     19.99      1.30   4.72   3.30  90.80
> sdc        0.00  9275.00   0.00  12141.00   0.00  83.52     14.09    142.56  11.75   0.08  99.70
>
> ..
>
> Fri Jun 13 08:13:52 UTC 2014
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           19.10    0.00    2.99    0.04    0.00   77.87
>
> Device:  rrqm/s   wrqm/s    r/s       w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sdb        0.00   226.00   0.00    264.00   0.00   1.40     10.85      0.01   0.05   0.05   1.40
> sdc        0.00   744.00   0.00    302.00   0.00   4.09     27.71      0.03   0.10   0.10   3.00
> sda        0.00   623.00   0.00    508.00   0.00   4.02     16.22      0.04   0.08   0.06   3.10
> dm-0       0.00     0.00   0.00   1030.00   0.00   4.02      8.00      0.11   0.11   0.03   3.30
> dm-1       0.00     0.00   0.00      0.00   0.00   0.00      0.00      0.00   0.00   0.00   0.00
>
> Any thoughts on that?
>
> Also, at the moment we're using ext4 as the FS for PostgreSQL.  We
> were looking to switch to XFS as part of our upgrade to PostgreSQL
> 9.3.  What are your thoughts on this?
>
> >> Once transactions stop returning, we see connections pile-up.
> >> Eventually, we reach a max, and clients can no longer connect.
> >
> > You need to lower your max connections and institute connection
> > pooling yesterday.  500 connections and 256MB work_mem are a recipe
> > for disaster.  While a machine with a db pool in front of it can
> > survive such scenarios, lack of a pool and high work_mem are
> > killing your machine.
>
> We have only 360 (max potential) connections coming to the database
> from our application servers (see below for pgbouncer configuration),
> and then a few more connections from our monitoring hosts.  In fact,
> looking at the aggregate pgbouncer active server connections, we're
> only peaking at 120 connections under normal load conditions.  During
> an incident I can see it ramp up, with each client using all of their
> available server connections, going up to a total of ~360 connections.
>
> Given the fact that we have 256GB RAM in our server, is your
> statement about 500 connections @ 256MB work_mem still a concern?
> Even, say, 400 connections at a work_mem size of 256MB?  Why would we
> not want to use all of our RAM?
>
> My calculation shows that we're only using:
>
>   shared_buffers + (max_connections * work_mem) +
>   (autovacuum_max_workers * maintenance_work_mem)
>   = 8.5GB + (500 * 256)MB + (6 * 1)GB
>   = 8.5GB + 128GB + 6GB
>   = 142.5GB RAM
>
> > I recommend pgbouncer.  Dirt simple, fast, and will keep your
> > incoming connections limited to something your db can handle.
>
> We already use pgbouncer on our application servers with these
> configuration settings:
>
>   pool_mode = transaction
>   default_pool_size = 20
>   max_client_conn = 125
>   query_wait_timeout = 2
>   server_idle_timeout = 60
>
> Jaco
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
Bill Moran <wmoran@xxxxxxxxxxxxxxxxx>