Thank you so much for the input and the detail provided.

I'll be making the configuration changes, probably over the course of the week, checking the effect after each (as reminded by Scott Marlowe). I was pushed to put the new version into production over the weekend, which at least may provide me with some accurate feedback, so I will see what happens for a bit before addressing the disk/drive layout.

-Midge
----- Original Message -----
Sent: Sunday, August 21, 2011 12:20 PM
Subject: Re: settings input for upgrade

On 08/18/2011 05:55 PM, Midge Brown wrote:
DB1 is 10GB and consists of multiple tables that I've spread out so that the 3 most used have their data and indexes on 6 separate RAID1 drives, the 3 next busiest have data & index on 3 drives, and the remaining tables and indexes are on the RAID10 drive. The WAL for all is on a separate RAID1 drive.

DB2 is 25GB with data, index, and WAL all on separate RAID1 drives.

DB3 is 15GB with data, index, and WAL on separate RAID1 drives.

Anytime you have a set of disks and a set of databases/tables to lay out onto them, there are two main options to consider:
-Put all of them into a single RAID10 array. Performance will be high no matter what subset of the database is being used. But if one particular part of a database is really busy, it can divert resources away from the rest.
-Break the database into fine-grained pieces and carefully lay out each of them on disk. Performance of any individual chunk will be steady here. But if only a subset of the data is being used, server resources will be idle. All of the disks that don't have data related to that will be unused.
Consider two configurations following these ideas:

1) 12 disks are placed into a large RAID10 array. Peak transfer rate will be about 600MB/s on sequential scans.

2) 6 RAID1 arrays are created and the database is manually laid out onto those disks. Peak transfer rate from any one section will be closer to 100MB/s.
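As an aside, the manual layout in (2) is normally declared with tablespaces. A minimal sketch, where the mount points and the table are hypothetical stand-ins for individual RAID1 arrays and a hot table:

    -- Hypothetical mount points, one per RAID1 array:
    CREATE TABLESPACE busy_data  LOCATION '/mnt/array1/pgdata';
    CREATE TABLESPACE busy_index LOCATION '/mnt/array2/pgdata';
    -- Pin a hot table's heap and its index to separate spindles:
    CREATE TABLE orders (id integer, placed_at timestamptz) TABLESPACE busy_data;
    CREATE INDEX orders_id_idx ON orders (id) TABLESPACE busy_index;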
Each of these is optimizing for a different use scenario. Here's the best case for each:

-One user is active, and they're hitting one of the database sections. In setup (1) they might get 600MB/s, the case where it shows the most benefit. In setup (2), they'd only get 100MB/s.
-10 users are pounding one section of the database; 1 user is hitting a different section. In setup (2), all 10 users will be fighting over access to one section of the disk, each getting (at best) 10MB/s of its transfers. The nature of random I/O means that it will likely be much worse for them. Meanwhile, the user hitting the other database section will still be merrily chugging away getting their 100MB/s. Had setup (1) been used, you'd have 11 users fighting over 600MB/s, so at best 55MB/s for each. And with the random mix, it could be much worse.
Which of these is better? Well, (1) is guaranteed to use your hardware to its fullest capability. There are some situations where contention over the disk array will cause performance to be lower for some people, compared to if they had an isolated environment split up more like (2). But the rest of the time, (2) will have taken a large number of disks and left them idle. The second example shows this really well. The mere fact that you have such a huge aggregate speed available means that the big array really doesn't necessarily suffer that badly from a heavy load. It has 6X as much capacity to handle it. You really need to have a >6:1 imbalance in access before the carefully laid out version pulls ahead. In every other case, the big array wins.
You can defend (2) as the better choice if you have really compelling, hard data proving that use of the various parts of the data is split quite evenly across the expected incoming workload. If you have response time latency targets that require separating resources evenly among the various types of users, it can also make sense there. I don't know whether the data you've been collecting from your older version is good enough to know that for sure or not.
In every other case, you'd be better off just dumping the whole pile into a single, large array, and letting the array and operating system figure out how to schedule things best. That's why this is the normal practice for building PostgreSQL systems. The sole exception is that splitting out the pg_xlog filesystem can usually be justified even with a larger array. The fact that it's always sequential I/O means that mixing its work with the rest of the server doesn't work as well as giving it a dedicated pair of drives to write to, where it doesn't ever stop to seek somewhere else.
wal_buffers = 32MB

This might as well drop to 16MB. And you've already gotten some warnings about work_mem. Switching to a connection pooler would help with that, too.
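In postgresql.conf terms, a sketch of that suggestion (the comments are mine, not from this thread):

    wal_buffers = 16MB    # little measurable gain beyond this size
    # work_mem is allocated per sort/hash operation per backend, so a
    # connection pooler caps how much of it can be in use at once.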
autovacuum_analyze_threshold = 250
autovacuum_naptime = 10min
autovacuum_vacuum_threshold = 250
vacuum_cost_delay = 10ms

This strikes me as more customization than you really should be doing to autovacuum, if you haven't been running on a recent version of PostgreSQL yet. You shouldn't ever need to touch the thresholds, for example. Those only matter on really small tables; once something gets big enough to really matter, the threshold part is really small compared to the scale factor one. And the defaults are picked partly so that cleanup of the system catalog tables is done frequently enough. You're slowing that cleanup by moving the thresholds upward so much, and that's not a great idea.
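For reference, the trigger autovacuum uses (per the PostgreSQL documentation) is:

    vacuum threshold = autovacuum_vacuum_threshold
                     + autovacuum_vacuum_scale_factor * reltuples

With the defaults (50 and 0.2), a 1,000,000-row table isn't vacuumed until roughly 200,050 rows are dead, so raising the base threshold to 250 changes essentially nothing there; on a 200-row catalog table, the same change moves the trigger from 90 dead rows to 290, more dead rows than the table has live ones.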
For similar reasons, you really shouldn't be touching autovacuum_naptime unless there's really good evidence it's necessary for your environment.
Changing things such that regular vacuums executed at the command line happen with a cost delay like this should be fine though. Those will then run using about twice as many resources as the autovacuum ones (a 10ms delay versus autovacuum's 20ms default), but no longer flat-out the way an unthrottled manual VACUUM normally does.
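One way to throttle just a single manual run, as a sketch (the table name is a placeholder):

    -- Applies to this session only; reverts when the session ends.
    SET vacuum_cost_delay = '10ms';
    VACUUM VERBOSE some_table;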
deadlock_timeout = 3s

You probably don't want to increase this. When you reach the point where you want to find slow lock issues by turning on log_lock_waits, you're just going to put it right back to the default again--or lower it.
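The usual pattern for that later investigation looks like this in postgresql.conf (a sketch, not a setting from this thread):

    log_lock_waits = on      # log a message when a lock wait exceeds
    deadlock_timeout = 1s    # this timeout (1s is the default)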
--
Greg Smith 2ndQuadrant US greg@xxxxxxxxxxxxxxx Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us