Re: sniff test on some PG 8.4 numbers

On 3/10/13 9:18 PM, Jon Nelson wrote:

> The following is with ext4, nobarrier, and noatime. As noted in the
> original post, I have done a fair bit of system tuning. I have the
> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
> respectively.

That's good, but be aware those values still amount to an essentially unlimited write cache. A server with 4 good but regular hard drives might manage as little as 10MB/s of random writes on a real workload. If 2GB of data ends up dirty, the flushing that happens at the end of a database checkpoint has to clear all of that out of RAM, and at 10MB/s that's a cache flush over 3 minutes long. It's not unusual for pgbench tests to pause for over a minute straight when that happens.

With your setup, where checkpoints happen every 5 minutes, this only happens once per test run. The disruption isn't easily visible if you look at the average rate; it's outweighed by the periods where writes happen very fast because the cache isn't full yet. You have to get pgbench to log latency over time, then plot and analyze that data, to see the stalls. This problem is the main reason I put the pgbench-tools set together for running things: once you have to process the latency files and make graphs from them, looking at the results by hand starts to be a pain.
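If you want to capture that raw data without pgbench-tools, adding -l to the run writes a per-transaction latency log you can graph afterward; that's essentially what the toolset automates:

pgbench -j 8 -c 32 -M prepared -T 600 -l
# writes pgbench_log.<pid> files; each line records the transaction time in
# microseconds plus a timestamp, so the checkpoint stalls show up as spikes
# once you plot latency against elapsed time

(The exact log file name and column layout vary a little across versions; check the pgbench docs for the one you're running.)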

> I built 9.2 and using 9.2 and the following pgbench invocation:
>
> pgbench  -j 8  -c 32 -M prepared -T 600
>
> transaction type: TPC-B (sort of)
> scaling factor: 400

I misread this completely in your message before; I thought you wrote 4000. A scaling factor of 400 makes a database that's only about 6GB. Your test is basically seeing how fast the system memory and the RAID cache can move things around. In that situation, your read and write numbers are reasonable. They aren't actually telling you anything useful about the disks, though, because the disks are barely involved here. You've sniffed the CPU, memory, and RAID controller and they smell fine. You'll need at least an order of magnitude increase in scale to get a whiff of the disks.

pgbench's scale factor works out to approximately 16MB of database per unit of scale. You don't actually stress the drives until the resulting database is at least 2X as big as RAM. We had to raise the limit on pgbench scales recently, because earlier versions only go up to ~20,000, and that's no longer a big enough scale to test many servers.
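To put rough numbers on that (the 64GB of RAM here is only a guess at your hardware; substitute the real amount):

# database size ~= scale * 16MB, so scale 400 works out to ~6.4GB
# for a disk-bound test you want at least 2x RAM:
#   scale >= (2 * 64 * 1024MB) / 16MB = 8192
pgbench -i -s 8192

Initializing a database that size takes a while, but the resulting numbers actually tell you something about the drives.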

On the select-only tests, much of the increase from ~100K to ~200K is probably just going from 8.4 to 9.2. There are two major and several minor tuning changes that make 9.2 much more efficient at that specific task.
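(That's the same sort of run with -S added, which skips the writes entirely:

pgbench -S -j 8 -c 32 -M prepared -T 600

At this scale it's mostly measuring how fast clients can grab buffers that are already sitting in memory.)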

> These are the *only* changes I've made to the config file:
>
> shared_buffers = 32GB
> wal_buffers = 16MB
> checkpoint_segments = 1024

Note that these are about the only settings that actually impact pgbench results; the test doesn't stress very many parts of the system. The query optimizer, for example, barely matters to it.

Also be aware these values may not be practical to use in production. You can expect bad latency issues from having shared_buffers that large: at each checkpoint, anything in those 32GB that has been modified must be reconciled and written out to disk, and that is a lot of work. I have systems where we can't make shared_buffers any bigger than 4GB before the checkpoint pauses get too bad.

Similarly, setting checkpoint_segments to 1024 means that you might go through 16GB of writes before a checkpoint happens. That's great for average performance...but when that checkpoint does hit, you're facing a large random I/O backlog.
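An easy way to see how bad those checkpoint spikes actually get on your system is to turn on checkpoint logging and watch the write and sync phases:

# postgresql.conf
log_checkpoints = on

Each checkpoint then logs a "checkpoint complete" line with how many buffers it wrote plus the write/sync/total times. If the sync time regularly runs into tens of seconds, that's the latency the TPS average is hiding.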

There's not much you can do about all this on the Linux side. If you drop the dirty_* parameters too much, maintenance operations like VACUUM start to get slow. Really all you can do is avoid setting shared_buffers and checkpoint_segments so high that the checkpoint backlog ever gets gigantic. The tuning you've done uses higher values than we'd normally recommend, because it's not quite practical to deploy a system like that. That, plus the very small database, is probably why your numbers are so high.
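As a purely illustrative starting point, not a recommendation measured against your hardware, a production configuration tends to look more like this:

# placeholder values; tune against observed checkpoint latency
shared_buffers = 4GB          # around where checkpoint pauses stay tolerable on systems I've worked on
checkpoint_segments = 64      # much smaller backlog; raise it only as far as latency allows

You give up some of the headline TPS number in exchange for checkpoints you can live with.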

> Note: I did get better results with HT on vs. with HT off, so I've
> left HT on for now.

pgbench select-only in particular does like hyper-threading. We get occasional reports of more memory-bound workloads actually slowing down when it's turned on. I consider it a wash and leave it on. Purchasing and management people tend to get annoyed if they discover the core count of the server is half what they thought they were buying, and the potential downside of HT isn't big enough to be worth opening that can of worms unless you've run real application-level tests proving it hurts.

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



