On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> On 3/10/13 9:18 PM, Jon Nelson wrote:
>
>> The following is with ext4, nobarrier, and noatime. As noted in the
>> original post, I have done a fair bit of system tuning. I have the
>> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
>> respectively.
>
> That's good, but be aware those values are still essentially unlimited
> write caches. A server with 4 good but regular hard drives might do as
> little as 10MB/s of random writes on a real workload. If 2GB of data
> ends up dirty, the flushing that happens at the end of a database
> checkpoint will need to clear all of that out of RAM. When that
> happens, you're looking at a 3 minute long cache flush to push out 2GB.
> It's not unusual for pgbench tests to pause for over a minute straight
> when that happens. With your setup, where checkpoints happen every 5
> minutes, this is only happening once per test run. The disruption isn't
> easily visible if you look at the average rate; it's outweighed by the
> periods where writes happen very fast because the cache isn't full yet.
> You have to get pgbench to plot latency over time to see them and then
> analyze that data. This problem is the main reason I put together the
> pgbench-tools set for running things, because once you get to
> processing the latency files and making graphs from them it starts to
> be a pain to look at the results.

I'll try to find time for this, but it may need to wait until the
weekend again.
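Roughly what I have in mind when I do get to it (a sketch only -- I
haven't run this yet) is my same invocation below with -l added, which
has pgbench write per-transaction latency logs that pgbench-tools, or a
bit of scripting, can turn into a latency-over-time plot:

    # -l writes per-transaction latency logs (pgbench_log.*); plotting
    # those over time is what should expose checkpoint-related stalls
    pgbench -j 8 -c 32 -M prepared -T 600 -l

Incidentally, the 3 minute figure above checks out as plain arithmetic:
2GB of dirty data at ~10MB/s of random writes is roughly 200 seconds to
flush.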
>> I built 9.2 and used it with the following pgbench invocation:
>>
>> pgbench -j 8 -c 32 -M prepared -T 600
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 400
>
> I misread this completely in your message before; I thought you wrote
> 4000. A scaling factor of 400 is making a database that's 6GB in size.
> Your test is basically seeing how fast the system memory and the RAID
> cache can move things around. In that situation, your read and write
> numbers are reasonable. They aren't actually telling you anything
> useful about the disks though, because they're barely involved here.
> You've sniffed the CPU, memory, and RAID controller and they smell
> fine. You'll need at least an order of magnitude increase in scale to
> get a whiff of the disks.

LOL! Your phrasing is humorous and the information useful. I ran for
8.0 hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

> pgbench scale numbers give approximately 16MB per scale factor. You
> don't actually stress the drives until that total number is at least 2X
> as big as RAM. We had to raise the limit on the pgbench scales recently
> because it only goes up to ~20,000 on earlier versions, and that's not
> a big enough scale to test many servers now.
>
> On the select-only tests, much of the increase from ~100K to ~200K is
> probably going from 8.4 to 9.2. There are two major and several minor
> tuning changes that make it much more efficient at that specific task.
>
>> These are the *only* changes I've made to the config file:
>>
>> shared_buffers = 32GB
>> wal_buffers = 16MB
>> checkpoint_segments = 1024
>
> Note that these are the only changes that actually impact pgbench
> results. The test doesn't stress very many parts of the system, such as
> the query optimizer.
>
> Also be aware these values may not be practical to use in production.
> You can expect bad latency issues due to having shared_buffers so
> large. All that memory has to be reconciled and written to disk if it's
> been modified at each checkpoint, and 32GB of such work is a lot. I
> have systems where we can't make shared_buffers any bigger than 4GB
> before checkpoint pauses get too bad.
>
> Similarly, setting checkpoint_segments to 1024 means that you might go
> through 16GB of writes before a checkpoint happens. That's great for
> average performance...but when that checkpoint does hit, you're facing
> a large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll
see the actual checkpoints with 'sync' times typically below a few
seconds (when there is anything to do at all). I can't say I've seen
checkpoint pauses in my workloads.

> There's not much you can do about all this on the Linux side. If you
> drop the dirty_* parameters too much, maintenance operations like
> VACUUM start to get slow. Really all you can do is avoid setting
> shared_buffers and checkpoint_segments too high, so the checkpoint
> backlog never gets gigantic. The tuning you've done is using higher
> values than we normally recommend because it's not quite practical to
> deploy like that. That and the very small database are probably why
> your numbers are so high.

Mostly I do data warehouse-type workloads with very little (if any)
data modification after initial load time. Extensive benchmarking of
the actual applications involved has shown that - for me - a large (but
not too large) shared_buffers (32GB is right about the sweet spot for
me, perhaps a bit on the high side) works well. Additionally, the large
checkpoint_segments value really appears to help as well (again, this
is very workload dependent).

>> Note: I did get better results with HT on vs. with HT off, so I've
>> left HT on for now.
>
> pgbench select-only in particular does like hyper-threading. We get
> occasional reports of more memory-bound workloads actually slowing when
> it's turned on. I think it's a wash and leave it on. Purchasing and
> management people tend to get annoyed if they discover the core count
> of the server is half what they thought they were buying. The potential
> downside of HT isn't so big that it's worth opening that can of worms,
> unless you've run real application-level tests to prove it hurts.

Glad to get an "it's a wash" confirmation here.

--
Jon
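P.S. For the archives: the 'sync' times I mentioned come from checkpoint
logging, which only takes the following in postgresql.conf (the second
setting is my usual habit rather than anything discussed above, and 0.9
is just the value I tend to pick over the 0.5 default):

    # log one line per checkpoint, including write/sync/total times
    log_checkpoints = on
    # spread each checkpoint's writes over more of the interval
    checkpoint_completion_target = 0.9

And the 16GB Greg mentions is simply checkpoint_segments times the 16MB
WAL segment size: 1024 x 16MB = 16GB.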