On 3/10/13 9:18 PM, Jon Nelson wrote:
> The following is with ext4, nobarrier, and noatime. As noted in the original post, I have done a fair bit of system tuning. I have the dirty_bytes and dirty_background_bytes set to 3GB and 2GB, respectively.
That's good, but be aware those values are still essentially an unlimited write cache. A server with 4 good but regular hard drives might do as little as 10MB/s of random writes on a real workload. If 2GB of data ends up dirty, the flushing that happens at the end of a database checkpoint has to clear all of that out of RAM, and at that rate you're looking at a cache flush over 3 minutes long. It's not unusual for pgbench tests to pause for over a minute straight when that hits. With your setup, where checkpoints happen every 5 minutes, this is only happening once per test run.

The disruption isn't easily visible if you look at the average rate; it's outweighed by the periods where writes happen very fast because the cache isn't full yet. You have to get pgbench to log latency over time, then graph that data, to see the stalls. This problem is the main reason I put together the pgbench-tools set for running things; once you have to process the latency log files and turn them into graphs, looking at results by hand starts to be a pain.
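For reference, the way to capture that data is pgbench's per-transaction logging. A minimal sketch, from memory of the 9.2-era tool (file names and column layout vary a bit across versions):

  pgbench -j 8 -c 32 -M prepared -T 600 -l

The -l switch writes one line per transaction into pgbench_log.* files (roughly one per pgbench thread), with the latency in microseconds. Graph that against elapsed time and the checkpoint stalls show up as obvious spikes.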
> I built 9.2, and using 9.2 and the following pgbench invocation:
>
>   pgbench -j 8 -c 32 -M prepared -T 600
>
> transaction type: TPC-B (sort of)
> scaling factor: 400
I misread this completely in your message before; I thought you wrote 4000. A scaling factor of 400 is making a database that's 6GB in size. Your test is basically seeing how fast the system memory and the RAID cache can move things around. In that situation, your read and write numbers are reasonable. They aren't actually telling you anything useful about the disks though, because they're barely involved here. You've sniffed the CPU, memory, and RAID controller and they smell fine. You'll need at least an order of magnitude increase in scale to get a whiff of the disks.
Each pgbench scale factor adds approximately 16MB to the database. You don't actually stress the drives until the total database size is at least 2X as big as RAM. We recently had to raise the limit on pgbench scales, because it only goes up to ~20,000 on earlier versions, and that's not a big enough database to test many servers now.
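To put rough numbers on that (the 64GB of RAM here is only an assumption to illustrate the math; adjust for what's actually in the box):

  400 scale * 16MB  ≈ 6.4GB      the database you tested
  2 * 64GB / 16MB   = 8192 scale  roughly where the disks start to matter

So a scale somewhere in the several-thousand range is where the drives would actually get exercised.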
On the select-only tests, much of the increase from ~100K to ~200K TPS probably comes from going from 8.4 to 9.2. There are two major and several minor changes in there that make it much more efficient at that specific task.
> These are the *only* changes I've made to the config file:
>
> shared_buffers = 32GB
> wal_buffers = 16MB
> checkpoint_segments = 1024
Note that those are also just about the only changes that actually impact pgbench results. The test doesn't stress very many parts of the system; the query optimizer, for example, barely comes into play.
Also be aware these values may not be practical to use in production. You can expect bad latency issues from having shared_buffers that large: at each checkpoint, everything in there that's been modified has to be reconciled and written to disk, and 32GB of potential work there is a lot. I have systems where we can't make shared_buffers any bigger than 4GB before checkpoint pauses get too bad.
Similarly, setting checkpoint_segments to 1024 means that you might go through 16GB of writes before a checkpoint happens. That's great for average performance...but when that checkpoint does hit, you're facing a large random I/O backlog.
There's not much you can do about all this on the Linux side. If you drop the dirty_* parameters too far, maintenance operations like VACUUM start to get slow. Really all you can do is avoid setting shared_buffers and checkpoint_segments so high that the checkpoint backlog gets gigantic. The tuning you've done uses higher values than we normally recommend, since it's not quite practical to deploy like that in production. That and the very small database are probably why your numbers are so high.
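To give a concrete sketch of that more conservative direction (purely illustrative numbers; the right values depend on your hardware and how much latency you can tolerate):

  shared_buffers = 8GB                   # smaller pool of dirty buffers to reconcile at each checkpoint
  checkpoint_segments = 256              # roughly 4GB of WAL before a segment-triggered checkpoint
  checkpoint_completion_target = 0.9     # spread checkpoint writes over more of the interval

That trades some peak throughput for checkpoints small enough to flush without multi-minute stalls.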
> Note: I did get better results with HT on vs. with HT off, so I've left HT on for now.
pgbench select-only in particular does like hyper-threading. We get occasional reports of more memory-bound workloads actually slowing down when it's turned on. I consider it a wash and leave it on. Purchasing and management people tend to get annoyed if they discover the core count of the server is half what they thought they were buying. The potential downside of HT isn't so big that it's worth opening that can of worms, unless you've run real application-level tests that prove it hurts.
--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com