On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> On 3/10/13 9:18 PM, Jon Nelson wrote:
>
>> The following is with ext4, nobarrier, and noatime. As noted in the
>> original post, I have done a fair bit of system tuning. I have the
>> dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
>> respectively.
>
> That's good, but be aware those values are still essentially unlimited
> write caches. A server with 4 good but regular hard drives might do as
> little as 10MB/s of random writes on a real workload. If 2GB of data
> ends up dirty, the flushing that happens at the end of a database
> checkpoint will need to clear all of that out of RAM. When that
> happens, you're looking at a 3 minute long cache flush to push out 2GB.
> It's not unusual for pgbench tests to pause for over a minute straight
> when that happens. With your setup, where checkpoints happen every 5
> minutes, this is only happening once per test run. The disruption isn't
> easily visible if you look at the average rate; it's outweighed by the
> periods where writes happen very fast because the cache isn't full yet.
> You have to get pgbench to plot latency over time to see them and then
> analyze that data. This problem is the main reason I put together the
> pgbench-tools set for running things, because once you get to
> processing the latency files and making graphs from them it starts to
> be a pain to look at the results.

I'll try to find time for this, but it may need to wait until the
weekend again.
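Roughly what I have in mind when I do get to it (a sketch only -- I
haven't run this yet) is my same invocation below with -l added, which
has pgbench write per-transaction latency logs that pgbench-tools, or a
bit of scripting, can turn into a latency-over-time plot:

    # -l writes per-transaction latency logs (pgbench_log.*); plotting
    # those over time is what should expose checkpoint-related stalls
    pgbench -j 8 -c 32 -M prepared -T 600 -l

Incidentally, the 3 minute figure above checks out as plain arithmetic:
2GB of dirty data at ~10MB/s of random writes is roughly 200 seconds to
flush.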
>> I built 9.2 and used it with the following pgbench invocation:
>>
>> pgbench -j 8 -c 32 -M prepared -T 600
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 400
>
> I misread this completely in your message before; I thought you wrote
> 4000. A scaling factor of 400 is making a database that's 6GB in size.
> Your test is basically seeing how fast the system memory and the RAID
> cache can move things around. In that situation, your read and write
> numbers are reasonable. They aren't actually telling you anything
> useful about the disks though, because they're barely involved here.
> You've sniffed the CPU, memory, and RAID controller and they smell
> fine. You'll need at least an order of magnitude increase in scale to
> get a whiff of the disks.

LOL! Your phrasing is humorous and the information useful. I ran for
8.0 hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

> pgbench scale numbers give approximately 16MB per scale factor. You
> don't actually stress the drives until that total number is at least 2X
> as big as RAM. We had to raise the limit on the pgbench scales recently
> because it only goes up to ~20,000 on earlier versions, and that's not
> a big enough scale to test many servers now.
>
> On the select-only tests, much of the increase from ~100K to ~200K is
> probably going from 8.4 to 9.2. There are two major and several minor
> tuning changes that make it much more efficient at that specific task.
>
>> These are the *only* changes I've made to the config file:
>>
>> shared_buffers = 32GB
>> wal_buffers = 16MB
>> checkpoint_segments = 1024
>
> Note that these are the only changes that actually impact pgbench
> results. The test doesn't stress very many parts of the system, such as
> the query optimizer.
>
> Also be aware these values may not be practical to use in production.
> You can expect bad latency issues due to having shared_buffers so
> large. All that memory has to be reconciled and written to disk if it's
> been modified at each checkpoint, and 32GB of such work is a lot. I
> have systems where we can't make shared_buffers any bigger than 4GB
> before checkpoint pauses get too bad.
>
> Similarly, setting checkpoint_segments to 1024 means that you might go
> through 16GB of writes before a checkpoint happens. That's great for
> average performance...but when that checkpoint does hit, you're facing
> a large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll
see the actual checkpoints with 'sync' times typically below a few
seconds (when there is anything to do at all). I can't say I've seen
checkpoint pauses in my workloads.

> There's not much you can do about all this on the Linux side. If you
> drop the dirty_* parameters too much, maintenance operations like
> VACUUM start to get slow. Really all you can do is avoid setting
> shared_buffers and checkpoint_segments too high, so the checkpoint
> backlog never gets gigantic. The tuning you've done is using higher
> values than we normally recommend because it's not quite practical to
> deploy like that. That and the very small database are probably why
> your numbers are so high.

Mostly I do data warehouse-type workloads with very little (if any)
data modification after initial load time. Extensive benchmarking of
the actual applications involved has shown that - for me - a large (but
not too large) shared_buffers (32GB is right about the sweet spot for
me, perhaps a bit on the high side) works well. Additionally, the large
checkpoint_segments value really appears to help as well (again, this
is very workload dependent).

>> Note: I did get better results with HT on vs. with HT off, so I've
>> left HT on for now.
>
> pgbench select-only in particular does like hyper-threading. We get
> occasional reports of more memory-bound workloads actually slowing when
> it's turned on. I think it's a wash and leave it on. Purchasing and
> management people tend to get annoyed if they discover the core count
> of the server is half what they thought they were buying. The potential
> downside of HT isn't so big that it's worth opening that can of worms,
> unless you've run real application-level tests to prove it hurts.

Glad to get an "it's a wash" confirmation here.

--
Jon
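P.S. For the archives: the 'sync' times I mentioned come from checkpoint
logging, which only takes the following in postgresql.conf (the second
setting is my usual habit rather than anything discussed above, and 0.9
is just the value I tend to pick over the 0.5 default):

    # log one line per checkpoint, including write/sync/total times
    log_checkpoints = on
    # spread each checkpoint's writes over more of the interval
    checkpoint_completion_target = 0.9

And the 16GB Greg mentions is simply checkpoint_segments times the 16MB
WAL segment size: 1024 x 16MB = 16GB.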