Greg Smith, 09.06.2010 07:14:
I probably should have explained the next part. I've now shared what I do with this information at http://www.pgcon.org/2010/schedule/events/218.en.html Basically, if you put the data from the two snapshots into one of the Statistics Spreadsheet versions, you'll get several derived numbers that pop out: -Average checkpoint frequency -Average size of each checkpoint -Average rate at which new buffers are allocated -Average rate of writes out of the buffer cache -Percentage of writes done by checkpoints, the background writer LRU cleaner, and client backends These are the sort of things you can actually think about in useful real-world terms. And if you tune the database by doing things like increasing checkpoint_segments/checkpoint_timeout or changing the size of shared_buffers, you can see what impact that has on the overall system response, by noting how it changes these numeric measures. It's a bit more exact of an approach for tuning such things than what people have traditionally done here. Increasing shared_buffers and watching the total writes/second number drop significantly is more exact than just nothing whether the server waiting for I/O percentage dropped or not. I also highly recommend installing pg_buffercache and looking at what's inside your buffer cache too--the data actually being cached by the server never fails to surprise people, relative to what they expected was being cached. Source for how to analyze those things is in the pg_buffercache samples shell script, the "Buffer contents summary, with percentages" is the one that gives useful output. One of the thing I did in the presentation that does not come through on the slides is note how the distribution of high usage count data in your buffer cache determines whether you will benefit from making it large or not. I've reduced this to a single figure of merit now: "cumulative % of buffers that have a usage count of >=2". On a system that benefits from having a big buffer cache, that number will be quite large (86% in the first of the samples in the spreadsheet). On one that doesn't, it will be small (35% on the last system listed there). You really have to tune those two types of workload differently.
Thanks a lot for this detailed information! I don't think we have a caching issue though, it's just one (regular) bulk import that does some heavy writes which we need to tune. Read performance is absolutely fine, but everything around vacuum and checkpointing seems to slow down the import massively. So would taking a snapshot before and after this import be the right way to start the analysis? Regards Thomas -- Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin