david@xxxxxxx wrote:
> On Thu, 8 Jan 2009, Dmitry Koterov wrote:
>
>> OK, thank you.
>>
>> Now - PostgreSQL-related question. If the system reorders writes to
>> minimize seeking, I suppose that in a heavily write-loaded PostgreSQL
>> installation, dstat (or iostat) real-time write statistics should be
>> close to the maximum possible value reported by the bonnie++ (or
>> simple dd) utility.
>
> this is not the case for a couple of reasons
>
> 1. bonnie++ and dd tend to write in one area, so seeks are not as big a
> factor as writing across multiple areas
>
> 2. postgres doesn't do the simple writes like you described earlier
>
> it does something like
>
> write 123-124-fsync-586-354-257-fsync-123-124-125-fsync
>
> (writes to the WAL journal, syncs it to make sure it's safe, then writes
> to the destinations, then syncs, then updates the WAL to record that
> it's written....)
>
> the fsync basically tells the system, don't write anything more until
> these are done, and interrupts the nice write pattern.
>
> you can address this by having large battery-backed caches that you
> write to and they batch things out to disk more efficiently.
>
> or you can put your WAL on a separate drive so that the syncs on that
> don't affect the data drives (but you will still have syncs on the data
> disks, just not as many of them)
>
> David Lang
>

1. There are four Linux I/O schedulers to choose from in the 2.6 kernel.
If you *aren't* on the 2.6 kernel, give me a shout when you are. :)

2. You can choose the scheduler in use "on the fly". This means you can
set up a benchmark of your *real-world* application and run it four
times, once with each scheduler, *without* having to reboot or any of
that nonsense. That said, you will probably want to introduce some kind
of "page cache poisoning" technique between these runs to force your
benchmark to read every block of data at least once off the hard drive.

3.
As I learned a few weeks ago, even simple 160 GB single SATA drives now
have some kind of scheduling algorithm built in, so your tests may not
show significant differences between the four schedulers. This is even
more the case for high-end SANs. You simply must test with your real
workload, rather than using bonnie++, iozone, or fio, to make an
intelligent scheduler choice.

4. For those that absolutely need fine-grained optimization, there is an
open-source tool called "blktrace" that is essentially a "sniffer for
I/O". It is maintained by Jens Axboe of Oracle, who also maintains the
Linux block I/O layer! There is a "driver" called "seekwatcher", also
open source and maintained by Chris Mason of Oracle, that will give you
visualizations of the "blktrace" results. In any event, if you need to
know, you can find out exactly what the scheduler is doing block by
block with "blktrace".

You can track all of this magic down via Google. If there's enough
interest and I have some free cycles, I'll post an extended "howto" on
doing this. But it only took me a week or so to figure it out from
scratch, and the documentation on "seekwatcher" and "blktrace" is
excellent.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
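
For point 2, here is a rough Python sketch of what such a comparison run
might look like. It is only an illustration under some assumptions of
mine, not a tested tool: it assumes the scheduler is exposed at
/sys/block/<device>/queue/scheduler with the active one shown in square
brackets, it uses /proc/sys/vm/drop_caches as a stand-in for "page cache
poisoning" (that only empties the cache, it does not fill it with junk),
and it must run as root. The function names are made up for this example.

```python
import os
import subprocess
import time
from pathlib import Path


def parse_schedulers(text):
    """Parse sysfs scheduler text such as 'noop anticipatory deadline [cfq]'.

    Returns (available_names, active_name); the active one is in brackets.
    """
    names, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        names.append(token)
    return names, active


def scheduler_path(device):
    # Assumed sysfs location for a block device such as "sda".
    return Path("/sys/block") / device / "queue" / "scheduler"


def drop_page_cache():
    # Assumption: flushing dirty pages and then dropping clean caches is
    # enough to make the next run read its data from disk again.
    os.sync()
    Path("/proc/sys/vm/drop_caches").write_text("3\n")


def compare_schedulers(device, command):
    """Run `command` once per available scheduler; return seconds per name."""
    path = scheduler_path(device)
    names, original = parse_schedulers(path.read_text())
    results = {}
    try:
        for name in names:
            path.write_text(name)        # switch scheduler on the fly
            drop_page_cache()            # start each run with a cold cache
            start = time.monotonic()
            subprocess.run(command, check=True)
            results[name] = time.monotonic() - start
    finally:
        if original:
            path.write_text(original)    # put the original scheduler back
    return results
```

Something like compare_schedulers("sda", ["./my_real_workload.sh"]) would
then return one wall-clock time per scheduler, where the script name is a
placeholder for your own application benchmark. A single run per
scheduler is noisy, so repeat it a few times before drawing conclusions.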