Re: Are random writes optimized sequentially by Linux kernel?

"M. Edward (Ed) Borasky" <znmeb@xxxxxxxxxxx> · Wed, 07 Jan 2009 19:07:31 -0800

david@xxxxxxx wrote:
> On Thu, 8 Jan 2009, Dmitry Koterov wrote:
> 
>> OK, thank you.
>>
>> Now - PostgreSQL-related question. If the system reorders writes to
>> minimize seeking, I suppose that in a heavily write-loaded PostgreSQL
>> installation, dstat (or iostat) realtime write statistics should be
>> close to the maximum possible value reported by the bonnie++ (or
>> simple dd) utility.
> 
> this is not the case for a couple of reasons
> 
> 1. bonnie++ and dd tend to write in one area, so seeks are not as big a
> factor as writing across multiple areas
> 
> 2. postgres doesn't do the simple writes like you described earlier
> 
> it does something like
> 
> write 123-124-fsync-586-354-257-fsync-123-124-125-fsync
> 
> (writes to the WAL journal, syncs it to make sure it's safe, then writes
> to the destinations, then syncs, then updates the WAL to record that
> it's written....)
> 
> the fsync basically tells the system: don't write anything more until
> these are done. That interrupts the nice write pattern.
> 
> you can address this by having large battery-backed caches that you
> write to and they batch things out to disk more efficiently.
> 
> or you can put your WAL on a separate drive so that the syncs on that
> don't affect the data drives (but you will still have syncs on the data
> disks, just not as many of them)
> 
> David Lang
> 
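
For anyone who wants to poke at the pattern David describes, here is a
minimal Python sketch of it. It only replays his example sequence against
two scratch files; the block size, the file roles and the function names
are my assumptions for illustration, not anything PostgreSQL actually
exposes.

```python
import os

BLOCK = 8192  # assumed block size, chosen only for illustration


def write_block(fd, block_no, payload=b"x"):
    """Write one block-sized chunk at the offset of the given block number."""
    os.pwrite(fd, payload * BLOCK, block_no * BLOCK)


def wal_style_pattern(wal_fd, data_fd):
    """Replay write 123-124-fsync-586-354-257-fsync-123-124-125-fsync.

    Returns the list of operations issued, in order.
    """
    ops = []
    # 1. Log the intent in the "WAL" file, then force it to stable storage.
    for block_no in (123, 124):
        write_block(wal_fd, block_no)
        ops.append(("wal", block_no))
    os.fsync(wal_fd)
    ops.append(("fsync", "wal"))
    # 2. Write the scattered destination blocks, then sync those too.
    for block_no in (586, 354, 257):
        write_block(data_fd, block_no)
        ops.append(("data", block_no))
    os.fsync(data_fd)
    ops.append(("fsync", "data"))
    # 3. Record in the "WAL" that the data has been written, and sync again.
    for block_no in (123, 124, 125):
        write_block(wal_fd, block_no)
        ops.append(("wal", block_no))
    os.fsync(wal_fd)
    ops.append(("fsync", "wal"))
    return ops
```

Each fsync is a point the kernel cannot reorder writes across, which is
why the elevator gets far less room to work than it does with a plain dd.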

1. There are four Linux I/O schedulers to choose from in the 2.6 kernel:
noop, anticipatory, deadline, and CFQ. If you *aren't* on the 2.6 kernel,
give me a shout when you are. :)

2. You can choose the scheduler in use "on the fly", per block device, by
writing its name to /sys/block/<device>/queue/scheduler. This means you can
set up a benchmark of your *real-world* application and run it four
times, once with each scheduler, *without* having to reboot or any of
that nonsense. That said, you will probably want to introduce some kind
of "page cache poisoning" technique between these runs, so that each run
has to read every block of data off the hard drive at least once rather
than out of the page cache.

3. As I learned a few weeks ago, even simple 160 GB single SATA drives
now have some kind of scheduling algorithm built in, so your tests may
not show significant differences between the four schedulers. This is
even more the case for high-end SANs. You simply must test with your
real workload, rather than using bonnie++, iozone, or fio, to make an
intelligent scheduler choice.

4. For those who absolutely need fine-grained optimization, there is an
open-source tool called "blktrace" that is essentially a "sniffer for
I/O". It is maintained by Jens Axboe of Oracle, who also maintains the
Linux block I/O layer! There is a "driver" called "seekwatcher", also
open source and maintained by Chris Mason of Oracle, that will give you
visualizations of the "blktrace" results. In any event, if you need to
know, you can find out exactly what the scheduler is doing block by
block with "blktrace".
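
As a rough sketch of point 2, something like the following Python could
drive the four runs. The sysfs and procfs paths are the standard kernel
interfaces; the function names and the run_benchmark callable are
hypothetical placeholders for whatever drives your real workload, and the
writes need root.

```python
import os
from pathlib import Path


def parse_schedulers(text):
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    The kernel lists all schedulers and marks the active one in brackets,
    e.g. "noop anticipatory deadline [cfq]".
    Returns (available_names, active_name).
    """
    available, active = [], None
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


def scheduler_path(device):
    """Path of the scheduler control file for a device such as 'sda'."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device, name):
    """Switch the I/O scheduler for one device (requires root)."""
    scheduler_path(device).write_text(name)


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop its caches (requires root)."""
    os.sync()
    Path("/proc/sys/vm/drop_caches").write_text("3\n")


def benchmark_all(device, run_benchmark):
    """Run the hypothetical run_benchmark() once per available scheduler."""
    available, original = parse_schedulers(scheduler_path(device).read_text())
    results = {}
    try:
        for name in available:
            set_scheduler(device, name)
            drop_page_cache()  # start every run with a cold page cache
            results[name] = run_benchmark()
    finally:
        if original:
            set_scheduler(device, original)  # restore the original choice
    return results
```

Dropping the caches is only one way to get a cold start; reading a file
larger than RAM between runs is the cruder alternative.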

You can track all of this magic down via Google. If there's enough
interest and I have some free cycles, I'll post an extended "howto" on
doing this. But it only took me a week or so to figure it out from
scratch, and the documentation on "seekwatcher" and "blktrace" is
excellent.
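
Until then, here is the short version as a Python sketch that just builds
and runs the command lines. The flags are the ones I recall from the
blktrace, blkparse and seekwatcher documentation (-d device, -o output
basename, -w seconds to run; -i input basename; -t trace file, -o image
file), so check them against the versions you have installed. The helper
names and default values are mine, and blktrace needs root and a mounted
debugfs.

```python
import subprocess


def blktrace_cmd(device, basename, seconds):
    """Trace block I/O on `device` for `seconds`, writing <basename>.blktrace.* files."""
    return ["blktrace", "-d", device, "-o", basename, "-w", str(seconds)]


def blkparse_cmd(basename):
    """Turn the binary trace files into a readable, event-by-event listing."""
    return ["blkparse", "-i", basename]


def seekwatcher_cmd(basename, image):
    """Plot throughput and seek activity from the trace into an image file."""
    return ["seekwatcher", "-t", basename, "-o", image]


def trace_and_plot(device, basename="trace", seconds=30):
    """Capture a trace while your workload runs, then render the graph."""
    subprocess.run(blktrace_cmd(device, basename, seconds), check=True)
    subprocess.run(seekwatcher_cmd(basename, basename + ".png"), check=True)
```

Start your real workload first, then call something like
trace_and_plot("/dev/sdb") while it is running.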
