On Wed, 26 Dec 2007, Mark Mielke wrote:
david@xxxxxxx wrote:
Thanks for the explanation David. It's good to know not only what but also
why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
read: the one with the data and the parity disk?
no, because the parity is of the sort (A+B+C+P) mod X = 0
so if X=10 (which means in practice that only the last decimal digit of
anything matters, very convenient for examples)
A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
if you read B and get 3 and P and get 4 you don't know if this is right or
not unless you also read A and C (at which point you would get
A+B+C+P=11=1=error)
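A minimal sketch of the toy checksum above, in Python. This is the mod-10 illustration only, not how RAID 5 actually computes parity, and the function names are made up for the example. It shows that the check cannot be evaluated without every block in the stripe.

```python
# Toy checksum from the example: (A + B + C + P) mod X == 0, with X = 10.
# Illustration only; real RAID 5 uses XOR parity, not a decimal sum.
X = 10

def parity_for(data_blocks, modulus=X):
    """Return P such that (sum(data_blocks) + P) % modulus == 0."""
    return (-sum(data_blocks)) % modulus

def stripe_is_consistent(data_blocks, parity, modulus=X):
    """The check needs every data block in the stripe plus the parity."""
    return (sum(data_blocks) + parity) % modulus == 0

p = parity_for([1, 2, 3])                   # A=1, B=2, C=3 gives P=4
ok = stripe_is_consistent([1, 2, 3], p)     # all blocks read back correctly
bad = stripe_is_consistent([1, 3, 3], p)    # B read back as 3 instead of 2
```

Here `ok` is true and `bad` is false, but neither answer is available from B and P alone.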
I don't think this is correct. RAID 5 is parity which is XOR. The property of
XOR is such that it doesn't matter what the other drives are. You can write
any block given either: 1) The block you are overwriting and the parity, or
2) all other blocks except for the block we are writing and the parity. Now,
it might be possible that option 2) is taken more than option 1) for some
complicated reasons, but it is NOT to check consistency. The array is assumed
consistent until proven otherwise.
I was being sloppy in explaining the reason; you are correct that for
writes you don't need to read all the data, you just need the current
parity block, the old data you are going to replace, and the new data to
be able to calculate the new parity block (and note that even with my
checksum example this would be the case).
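A small Python sketch of that write path with XOR parity, assuming equal-length blocks held as bytes. The helper names are hypothetical, not taken from any RAID implementation. It shows that the new parity computed from only the old parity, the old data, and the new data matches a full recalculation over the stripe.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def updated_parity(old_parity, old_data, new_data):
    """Read-modify-write: the other data blocks are never read."""
    return xor_blocks(old_parity, old_data, new_data)

# A three-data-block stripe with made-up contents.
a, b, c = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = xor_blocks(a, b, c)

new_b = b"\xaa\x55"
rmw_parity = updated_parity(parity, b, new_b)   # uses parity, old B, new B only
full_parity = xor_blocks(a, new_b, c)           # recomputed from the whole stripe
```

The two results are equal, and XORing the parity with A and C gives back B, which is the same property used to rebuild a failed drive.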
however I was addressing the point that for reads you can't do any
checking until you have read in all the blocks.
if you never check the consistency, how will it ever be proven otherwise?
in theory a system could get the same performance with a large sequential
read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same
number of data disks, ignoring the parity disks) because the OS could read
the entire stripe in at once, do the calculation once, and use all the data
(or when writing, don't write anything until you are ready to write the
entire stripe, calculate the parity and write everything once).
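A rough Python sketch of why writing the whole stripe at once matters. It is a simplified I/O count that assumes no caching and that every small write goes through read-modify-write; the function names are invented for the example.

```python
def full_stripe_write_ios(data_disks, parity_disks=1):
    """Whole stripe buffered first: parity is computed in memory, no reads."""
    return {"reads": 0, "writes": data_disks + parity_disks}

def small_writes_ios(chunks, parity_disks=1):
    """Each chunk written alone: read old data and old parity,
    then write new data and new parity."""
    per_chunk = 1 + parity_disks
    return {"reads": chunks * per_chunk, "writes": chunks * per_chunk}

# 10 data disks + 1 parity disk (raid5), one stripe's worth of data.
one_pass = full_stripe_write_ios(10)    # {'reads': 0, 'writes': 11}
piecemeal = small_writes_ios(10)        # {'reads': 20, 'writes': 20}
```

Under these assumptions the same stripe costs 11 I/Os when flushed at once and 40 when written one chunk at a time.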
For the same number of drives, this cannot be possible. With 10 disks, on
raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum
performance is only 9/10 of the 10/10 performance possible with RAID 0.
I was saying that a 10 drive raid0 could be the same performance as a 10+1
drive raid 5 or a 10+2 drive raid 6 array.
this is why I said 'same number of data disks, ignoring the parity disks'.
in practice you would probably not do quite this well anyway (you have the
parity calculation to make and the extra drive or two's worth of data
passing over your busses), but it could be a lot closer than any
implementation currently is.
Unfortunately, in practice filesystems don't support this: they don't do
enough readahead to want to keep the entire stripe (so after they read it
all in they throw some of it away), they (mostly) don't know where a stripe
starts (and so intermingle different types of data on one stripe and spread
data across multiple stripes unnecessarily), and they tend to do writes in
small, scattered chunks (rather than flushing an entire stripe's worth of
data at once).
In my experience, this theoretical maximum is not attainable without
significant write cache, and an intelligent controller, neither of which
Linux software RAID seems to have by default. My situation was a bit worse in
that I used applications that fsync() or journalled metadata that is ordered,
which forces the Linux software RAID to flush far more than it should - but
the same system works very well with RAID 1+0.
my statements above apply to any type of raid implementation, hardware or
software.
the thing that saves the hardware implementation is that the data is
written to a battery-backed cache and the controller lies to the system,
telling it that the write is complete, and then it does the write later.
on a journaling filesystem you could get very similar results if you put
the journal on a solid-state drive.
but for your application, the fact that you are doing lots of fsyncs is
what's killing you, because the fsync forces a lot of data to be written
out, swamping the caches involved, and requiring that you wait for seeks.
nothing other than a battery-backed disk cache of some sort will fix that
(either one on the controller or a solid-state drive holding the journal of
a journaled filesystem would work).
David Lang