Re: With 4 disks should I go for RAID 5 or RAID 10

david@xxxxxxx · Wed, 26 Dec 2007 15:34:35 -0800 (PST)

On Wed, 26 Dec 2007, Mark Mielke wrote:

david@xxxxxxx wrote:
Thanks for the explanation David. It's good to know not only what but also
why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
read: the one with the data and the parity disk?
no, because the parity is of the sort (A+B+C+P) mod X = 0
so if X=10 (which means in practice that only the last decimal digit of 
anything matters, very convenient for examples)
A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
if you read B and get 3 and P and get 4 you don't know if this is right or 
not unless you also read A and C (at which point you would get 
A+B+C+P=11=1=error)
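
The mod-10 example above can be sketched in a few lines of Python. This only illustrates the toy checksum from the quoted text (real RAID 5 parity is XOR, as the reply points out), and the function names are made up for the sketch.

```python
X = 10  # modulus from the example; only the last decimal digit matters

def parity_for(data_blocks, modulus=X):
    """Return P such that (sum(data_blocks) + P) mod modulus == 0."""
    return (-sum(data_blocks)) % modulus

def stripe_is_consistent(data_blocks, parity, modulus=X):
    """Checking needs every data block plus the parity block."""
    return (sum(data_blocks) + parity) % modulus == 0

p = parity_for([1, 2, 3])                   # A=1, B=2, C=3
ok = stripe_is_consistent([1, 2, 3], p)     # all blocks read back correctly
bad = stripe_is_consistent([1, 3, 3], p)    # B read back as 3: sum is 11, not 10
```

With only B and P in hand there is nothing to compare against, which is why the check needs A and C as well.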
I don't think this is correct. RAID 5 is parity which is XOR. The property of 
XOR is such that it doesn't matter what the other drives are. You can write 
any block given either: 1) The block you are overwriting and the parity, or 
2) all other blocks except for the block we are writing and the parity. Now, 
it might be possible that option 2) is taken more than option 1) for some 
complicated reasons, but it is NOT to check consistency. The array is assumed 
consistent until proven otherwise.

I was being sloppy in explaining the reason, you are correct that for 
writes you don't need to read all the data, you just need the current 
parity block, the old data you are going to replace, and the new data to 
be able to calculate the new parity block (and note that even with my 
checksum example this would be the case).
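
A minimal sketch of that parity update, assuming plain XOR parity over equal-sized blocks. The helper names and the sample byte values are hypothetical; the point is only that the shortcut (old parity, old data, new data) gives the same parity as recomputing from the whole stripe.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def updated_parity(old_parity, old_data, new_data):
    """Read-modify-write: new P = old P xor old D xor new D."""
    return xor_blocks(old_parity, old_data, new_data)

a, b, c = b"\x01\x10", b"\x02\x20", b"\x03\x30"
parity = xor_blocks(a, b, c)

new_b = b"\x7f\x55"
rmw = updated_parity(parity, b, new_b)   # touches only B and P
full = xor_blocks(a, new_b, c)           # recomputed from the whole stripe
recovered_b = xor_blocks(rmw, a, c)      # rebuilding B from the other blocks
```

Here `rmw` and `full` are the same bytes, and `recovered_b` equals `new_b`, which is the property that lets a degraded array rebuild a missing block.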

however I was addressing the point that for reads you can't do any 
checking until you have read in all the blocks.

if you never check the consistency, how will it ever be proven otherwise?

in theory a system could get the same performance with a large sequential 
read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same 
number of data disks, ignoring the parity disks) because the OS could read 
the entire stripe in at once, do the calculation once, and use all the data 
(or when writing, don't write anything until you are ready to write the 
entire stripe, calculate the parity and write everything once).
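
A rough way to see the difference is to count disk I/Os per data block written. The sketch below assumes the usual read-modify-write scheme for a small write (read old data and old parity, write new data and new parity) and a full-stripe write that needs no reads; the function names are hypothetical and `n_parity=2` stands in for raid6.

```python
def small_write_ios(n_parity=1):
    """Read-modify-write of one data block: read the old data block and each
    parity block, then write the new data block and each parity block."""
    return 2 * (1 + n_parity)

def full_stripe_write_ios(n_data, n_parity=1):
    """Whole stripe buffered first: no reads, one write per disk."""
    return n_data + n_parity

n_data = 10
per_block_small = small_write_ios(1)                          # I/Os per data block
per_block_full = full_stripe_write_ios(n_data, 1) / n_data    # I/Os per data block
```

For 10 data disks plus one parity disk this works out to 4 I/Os per data block for scattered small writes against 1.1 for full-stripe writes, which is close to the 1.0 of a 10-disk raid0.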
For the same number of drives, this cannot be possible. With 10 disks, on 
raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum 
performance is only 9/10 of the 10/10 performance possible with RAID 0.

I was saying that a 10 drive raid0 could be the same performance as a 10+1 
drive raid 5 or a 10+2 drive raid 6 array.

this is why I said 'same number of data disks, ignoring the parity disks'.

in practice you would probably not do quite this good anyway (you have the 
parity calculation to make and the extra drive or two's worth of data 
passing over your busses), but it could be a lot closer than any 
implementation currently is.

Unfortunately in practice filesystems don't support this, they don't do 
enough readahead to want to keep the entire stripe (so after they read it 
all in they throw some of it away), they (mostly) don't know where a stripe 
starts (and so intermingle different types of data on one stripe and spread 
data across multiple stripes unnecessarily), and they tend to do writes in 
small, scattered chunks (rather than flushing an entire stripe's worth of 
data at once).
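
To make the alignment point concrete, here is a small sketch that works out which stripes a byte range touches and how many it covers completely. The chunk size, disk count and function name are assumptions for illustration, not values from any particular array.

```python
def stripes_touched(offset, length, chunk_size, n_data):
    """For a byte range with length > 0, return (first, last, full): the first
    and last stripe index it touches and how many stripes it covers fully."""
    stripe_size = chunk_size * n_data        # data bytes per stripe, parity excluded
    end = offset + length                    # exclusive end of the range
    first = offset // stripe_size
    last = (end - 1) // stripe_size
    first_full = -(-offset // stripe_size)   # ceiling division
    last_full = end // stripe_size
    full = max(0, last_full - first_full)
    return first, last, full

CHUNK = 64 * 1024   # assumed chunk size
N_DATA = 3          # assumed number of data disks

aligned = stripes_touched(0, 3 * CHUNK, CHUNK, N_DATA)
shifted = stripes_touched(CHUNK, 3 * CHUNK, CHUNK, N_DATA)
```

The aligned write covers exactly one stripe and can be written with a single parity calculation. The same amount of data shifted by one chunk touches two stripes and fills neither, so both need their old contents read back before parity can be updated.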
In my experience, this theoretical maximum is not attainable without 
significant write cache, and an intelligent controller, neither of which 
Linux software RAID seems to have by default. My situation was a bit worse in 
that I used applications that fsync() or journalled metadata that is ordered, 
which forces the Linux software RAID to flush far more than it should - but 
the same system works very well with RAID 1+0.

my statements above apply to any type of raid implementation, hardware or 
software.

the thing that saves the hardware implementation is that the data is 
written to a battery-backed cache and the controller lies to the system, 
telling it that the write is complete, and then it does the write later.

on a journaling filesystem you could get very similar results if you put 
the journal on a solid-state drive.

but for your application, the fact that you are doing lots of fsyncs is 
what's killing you, because the fsync forces a lot of data to be written 
out, swamping the caches involved, and requiring that you wait for seeks. 
nothing other than a battery-backed disk cache of some sort (either on the 
controller or a solid-state drive on a journaled filesystem) would work.

David Lang
