Re: With 4 disks should I go for RAID 5 or RAID 10


 



On Wed, 26 Dec 2007, Mark Mielke wrote:

david@xxxxxxx wrote:
Thanks for the explanation David. It's good to know not only what but also
why. Still I wonder why reads do hit all drives. Shouldn't only 2 disks be
read: the one with the data and the parity disk?
no, because the parity is of the sort (A+B+C+P) mod X = 0
so if X=10 (which means in practice that only the last decimal digit of anything matters, very convenient for examples)
A=1, B=2, C=3, A+B+C=6, P=4, A+B+C+P=10=0
if you read B and get 3, and read P and get 4, you don't know if this is right or not unless you also read A and C (at which point you would get A+B+C+P=11=1=error)
I don't think this is correct. RAID 5 is parity which is XOR. The property of XOR is such that it doesn't matter what the other drives are. You can write any block given either: 1) The block you are overwriting and the parity, or 2) all other blocks except for the block we are writing and the parity. Now, it might be possible that option 2) is taken more than option 1) for some complicated reasons, but it is NOT to check consistency. The array is assumed consistent until proven otherwise.

I was being sloppy in explaining the reason. you are correct that for writes you don't need to read all the data; you just need the current parity block, the old data you are going to replace, and the new data to be able to calculate the new parity block (and note that even with my checksum example this would be the case).
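The two write paths discussed above can be sketched in a few lines. This is only an illustration of the XOR property, not how any particular RAID implementation is written; the 4-byte blocks, the three-data-disk layout, and the helper name xor_blocks are assumptions made for the example.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length byte blocks together (hypothetical helper)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Hypothetical stripe: three data blocks A, B, C plus their parity P.
a = b"\x01\x02\x03\x04"
b = b"\x10\x20\x30\x40"
c = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks(a, b, c)

new_b = b"\xde\xad\xbe\xef"  # the data that will overwrite block B

# Option 1 (read-modify-write): needs only old B, old P, and the new data.
parity_rmw = xor_blocks(parity, b, new_b)

# Option 2 (reconstruct-write): needs every other data block, but not old B or old P.
parity_full = xor_blocks(a, new_b, c)

# Both routes give the same new parity block.
assert parity_rmw == parity_full
```

Option 1 touches two disks regardless of how wide the array is, which is why it is usually the cheaper path for a small write.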

however I was addressing the point that for reads you can't do any checking until you have read in all the blocks.

if you never check the consistency, how will it ever be proven otherwise?
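To make the read-side point concrete, here is a minimal sketch of a stripe consistency check using XOR parity (the XOR analogue of the mod-10 checksum example). The block contents and the function name stripe_is_consistent are hypothetical. The check needs every data block plus the parity block as input, so it cannot be done after reading only one data block and the parity.

```python
from functools import reduce

def stripe_is_consistent(data_blocks, parity):
    """Return True if the XOR of all data blocks and the parity block is all zeros."""
    combined = bytes(
        reduce(lambda x, y: x ^ y, column) for column in zip(*data_blocks, parity)
    )
    return not any(combined)

# Hypothetical stripe with three 2-byte data blocks.
a, b, c = b"\x01\x01", b"\x02\x02", b"\x03\x03"
parity = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

bad_b = b"\x03\x02"  # block B as it might come back after silent corruption

print(stripe_is_consistent([a, b, c], parity))      # healthy stripe
print(stripe_is_consistent([a, bad_b, c], parity))  # corrupted stripe
```

Note that even when the check fails, the parity alone does not say which block is the bad one, only that the stripe as a whole does not add up.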

in theory a system could get the same performance with a large sequential read/write on raid5/6 as on a raid0 array of equivalent size (i.e. same number of data disks, ignoring the parity disks) because the OS could read the entire stripe in at once, do the calculation once, and use all the data (or when writing, not write anything until it is ready to write the entire stripe, then calculate the parity and write everything once).
For the same number of drives, this cannot be possible. With 10 disks, on raid5, 9 disks hold data, and 1 holds parity. The theoretical maximum performance is only 9/10 of the 10/10 performance possible with RAID 0.

I was saying that a 10 drive raid0 could be the same performance as a 10+1 drive raid 5 or a 10+2 drive raid 6 array.

this is why I said 'same number of data disks, ignoring the parity disks'.
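The two ways of counting disks in this exchange can be put side by side with some back-of-the-envelope arithmetic. This is an idealised model only: it assumes full-stripe sequential I/O, free parity calculation, no bus limits, and a made-up figure of 100 MB/s per disk.

```python
def ideal_sequential_throughput(total_disks: int, parity_disks: int,
                                per_disk_mb_s: float = 100.0) -> float:
    """Best-case streaming throughput: only the data disks carry useful payload."""
    return (total_disks - parity_disks) * per_disk_mb_s

# Same total number of drives (Mark's comparison): RAID 5 gets 9/10 of RAID 0.
raid0_10 = ideal_sequential_throughput(10, 0)   # 1000.0
raid5_10 = ideal_sequential_throughput(10, 1)   # 900.0

# Same number of data drives (David's comparison): all three match.
raid5_11 = ideal_sequential_throughput(11, 1)   # 1000.0
raid6_12 = ideal_sequential_throughput(12, 2)   # 1000.0

print(raid0_10, raid5_10, raid5_11, raid6_12)
```

Under this model both statements hold at once; they simply fix different quantities (total drives versus data drives).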

in practice you would probably not do quite this well anyway (you have the parity calculation to make and the extra drive or two's worth of data passing over your busses), but it could be a lot closer than any implementation currently is.

Unfortunately, in practice filesystems don't support this. they don't do enough readahead to want to keep the entire stripe (so after they read it all in they throw some of it away), they (mostly) don't know where a stripe starts (and so intermingle different types of data on one stripe and spread data across multiple stripes unnecessarily), and they tend to do writes in small, scattered chunks (rather than flushing an entire stripe's worth of data at once).
In my experience, this theoretical maximum is not attainable without significant write cache, and an intelligent controller, neither of which Linux software RAID seems to have by default. My situation was a bit worse in that I used applications that fsync() or journalled metadata that is ordered, which forces the Linux software RAID to flush far more than it should - but the same system works very well with RAID 1+0.

my statements above apply to any type of raid implementation, hardware or software.

the thing that saves the hardware implementation is that the data is written to a battery-backed cache and the controller lies to the system, telling it that the write is complete, and then it does the write later.

on a journaling filesystem you could get very similar results if you put the journal on a solid-state drive.

but for your application, the fact that you are doing lots of fsyncs is what's killing you, because each fsync forces a lot of data to be written out, swamping the caches involved, and requiring that you wait for seeks. nothing other than a battery-backed disk cache of some sort (either on the controller or a solid-state drive on a journaled filesystem) would work.

David Lang

