Re: mdadm software raid5 arrays?

Hi Simon,

On Sat, Nov 07, 2009 at 07:38:03AM +1300, Simon Baxter wrote:
> Hi
>
> I've been running logical volume management (LVMs) on my production VDR 
> box for years, but recently had a drive failure.  To be honest, in the 
> ~20 years I've had PCs in the house, this is the first time a drive 
> failed!
>
> Anyway, I've bought 3x 1.5 TB SATA disks which I'd like to put into a  
> software (mdadm) raid 5 array.
>
...
>
> I regularly record 3 and sometimes 4 channels simultaneously, while 
> watching a recording.  Under regular LVM, this sometimes seemed to cause 
> some "slow downs".

I know I risk a flame war here but I feel obliged to say it:
Avoid raid5 if you can! It is fun to play with, but
if you care about your data, buy a fourth drive and do raid1+0
(mirroring and striping) instead.

Raid 5 is very fast on linear read operations because the load
is spread across all the available drives.
But if you are going to run vdr on that drive array, you are going
to do a lot of write operations, and raid5 is bad if you do a lot
of writes for a very simple reason.

Take a raid5 array with X devices. If you want to write just one
block, you need to read 2 blocks (the old data that you are
going to overwrite and the old parity) and you need to write 2
blocks (one with the actual data and one with the new parity).

In the best case, the disk block that you are going to
overwrite is already in RAM, but the parity block almost never
will be. Only if you keep writing the same block over and over
will you have both the data and the parity block cached.
In most cases (and certainly in the case of writing data streams
on disk) you'll need to read two blocks before you can calculate
the new parity and write it back to the disks along with your data.

So in short you do two reads and two writes for every write operation.
There goes your performance...
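
To make the arithmetic concrete, here is a minimal sketch of that
read-modify-write parity update (the block contents are made up; the
point is just to show where the two reads and two writes come from):

# RAID5 single-block write, read-modify-write style.
# Invented block contents; the comments mark the physical
# operations a real array would have to perform.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equally sized blocks.
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes([0x11] * 8)   # read 1: old data block from the data drive
old_parity = bytes([0x55] * 8)   # read 2: old parity block from the parity drive
new_data   = bytes([0x22] * 8)   # the block we actually want to store

# new parity = old parity XOR old data XOR new data
new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)

# write 1: new_data to the data drive
# write 2: new_parity to the parity drive
print("new parity:", new_parity.hex())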

Now about drive failures... if one of the X disks fails, you can still
read blocks on the OK drives with just one read operation, but you
need X-1 read operations for every read on the failed drive.
Writes on OK drives cost the same two reads/two writes as before
(only when the failed drive held the parity for that block can you
skip the two reads and the parity write and just write the data).
If, however, you need to write to the failed drive, then you need
to read the other X-1 drives in the array to first reconstruct
the missing data, and only then can you calculate and write the new
parity. (And then you throw away the actual data that you were
going to write, because the drive you could write it to is
gone...)
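
The reconstruction itself is nothing magic, it is just an XOR over the
surviving blocks of the stripe. A tiny sketch, again with invented
block contents:

# Rebuilding the block of a failed drive in a RAID5 stripe:
# the missing block is the XOR of all surviving blocks of that
# stripe (the remaining data blocks plus the parity block).

from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

surviving = [
    bytes([0x11] * 8),   # block read from drive A
    bytes([0x33] * 8),   # block (parity) read from drive B
]
missing = reduce(xor_blocks, surviving)   # what failed drive C held
print("reconstructed block:", missing.hex())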

Example: You have your three 1.5 TB drives A, B, C in an array
and C fails. In this situation you'd want to treat your drives as
carefully as possible, because one more failure and all your data
is gone. Unfortunately, continued operation in the failed state will
put your remaining drives under much more stress than usual.

Reading will cause twice as many read operations on your remaining
drives.

block    : n   n+1 n+2
OK State : a   b   c  
Failstate: a   b   ab  

Writing (on a small array) will produce the same average load of two
reads and two writes per write (lowercase = read, uppercase = write):

block: n     n+1    n+2   
OK:    acAC  baBA   cbCB 
FAIL:  A     baBA   baB


Perhaps surprisingly, the per-drive read load doesn't change if
you have more than three drives in your array: reads will still
produce, on average, double the load in the failed state.

Writes on a failed array seem to produce the same load as on
an OK array. But this is only true for very small arrays.
If you add more disks you'll see that the "read penalty" grows
for blocks whose data disk is missing, because you need
to read all the other drives in order to update the parity.


Reconstruction of your array after adding a new drive will take
a long time, and most complete array failures (i.e. data lost
forever) occur during the rebuild phase, not while running in the
failed state. That's simply because you put a lot of stress on
your drives (which probably come from the same batch as the one
that already failed).

Depending on the number and nature of your drives and their
host connection, the limiting factor can be read
performance (you need to read X-1 drives completely) or
write performance, if your disks are slower at
sustained writing than at reading.

Remember that you need to read and write a whole disk's worth
of data, not just the used parts.

Example: Your drives hold 1.5 TB each, and we assume a whopping
100 MB/s on reads as well as writes (pretty much the
fastest there currently is).

You need to read 3 TB as well as write 1.5 TB. If your system can
handle the load in parallel, you can treat it as just writing one
1.5 TB drive: 1,500,000 MB / 100 MB/s / 60 s/min makes 250 minutes, or
4 hours and 10 minutes. I am curious whether you can still use the
system under such an I/O load. Anybody with experience on this? Anyway,
the reconstruction rate can be tuned via the proc fs.
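
Putting the example into numbers (the two proc entries named in the
comments are the usual md rebuild throttles on Linux; double-check the
paths on your kernel):

# Back-of-the-envelope rebuild time for the 3 x 1.5 TB raid5 example:
# reads can run in parallel, so treat it as writing one 1.5 TB drive
# at an assumed sustained 100 MB/s.

capacity_mb = 1_500_000   # 1.5 TB in MB
rate_mb_s   = 100         # assumed sustained write rate in MB/s

seconds = capacity_mb / rate_mb_s
print(f"~{seconds / 60:.0f} minutes (~{seconds / 3600:.1f} hours)")
# -> ~250 minutes, roughly 4 hours 10 minutes

# The md rebuild rate is throttled via the proc fs, typically:
#   /proc/sys/dev/raid/speed_limit_min
#   /proc/sys/dev/raid/speed_limit_max
# (KB/s per device; raising speed_limit_min speeds up the rebuild
#  at the cost of foreground I/O).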


Now for the raid 1+0 alternative: for the same resulting storage
capacity you'll need 4 drives instead of 3.

In the OK state one read command results in one read operation, and the
operation can be completed by either drive of the mirror set.
So seek performance will be much better, as the io-scheduler will select
the drive that is currently not busy and/or whose head is closer to the
requested block. Since you do mirroring and striping you can use all four
drives' performance for linear reading. You end up with 33% more read
performance than with the raid5 setup (but hey, you paid 33% more as well :-) )

Writing one block requires two write operations instead of two reads
and two writes. Since you don't need to read the old data before writing
the new stuff, you don't have to wait for the heads to move around, the
disk to rotate to the right place, and the read operation to get the data
from the disk into RAM first. You can simply write to the disk and let the
disk's controller handle the rest. In other words: your write performance
will be much better than with raid5.
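
Counting physical operations per logical single-block write makes the
difference plain (a deliberately simplistic model that ignores
full-stripe writes, caches and controller tricks):

# Physical disk operations per logical single-block write.

def raid5_single_write():
    # read old data + old parity, write new data + new parity
    return {"reads": 2, "writes": 2}

def raid10_single_write():
    # write the block to both members of its mirror set, no reads
    return {"reads": 0, "writes": 2}

print("raid5 :", raid5_single_write())
print("raid10:", raid10_single_write())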


In the failed state (let's assume drive C of A=B + C=D fails), reading
performance will drop by a quarter (three drives' worth of throughput
instead of four), as one drive is missing. The mirror
drive of C will have to handle that set's load by itself:

block: n   n+1 n+2 n+3 
ok:    a   c   b   d
fail:  a   d   b   d

This again assumes that the load is shared equally between the drives of
a mirror set, which is probably true for long sustained reads. In reality
the scheduler would select the drive that is currently not busy and/or
whose head is closer to the region you want to read. So if you are
reading two streams of data that are stored in different regions
of the disk, the disks in a raid5 array would have to do a lot of seek
operations, while the raid1+0 would keep one head on each stream's
location and only quietly jump from one track to the next (assuming
your disk is not heavily fragmented). If one of the two disks in a mirror
set fails, you'll have the heads jumping again.

Writing on an array with a failed drive produces the same load
for each individual drive, and the performance will also stay the same.

block: n   n+1 n+2 n+3
ok:    AB  CD  AB  CD
fail:  AB  D   AB  D

Rebuilding requires reading the surviving mirror drive and writing the new
one. So you'll need to read 1.5 TB and write 1.5 TB. It will take about the
same time but produce less system load than in the raid5 example, and only
one old disk will be put under a lot of stress instead of all remaining drives.

Btw: Your raid 1+0 array can handle two drive failures as long as they
don't occur in the same mirror set. So A and C or B and D could fail
and you'd still have all your data. Naturally Murphy's law applies: if
you continue reading from that array you will stress the single
remaining drive of the broken mirror set more than the others, and its
chances of failing will increase.

But if you are worried about double faults you might as well run raid6 
on those 4 drives ... but don't ask for performance there.

In all this I assume that you have a backup on another drive of all data
that you care about. If you don't, WHAT THE F*** ARE YOU DOING? You are
trusting your data to microscopic particles of rotating rust...

Use two of the three drives as a raid1 device that will quickly get your data
in and out, and use the third as a backup device that holds copies of the
data you care about. That way you are safe against single drive failure
and against stupid users/software, assuming that your backup drive is not
mounted/accessible all the time.

If you have a lot of data that you don't really care about, you can use two
of the three drives as a raid0 device and use the third to back up only the
data that is important to you.


I know you could use LVM to create one big volume group to manage all three
disks and create the logical volumes that you store important data on with
a "--mirrors" argument proportional to your paranoia, but this would still only
protect you from hardware failures. For protection against software/user
failures you'd need snapshots as well, and I don't like having to grow and
shrink them manually; plus it would still all be "online"
and vulnerable to typos in "dd" commands...

Enough time wasted... just one more thing: all those RAID thingies assume
that you trust your disks to fail cleanly, i.e. return nothing instead of
returning wrong data. If you wanted to protect against silent corruption you'd
have to forget about improved performance and instead be content with the
performance of your slowest drive: for each read you'd have to read a
block from each of your X drives in a raid array and compare the computed
parity with the one read from disk, or in the simple raid1+0 case you'd have to
read both copies and compare them.
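
For the raid1+0 case, "read both copies and compare them" would look
something like this toy sketch (the device paths are placeholders, and
reading raw members of a live array like this is for illustration only):

# Toy verification pass: read two mirror members chunk by chunk and
# report the first mismatch. Paths are placeholders, not something
# you should point at a mounted, live array.

import sys

CHUNK = 1 << 20   # 1 MiB per read

def compare(dev_a: str, dev_b: str) -> bool:
    with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
        offset = 0
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if block_a != block_b:
                print(f"mismatch near offset {offset}")
                return False
            if not block_a:       # both streams exhausted
                return True
            offset += len(block_a)

if __name__ == "__main__":
    sys.exit(0 if compare(sys.argv[1], sys.argv[2]) else 1)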

cheers
-henrik


_______________________________________________
vdr mailing list
vdr@xxxxxxxxxxx
http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
