On Tue, 25 Mar 2008, Peter Grandi wrote:
[ ... ]
I'd agree, a 12-14 disk raid6 is as high as I'd like to
go. This is mostly limited by rebuild times though; preferably you'd
stay within a day or two of single-parity "risk".
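(For scale, a rough lower bound on single-drive rebuild time; the 750GB size matches the drives discussed below and the sustained rates are just assumptions, with a loaded array taking considerably longer, which is how it stretches toward a day or two:)

    # Back-of-the-envelope single-drive rebuild time; the rates are assumptions.
    DISK_GB = 750
    for rate_mb_s in (25, 50, 80):
        hours = DISK_GB * 1000 / rate_mb_s / 3600
        print(f"{rate_mb_s:3d} MB/s -> {hours:4.1f} hours minimum")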
A day or two? That's quite risky. Never mind that you get awful
performance for that day or two and/or a risk of data corruption.
Neil Brown expressed a very cautionary thought on this mailing list
some weeks ago:
«It is really best to avoid degraded raid4/5/6 arrays when at all
possible. NeilBrown»
Yes, I read that mail. I've been meaning to do some real-world testing of
restarting degraded/rebuilding raid6 arrays from various vendors, including MD,
but haven't gotten around to it.
Luckily, computer crashes are another order of magnitude rarer than disk
failures on storage servers in our experience, so raid6 isn't a net loss
even assuming the worst case, if you have checksummed data.
Also, this might be the reason that, for instance, HP's raid cards require
a battery-backed cache for raid6: so that there won't be partially updated
stripes without a "better version" in the cache.
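To illustrate why that cache matters, here is a toy sketch (plain XOR parity, nothing vendor-specific) of the "write hole": if the crash lands between the data write and the parity write, the stripe is left internally inconsistent on disk.

    # Toy illustration of the RAID5/6 write hole with plain XOR parity.
    from functools import reduce

    def xor_parity(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    stripe = [bytes([i] * 4) for i in range(4)]   # 4 data blocks on disk
    parity = xor_parity(stripe)                   # parity block on disk

    stripe[1] = b"\xff" * 4                       # new data reaches the disk...
    # ...crash here, before the recomputed parity is written back.
    print(xor_parity(stripe) == parity)           # False: stripe is inconsistent

A battery-backed cache keeps the intended stripe contents across the crash, so the controller can finish (or redo) the whole update.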
* Large storage pools can only be reasonably built by using
multiple volumes across networks and on top of those some
network/cluster file system, [ ... ]
Or for that matter an application that can handle multiple
storage pools; much of the software that needs really large-scale
storage can itself split its data store between multiple locations. [
... ]
I imagine that you are thinking here also of more "systematic"
(library-based) ways of doing it, like SRM/SRB and CASTOR, dCache,
XrootD, and other grid style stuff.
Yes that's what I work with, and similar solutions in other industries (I
know the TV folks have systems where the storage is just spread out over a
bunch of servers and filesystems instead of trying to do a cluster
filesystem, using a database of some kind to keep track of locations).
Funny, my suggestion would definitely be raid6 for anything
except database(-like) load, that is anything that doesn't end
up as lots of small updates.
But RAID6 is "teh Evil"! Consider the arguments in the usual
http://WWW.BAARF.com/ or just what happens when you update one
block in a RAID6 stripe, with the RMW and parity recalculation
required.
I've read through most of baarf.com and I have a hard time seeing how it
applies. An update to a single block is a rare special case and can very well
take some time.
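For reference, a rough count of the I/O that the read-modify-write path implies for such a small update (generic RAID6 RMW accounting, not any particular implementation):

    # I/O for updating n data blocks of one RAID6 stripe via read-modify-write:
    # read the old data plus P and Q, then write the new data plus P and Q.
    def raid6_rmw_ios(n):
        return {"reads": n + 2, "writes": n + 2}

    print(raid6_rmw_ios(1))   # {'reads': 3, 'writes': 3}: 6 I/Os for one block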
My normal use case is to store large files
If the requirement is to store them as in a mostly-read-only cache,
RAID5 is perfectly adequate; if it is to store them as in writing
them out, parity RAID is not going to be good unless they are
written as a whole (full stripe writes) or write rates don't matter.
Well, yes, full stripe writes are the normal case. Either a program is
writing a file out or you get a file from the network. It is written to
disk, and there is almost always sufficient write-back caching to make it (at
least) full stripes (unless you have an insanely large stripe size, so that a
few hundred megs won't be enough for a few files).
You'll get a partial update at the file start, unless the filesystem is
stripe aligned, and one partial update at the end. The vast majority of
stripes in between (remember, I'm talking about large files, at least a
couple of hundred megs) will be full stripe writes.
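A quick sketch of that arithmetic, assuming a hypothetical 12+2 RAID6 with a 64KiB chunk (both numbers are just for illustration):

    # Full vs partial stripe writes for one large sequential file.
    CHUNK = 64 * 1024
    DATA_DISKS = 12
    STRIPE = CHUNK * DATA_DISKS                   # 768 KiB of data per stripe

    def stripe_writes(file_bytes, start_offset=0):
        head = (STRIPE - start_offset % STRIPE) % STRIPE   # unaligned start
        head = min(head, file_bytes)
        full, tail = divmod(file_bytes - head, STRIPE)
        return full, (1 if head else 0) + (1 if tail else 0)

    print(stripe_writes(200 * 1024**2))           # (266, 1): 266 full, 1 partial

So for a 200MB file essentially everything except the tail (and a possibly unaligned head) goes out as full-stripe writes.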
and having 60% more disks really costs a lot in both purchase price and
power for the same usable space.
Well, that's the usual argument... A friend of mine who used to be
a RAID sw developer for a medium vendor calls RAID5 "salesperson's
RAID" because of this argument.
But look at the alternatives, say for a 12-14 disk storage array
of 750GB disks, which are currently the best price/capacity (though not
necessarily the best price/power), resulting in this comparison:
Sounds like reasonable hardware to look at for a building block.
One RAID10 7x(1+1): 5.25TB usable.
Well, it has 40% less capacity than the others, but one gets
awesome resilience (especially if one has two arrays of 7 drives
and the mirror pairs are built across the two), including
surviving almost all 2-drive losses and most 3-drive losses; very
good read performance (up to 10-14 drives in parallel with '-p
f2') and very good write performance (7 drives with '-p n2'), all
exploitable in parallel; and very fast rebuild times impacting
only one of the drives, so the others have less chance of failing
during the rebuild. Also, there is no real requirement for the
file system code to carefully split IO into aligned stripes.
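(A quick check on the 2- and 3-drive claim, assuming seven fixed mirror pairs; this is just combinatorics, not a full reliability model:)

    # Fraction of random k-drive losses that are fatal to a 7x(1+1) RAID10.
    from itertools import combinations

    pairs = [(2 * i, 2 * i + 1) for i in range(7)]    # assumed mirror pairing
    def fatal(lost):
        return any(a in lost and b in lost for a, b in pairs)

    for k in (2, 3):
        combos = list(combinations(range(14), k))
        frac = sum(fatal(set(c)) for c in combos) / len(combos)
        print(f"{k}-drive loss fatal in {frac:.1%} of cases")
    # -> about 7.7% of 2-drive losses and about 23% of 3-drive losses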
One RAID6 12+2: 9.00TB usable.
Any 3-drive loss is catastrophic; a 1- or 2-drive loss causes a
massive rebuild involving the whole array, with the potential not
just for terrible performance but for extra stress on the other
drives, and further drive loss. Write performance is going to be
terrible, since for every N blocks written we have to read N+2 blocks
and write N+2 blocks, which is especially bad news if N is small; we
can avoid the reads only if N happens to be 12 and aligned. Read
performance (if we don't check parity), on the other hand, is going
to be pretty good.
Funny, when I do this, the write performance is typically 20-40% slower
than the 14-disk RAID0 on the same disks. Not quite as terrible as you
make it out to be.
The performance during rebuilds usually depends on the priority given to
the rebuild; some do it slowly enough that performance isn't really affected,
but then the rebuild usually takes much longer.
Rebuilding after a loss is not just going to be slow
and risk further losses, but also carries the risk of corruption.
And that's the really big issue to me.
To me these summaries mean that for rather less than double the
cost in raw storage (storage is cheap, though admittedly cooling/power is
less cheap) one gets a much better general-purpose storage pool
with RAID10, and in the other case one that is suited almost only to
large-file read-only caching.
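(To put numbers on the capacity side of that trade-off, with the same 14 x 750GB drives as above:)

    # Usable capacity of the layouts discussed above, with 750GB drives.
    DISK_TB = 0.75
    layouts = {
        "RAID10 7x(1+1)": 7 * DISK_TB,        # one copy per mirror pair
        "RAID6 12+2":     12 * DISK_TB,       # two drives' worth of parity
        "RAID0 x14":      14 * DISK_TB,       # no redundancy, for reference
    }
    for name, tb in layouts.items():
        print(f"{name:15s} {tb:5.2f} TB usable")
    # RAID6 yields about 71% more usable space from the same 14 drives.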
So for a bit less than double the cost, I can get the same capability,
perhaps with a little bit of extra performance that I can't make much use
of, because I only have a 1-2Gbit/s network connection to the host anyway.
The single volume property only matters if one wants to build a
really large (above 5TB) single physical filesystem, and that's not
something very recommendable, so it does not matter a lot for me.
I wouldn't count "above 5TB" as "really large"; if you are aiming for a
few PBs of aggregated storage, the management overhead of sub-5TB storage
pools is rather significant. If you substitute "over 15TB" I'd
agree (today, probably not next year :) ).
Of course, we'd be more likely to go for a good hardware raid6
controller that utilises the extra parity to make a good guess at
which data is wrong in the case of silent data corruption on a
single disk (unlike Linux software raid).
That's crazy talk. As already argued in another thread, RAID as
normally understood relies totally on errors being reported.
Otherwise one needs to use proper ECC codes, on reading too (and
background scrubbing is not good or cheap enough), and that's a
different type of design.
Oh, but it does work, as a background check, on some raid controllers. And
it does identify a parity mismatch, conclude that one of the disks is
[likely] wrong, and then update the data appropriately.
Of course, being storage hardware, you'd be lucky to see any mention of
this in any logs, when it should be loudly shouted that there were
mismatches.
But this works, in practice, today.
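For what it's worth, the underlying math allows this: with the usual RAID6 P+Q encoding over GF(2^8), the two parity syndromes are enough to locate (and fix) a single silently-corrupted data block. A toy byte-wise sketch of that idea (standard RAID6 algebra, not any particular controller's firmware):

    # Locate and repair one silently-corrupted data byte using P+Q syndromes.
    def gf_mul(a, b):                     # multiply in GF(2^8), polynomial 0x11d
        r = 0
        while b:
            if b & 1:
                r ^= a
            a = (a << 1) ^ (0x11D if a & 0x80 else 0)
            b >>= 1
        return r

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def syndromes(data, P, Q):            # diff recomputed parity against stored
        sp, sq = P, Q
        for i, d in enumerate(data):
            sp ^= d
            sq ^= gf_mul(gf_pow(2, i), d)
        return sp, sq

    data = [0x11, 0x22, 0x33, 0x44]       # one byte per data disk
    P, Q = syndromes(data, 0, 0)          # stored parity of the clean stripe

    data[2] ^= 0x5A                       # silent corruption on data disk 2
    sp, sq = syndromes(data, P, Q)        # sp = error value E, sq = g^z * E

    z = next(i for i in range(len(data)) if gf_mul(gf_pow(2, i), sp) == sq)
    data[z] ^= sp                         # repair the located block
    print("bad disk:", z, "value:", hex(data[z]))   # -> bad disk: 2 value: 0x33

Of course this only works if exactly one data block in the stripe is wrong and both parities are good, which is why it remains a "good guess" rather than a guarantee.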
Unless, of course, you can run ZFS which has proper checksumming
so you can know which (if any) data is still good.
ZFS indeed only does checksumming, not full ECC with rebuild.
Oh, but with a dual-parity raidz2 (~raid6) you have sufficient parity to
rebuild the correct data, assuming no more than 2 disks have failed (as in
silently returning bad data). You just try to stick the data together in
all the different ways until you get one with a matching checksum. Then you
can update parity/data as appropriate.
Same goes for n-disk mirrors: you just check until you find at least one
copy with a matching checksum, then update the rest to this data.
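A toy sketch of that "try the reconstructions until the checksum matches" idea, simplified to single XOR parity and one silently-bad disk (raidz2 does the same with two parities and up to two bad disks, and the mirror case is the same loop over whole copies instead of reconstructions; this is of course not ZFS code):

    # Checksum-guided repair: reconstruct each disk in turn from parity and
    # keep the reconstruction whose checksum matches the stored one.
    import hashlib
    from functools import reduce

    def xorb(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC"]               # data blocks on 3 disks
    parity = xorb(data)                              # XOR parity disk
    checksum = hashlib.sha256(b"".join(data)).digest()   # stored checksum

    data[1] = b"BBzB"                                # silent corruption

    for i in range(len(data)):                       # guess that disk i is bad
        candidate = list(data)
        candidate[i] = xorb([parity] + [d for j, d in enumerate(data) if j != i])
        if hashlib.sha256(b"".join(candidate)).digest() == checksum:
            data = candidate                         # checksum matches: repair
            break
    print(data)                                      # [b'AAAA', b'BBBB', b'CCCC']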
/Mattias Wadenstein