On Tue, 25 Mar 2008, Peter Grandi wrote:
* Single volume filesystems larger than 1-2TB require something like JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger than 5-10TB is not entirely feasible with any filesystem currently known (just think 'fsck' times) even if the ZFS people glibly say otherwise (no 'fsck' ever!).
The ZFS people do provide an fsck equivalent: it's called a "scrub" (or a "resilver" when rebuilding a device), which checks parity and checksums and repairs accordingly.
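Conceptually, that self-healing pass works roughly like the toy sketch below (illustrative Python, not ZFS's actual code; the block contents and function name are made up): the checksum of a block is stored with the pointer to it, so a scrub can tell which copy of a mirrored block is still good and rewrite the damaged one from it.

import hashlib

def scrub_block(copies, expected_sha256):
    # Toy self-healing read: find a copy that matches the stored checksum
    # and use it to rewrite any copies that don't.
    good = None
    for data in copies:
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            good = data
            break
    if good is None:
        raise IOError("unrecoverable: no copy matches its checksum")
    for i, data in enumerate(copies):
        if data != good:
            copies[i] = good           # repair the damaged copy in place
    return good

block = b"some file data"
checksum = hashlib.sha256(block).hexdigest()
mirror = [block, b"bit-rotted junk"]   # second copy silently corrupted
scrub_block(mirror, checksum)
assert mirror[1] == block              # bad copy rewritten from the good one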
* Single RAID volumes up to say 10-20TB are currently feasible, say as 24x(1+1)x1TB (for example with Thumpers). Beyond that I would not even try, and even that is a bit crazy. I don't think that one should put more than 10-15 drives at most in a single RAID volume, even a RAID10 one.
I'd agree, a 12-14 disk raid6 is as high as I'd like to go. This is mostly limited by rebuild times though; you'd preferably stay within a day or two of single-parity "risk".
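For a sense of scale, here is a back-of-envelope rebuild-time estimate in Python. The 50-80 MB/s rebuild rates are assumptions for an array still serving foreground load, not measurements:

# Rough rebuild time for a single failed 1 TB drive, assuming the
# rebuild streams at 50-80 MB/s while still serving foreground I/O.
capacity_bytes = 1e12
for rate_mb_s in (50, 80):
    hours = capacity_bytes / (rate_mb_s * 1e6) / 3600
    print(f"{rate_mb_s} MB/s -> {hours:.1f} h")
# 50 MB/s -> 5.6 h, 80 MB/s -> 3.5 h per disk; with heavy foreground
# load or larger drives this stretches toward the "day or two" above.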
* Large storage pools can only be reasonably built by using multiple volumes across networks and on top of those some network/cluster file system, and it matters a bit whether a single filesystem image is essential or not.
Or for that matter an application that can handle multiple storage pools; much of the software that needs really large-scale storage can itself split its data store between multiple locations. That way you can have reasonably small filesystems and stay sane.
* RAID5 (but not RAID6 or other mad arrangements) may be used if almost all accesses are reads, the data carries end-to-end checksums, and there are backups-of-record for restoring the data quickly, and then each array is not larger than say 4+1. In other words if RAID5 is used as a mostly RO frontend, for example to a large slow tape archive (thanks to R. Petkus for persuading me that there is this exception).
Funny, my suggestion would definitely be raid6 for anything except database(-like) loads, that is, anything that doesn't end up as lots of small updates. My normal use case is storing large files, and having ~60% more disks really costs a lot in both purchase price and power for the same usable space.
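A quick way to see the spindle cost is to compare raw-to-usable disk ratios for the layouts mentioned in this thread. The script below is illustrative only; the exact "extra disks" percentage depends on which raid6 width you compare against:

# Raw disks needed per unit of usable space, same usable capacity assumed.
layouts = {
    "raid10 (1+1 pairs)": 2.0,
    "raid5 (4+1)":        5 / 4,
    "raid6 (12+2)":       14 / 12,
}
base = layouts["raid6 (12+2)"]
for name, ratio in layouts.items():
    extra = (ratio / base - 1) * 100
    print(f"{name:20s} raw/usable = {ratio:.2f}  (+{extra:.0f}% disks vs 12+2 raid6)")
# raid10 needs on the order of 60-70% more spindles than a wide raid6
# for the same usable space, which is where the purchase and power
# cost comes from.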
Of course, we'd be more likely to go for a good hardware raid6 controller that uses the extra parity to make a good guess at which data is wrong in the case of silent data corruption on a single disk (unlike Linux software raid). Unless, of course, you can run ZFS, which has proper checksumming so you know which data (if any) is still good.
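For the curious, the reason dual parity can locate a single silently corrupted element is that the P syndrome gives the error value and the Q syndrome gives its position in GF(2^8). The sketch below is a simplified per-byte illustration (not md's or any controller's actual code) and assumes the corruption hit a data byte rather than P or Q itself:

# Minimal GF(2^8) tables for the polynomial 0x11d with generator g = 2,
# the same field the Linux md raid6 code uses.
GF_EXP = [0] * 512
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

def raid6_syndromes(data):
    # P = XOR of all data bytes, Q = sum over GF(2^8) of g^i * d_i
    p, q = 0, 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q

def locate_corruption(data, stored_p, stored_q):
    # Returns the index of a single silently corrupted data byte, or None.
    p, q = raid6_syndromes(data)
    sp, sq = p ^ stored_p, q ^ stored_q
    if sp == 0 and sq == 0:
        return None                 # everything consistent
    # error e at position z: sp = e, sq = g^z * e, so z = log(sq) - log(sp)
    return (GF_LOG[sq] - GF_LOG[sp]) % 255

# One byte per "disk" for illustration; real arrays do this per stripe.
disks = [0x11, 0x22, 0x33, 0x44]
p, q = raid6_syndromes(disks)
disks[2] ^= 0x5a                    # silent corruption on disk 2
print(locate_corruption(disks, p, q))   # -> 2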
A couple of relevant papers for inspiration on best practices by those that have to deal with this stuff:
https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257
http://indico.fnal.gov/contributionDisplay.py?contribId=43&sessionId=30&confId=805
And this is my use case. It might be quite different from, say, database storage or home directories.
/Mattias Wadenstein