>>> On Mon, 24 Mar 2008 16:17:29 +0100, Nagy Zoltan
>>> <kirk@xxxxxxxx> said:

> [ ... ] because the system is already up and running, I don't
> want to recreate the array [ ... ]

It looks like you feel lucky :-).

>> As a side note I am also curious why you go the RAID55 path
>> (I am not very impressed however :)

> okay - I've run through the whole scenario a few times - and
> always come back to RAID55; what would you do in my place? :)

Well, it all depends on what the array is for. However, from some of
the previous messages it looks like it is pretty big, probably with
several dozen drives in it across more than a dozen hosts. But it is
not clear what it is being used for, except that IIRC it is mostly
for reading. Things like access patterns (multithreaded or single
client?), file size profile, availability required, etc., matter too.

Anyhow, as to very broadly applicable if not optimal guidelines, I
would first apply Peter's (me :->) Law of RAID Level Choice:

* If you don't know what you are doing, use RAID10.

* If you know what you are doing, you are already using RAID10
  (except in a very few special cases).

To this I would add some general principles based on judgement calls
on my part, which several people seem to judge differently (good
luck to them!):

* Single-volume filesystems larger than 1-2TB require something like
  JFS or XFS (or Reiser4 or 'ext4' for the brave). Larger than
  5-10TB is not really feasible with any filesystem currently known
  (just think of the 'fsck' times), even if the ZFS people glibly
  say otherwise (no 'fsck' ever!).

* Single RAID volumes up to say 10-20TB are currently feasible, say
  as 24x(1+1)x1TB (for example with Thumpers; see the quick
  arithmetic sketch below). Beyond that I would not even try, and
  even that is a bit crazy. I don't think that one should put more
  than 10-15 drives at most in a single RAID volume, even a RAID10
  one.

* Large storage pools can only reasonably be built by using multiple
  volumes across networks, with some network/cluster filesystem on
  top of those, and it matters a bit whether a single filesystem
  image is essential or not.

So my suggestions are:

* For larger filesystems I would use multiple Thumpers (or
  equivalent) divided into multiple 2TB volumes, with a network
  filesystem like OpenAFS for home directories or a parallel network
  filesystem like Lustre for data directories.

* Multiple 2-4TB RAID10 volumes, each with a JFS/XFS filesystem
  exported via NFSv4, might be acceptable if single-filesystem-image
  semantics are not required.

* Consider the case for doing RAID10 over the network, for example
  by taking two Thumpers with 48 drives each, creating 48 RAID1
  pairs across the network using DRBD, and then creating 2-4TB RAID0
  volumes out of half a dozen of those pairs each.

* RAID5 (but not RAID6 or other mad arrangements) may be used if
  almost all accesses are reads, the data carries end-to-end
  checksums, and there are backups-of-record for restoring the data
  quickly, and then each array should be no larger than say 4+1. In
  other words, RAID5 can work as a mostly read-only frontend, for
  example to a large slow tape archive (thanks to R. Petkus for
  persuading me that this exception exists).
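As a sanity check on the sizing above, here is a rough
back-of-the-envelope sketch in Python (purely illustrative - the
function names and numbers are mine, not from any real tool) of what
the 24x(1+1)x1TB RAID10 example and a 4+1 RAID5 set actually leave
usable:

  # Back-of-the-envelope capacity helper (illustrative only).

  def raid10_usable_tb(drives, drive_tb):
      """RAID10: every drive is mirrored, so half the raw space is usable."""
      return drives // 2 * drive_tb

  def raid5_usable_tb(drives, drive_tb):
      """RAID5: one drive's worth of space per array goes to parity."""
      return (drives - 1) * drive_tb

  if __name__ == "__main__":
      # The 24x(1+1)x1TB Thumper-style example: 48 drives in 24 mirror pairs.
      print(raid10_usable_tb(48, 1), "TB usable from 48 drives as RAID10")
      # The largest RAID5 set suggested above: 4+1 with 1TB drives.
      print(raid5_usable_tb(5, 1), "TB usable from 5 drives as 4+1 RAID5")

Running it prints 24TB usable from 48 drives for the RAID10 case and
4TB from 5 drives for the 4+1 RAID5 set.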
A couple of relevant papers for inspiration on best practices from
those who have to deal with this stuff:

  https://indico.desy.de/contributionDisplay.py?contribId=26&sessionId=40&confId=257
  http://indico.fnal.gov/contributionDisplay.py?contribId=43&sessionId=30&confId=805

> I chose this way because:
>
> * hardware RAID controllers are expensive - because of this I
>   prefer having a cluster of machines (average cost per MB shows
>   that this is the 'cheapest' solution); this solution's impact on
>   average cost is about 20-25% compared to a single stand-alone
>   disk - 40-50% if I count only usable storage

That's strange, especially as iSCSI host adapters are not exactly
cheaper than SAS/SATA ones.

> * as far as I know other RAID configurations take a bigger piece
>   from the cake
>
>   - RAID10 and RAID01 both halve the usable space; simply creating
>     a RAID0 array at the top level could suffer complete
>     destruction if a node fails (in some rare cases the power
>     supply can take everything along with it)

Check out http://WWW.BAARF.com/ for the "but RAID10 is not cost
effective" argument :-).

>   - RAID05 could be a reasonable choice providing n*(m-1) space,
>     but in case of failure a single disk would trigger a
>     full-scale rebuild

Try to imagine what happens when 2 disks fail, either in two
different leaves or in the same leaf. Oops. (There is a rough sketch
of the arithmetic at the end of this message.)

> * RAID55 - considering an array of n*m disks, it gives
>   (n-1)*(m-1) usable space with the ability to detect failing
>   disks and repair them while the cluster is still online - I can
>   even grow it without taking it offline! ;)

Assuming that there are no downsides :-) this makes perfect sense.

>   and at the leaves the processing power required for the RAID is
>   already there... why not use it? ;)

What processing power? RAID on current CPUs is almost trivial, and
with multilane PCIe and multibank fast DDR2 even bandwidth is not a
big deal.

> * this is because with iSCSI I can detach the node, and when I
>   reattach the node its size is redetected

Sure, and much good it does you to have nodes of different sizes in
a RAID5 :-). Anyhow, SAS/SATA is usually plug-and-play too.

[ ... ]

> * an alternative solution could be to drop the top-level RAID5 and
>   replace it with unionfs, by creating individual filesystems -
>   there is an interesting thing about RAIDing filesystems (RAIF)

This would be a bit better, see above. Though I wonder why one would
need 'unionfs', as one could mount all the lower-level volumes'
filesystems into subdirectories. That may not be acceptable, but
then 'unionfs' would not allow things like cross-filesystem hard
links either, so not a big deal.

[ ... ]

> * this cluster could scale up at any time by assimilating new
>   nodes ;)

Assuming that there are no downsides to that, fine ;-).
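PS: for anyone who wants to put rough numbers on the n*(m-1) vs.
(n-1)*(m-1) claims above, here is a small illustrative Python sketch
(my own, with a made-up example of 4 nodes x 6 disks; it reads
"RAID05" as RAID0 striped over RAID5 leaves, which is the reading
that matches the quoted n*(m-1) figure):

  # Illustrative only: usable space for an n x m disk cluster under
  # the layouts discussed in this thread, plus the chance that a
  # second random disk failure lands in the same leaf as the first.

  def usable_disks(n, m, layout):
      """Usable disk count for n leaves of m disks each."""
      if layout == "raid05":        # RAID0 across n RAID5 leaves
          return n * (m - 1)
      if layout == "raid55":        # RAID5 across n RAID5 leaves
          return (n - 1) * (m - 1)
      if layout == "raid10":        # mirror pairs
          return (n * m) // 2
      raise ValueError("unknown layout: " + layout)

  def p_second_failure_same_leaf(n, m):
      """With one disk already dead, probability that the next random
      failure hits the same leaf (fatal for RAID0 over RAID5 leaves)."""
      return (m - 1) / (n * m - 1)

  if __name__ == "__main__":
      n, m = 4, 6                   # made-up example: 4 nodes x 6 disks
      for layout in ("raid05", "raid55", "raid10"):
          print(layout, usable_disks(n, m, layout), "of", n * m,
                "disks usable")
      print("P(2nd failure in same leaf) =",
            round(p_second_failure_same_leaf(n, m), 2))

With those numbers that comes out at 20, 15 and 12 usable disks
respectively, and a better than 1-in-5 chance that a second failure
hits the same leaf; the other case, two failures in different
leaves, leaves two degraded leaves with no parity protection left.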