On 1/12/2012 6:47 AM, Peter Grandi wrote:
> [ ... ]
>
>>> That to me sounds a bit too fragile; RAID0 is almost always
>>> preferable to "concat", even with AG multiplication, and I
>>> would be avoiding LVM more than avoiding MD.
>
>> This wholly depends on the workload. For something like
>> maildir RAID0 would give you no benefit as the mail files are
>> going to be smaller than a sane MDRAID chunk size for such an
>> array, so you get no striping performance benefit.
>
> That seems to me unfortunate argument and example:
>
> * As an example, putting a mail archive on a RAID0 or 'concat'
> seems a bit at odds with the usual expectations of availability
> for them. Unless a RAID0 or 'concat' over RAID1. Because anyhow
> 'maildir' mail archive is a horribly bad idea regardless
> because it maps very badly on current storage technology.

WRT availability, both are identical to textbook RAID10: half the
drives can fail as long as no two are in the same mirror pair. In
essence, RAID0 over mirrors _is_ RAID10.

I totally agree with your maildir sentiments: far too much physical
IO (metadata) is needed to do the same job as mbox, Dovecot's mdbox,
etc. But maildir is still extremely popular and in wide use, and will
be for quite some time.

> * The issue of chunk size is one of my pet peeves, as there is
> very little case for it being larger than file system block
> size. Sure there are many "benchmarks" that show that larger
> chunk sizes correspond to higher transfer rates, but that is
> because of unrealistic transaction size effects. Which don't
> matter for a mostly random-access shared mail archive, never
> mind a maildir one.

I absolutely agree, which is exactly why the concat makes sense for
such workloads. From a 10,000 ft view it is little different from
having a group of mirror pairs, putting a filesystem on each, and
manually spreading one's user mailboxen over those filesystems. XFS
over a concat simply takes the manual spreading aspect out of this,
and yields pretty good transaction load distribution.

> * Regardless, an argument that there is no striping benefit in
> that case is not an argument that 'concat' is better. I'd still
> default to RAID0.

The issue with RAID0 or RAID10 here is tuning XFS. With a striped
array XFS works best if sunit/swidth match the stripe geometry of the
array, as it attempts to pack a full stripe's worth of writes before
pushing them down the stack. This works well with large files, but it
can be an impediment to performance with lots of small writes, and
free space fragmentation becomes a problem as XFS attempts to
stripe-align all writes. So with maildir you often end up with lots
of partial stripe writes, each to a different stripe. Once an XFS
filesystem ages sufficiently (i.e. fills up), more head seeking is
required to write files into the fragmented free space. At least,
this is my understanding from Dave's previous explanation.

Additionally, when using a striped array, all XFS AGs are striped
down the virtual cylinder that is the array. So when searching a
large directory btree, you may generate seeks across all drives in
the array to find a single entry. With a properly created XFS on a
concat, AGs are aligned and wholly contained within a mirror pair.
And since all files created within an AG have their metadata within
that same AG, any btree walking is done only on an AG within that
mirror pair, reducing head seeking to one mirror pair vs all drives
in a striped array.
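To put some flesh on that, the two layouts I keep referring to would
be built roughly along these lines. Device names, drive count and
chunk size are just for illustration, and recent mkfs.xfs usually
detects md stripe geometry on its own, so take this as a sketch, not
a recipe:

  # Striped case: 8 drives as mdraid RAID10, XFS aligned to the stripe
  mdadm --create /dev/md0 --level=10 --raid-devices=8 --chunk=64 /dev/sd[b-i]
  mkfs.xfs -d su=64k,sw=4 /dev/md0   # sw = 4 data spindles (8 drives, 2-way mirrors)

  # Concat case: 4 RAID1 pairs joined linearly, one AG per pair
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
  mdadm --create /dev/md5 --level=linear --raid-devices=4 /dev/md1 /dev/md2 /dev/md3 /dev/md4
  mkfs.xfs -d agcount=4 /dev/md5     # the AGs line up with the mirror pairs

In the concat case there is no sunit/swidth to worry about at all;
the only knob that matters is agcount, which you want to be a
multiple of the number of concat members so the AG boundaries line up
with them.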
The concat setup also has the advantage that per-drive read-ahead is
more likely to cache blocks that will actually be needed shortly,
i.e. the next file in an inbox, whereas with a striped array it's
very likely that the next few blocks contain a different user's mail,
a user who may not even be logged in.

> * Consider the dubious joys of an 'fsck' or 'rsync' (and other
> bulk maintenance operations, like indexing the archive), and
> how RAID0 may help (even if not a lot) the scanning of metadata
> with respect to 'concat' (unless one relies totally on
> parallelism across multiple AGs).

This concat setup is specific to XFS and only XFS. It is useless with
any other (Linux, anyway) filesystem, because no other uses an
allocation group design, nor can any derive meaningful parallelism in
the absence of striping.

> Perhaps one could make a case that 'concat' is no worse than
> 'RAID0' if one has a very special case that is equivalent to
> painting oneself in a corner, but it is not a very interesting
> case.

It is better than a RAID0/10 stripe for small-file random IO
workloads, for the reasons above.

>> And RAID0 is far more fragile here than a concat. If you lose
>> both drives in a mirror pair, say to controller, backplane,
>> cable, etc failure, you've lost your entire array, and your
>> XFS filesystem.
>
> Uhm, sometimes it is not a good idea to structure mirror pairs so
> that they have blatant common modes of failure. But then most
> arrays I have seen were built out of drives of the same make and
> model and taken out of the same carton....

I was demonstrating the worst-case scenario that could take down both
array types, and the fact that when using XFS on both, you lose
everything with RAID0, but can likely recover to a large degree with
the concat, specifically because of the allocation group design and
how the AGs are physically laid down on the concat disks.

>> With a concat you can lose a mirror pair, run an xfs_repair and
>> very likely end up with a functioning filesystem, sans the
>> directories and files that resided on that pair. With RAID0
>> you're totally hosed. With a concat you're probably mostly
>> still in business.
>
> That sounds (euphemism alert) rather optimistic to me, because it
> is based on the expectation that files, and files within the same
> directory, tend to be allocated entirely within a single segment
> of a 'concat'.

This is exactly the case. With 16x 1TB drives in an mdraid linear
concat with XFS and 16 AGs, you get exactly one AG on each drive. In
practice one would probably want 2 AGs per drive in this case, as
files are clustered around their directories. With a small-file
random IO workload this decreases head seeking between the directory
write op and the file write op, which typically occur in rapid
succession.

> Even with distributing AGs around for file system
> types that support that, that's a bit wistful (as is the
> expectation that AGs are indeed wholly contained in specific
> segments of a 'concat').

No, it is not, and yes, they are.

> Usually if there is a case for a 'concat' there is a rather
> better case for separate, smaller filesystems mounted under a
> common location, as an alternative to RAID0.

Absolutely agreed, for the most part. If the application itself has
the ability to spread the file transaction load across multiple
directories, this is often better than relying on the filesystem to
do it automagically. And if you lose one filesystem for any reason
you've only lost access to a portion of your data, not all of it.
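As a rough sketch of that alternative, with device names and mount
points purely illustrative:

  # One small XFS per mirror pair, mounted under a common location
  for i in 1 2 3 4; do
      mkfs.xfs /dev/md$i
      mkdir -p /srv/mail/$i
      mount /dev/md$i /srv/mail/$i
  done

Spreading users across /srv/mail/1 through /srv/mail/4 is then the
application's job, and losing one pair takes out only the users who
live on it.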
The minor downside is managing multiple filesystems instead of one,
but that is not a big deal really, given the extra safety margin. In
the case of the maildir workload, Dovecot, for instance, allows
specifying a mailbox location on a per-user basis. I recall one
Dovecot OP who is doing exactly this with 16 mirror pairs and 16 EXTx
filesystems atop them. IIRC he was bitten more than once by single
large hardware RAID setups going down; I don't recall the specifics.
Check the Dovecot list archives.

> It is often a better case because data is often partitionable,
> there is no large advantage to a single free space pool as most
> files are not that large, and one can do fully independent and
> parallel 'fsck', 'rsync' and other bulk maintenance operations
> (including restores).

Agreed, if the data set can be partitioned and if your application
permits doing so. Some do not.

> Then we might as well get into distributed partitioned file
> systems with a single namespace like Lustre or DPM.

Lustre wasn't designed for, nor is it suitable for, high-IOPS,
low-latency, small-file workloads, which is, or at least was, the
topic we are discussing. I'm not familiar with DPM. Most distributed
filesystems aren't suitable for this type of workload due to multiple
kinds of latency.

> But your (euphemism alert) edgy recovery example above triggers a
> couple of my long standing pet peeves:
>
> * The correct response to a damaged (in the sense of data loss)
> storage system is not to ignore the hole, patch up the filetree
> in it, and restart it, but to restore the filetree from backups.
> Because in any case one would have to run a verification pass
> against backups to see what has been lost and whether any
> partial file losses have happened.

I believe you missed the point, and are making some incorrect
assumptions WRT SOP in this field and the wherewithal of your
colleagues. In my concat example you can likely be back up and
running "right now", with some loss, _while_ you
troubleshoot/fix/restore. In the RAID0 scenario you're completely
down _until_ you troubleshoot/fix/restore.

Nobody is going to slap a band-aid on and "ignore the hole". I never
stated nor implied that. I operate on the assumption that my
colleagues here know what they're doing for the most part, so I don't
expend extra unnecessary paragraphs on SOP minutiae.

[snipped]

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html