Hi all,

A number of participants on this list have speculated about the possibility of running higher level raid on a cluster, to provide lower redundancy and greater throughput than simple cluster mirroring. This is a natural idea, and I am pleased to be able to say that it is now a reality here on my workbench.

This project started life as a cluster mirror, back before Jon decided to try extending the existing dm-raid1 driver to work on a cluster. It got mothballed at that time, but luckily I had already promised a cluster raid paper for LCA next month. With that deadline looming, I unmothballed it and went to work. To add spice, I thought I would try to extend it to a higher raid level.

Well, raid 5 is really hard to implement under the best of conditions, but on a cluster it gets really crazy due to caching issues[1]. Raid 2 and 3 are generally considered impractical to implement in software because of the necessary per-byte parity handling. Fortunately, there is a variant I investigated a couple of years ago and published a paper on, which I call raid 3.5. This approach restricts the number of disks in an array to 2**k + 1 and divides each filesystem block into a binary number of fragments, which are striped across the array. You can think of this as sub-block striping, as opposed to raid 4 and 5, which use multi-block striping. (A small sketch of the fragment layout appears below.)

The big advantage of raid 3.5 over raid 4 or 5 is that each parity block does not tie together multiple data blocks, so no caching is needed: each data block can be written directly to or read directly from the array, along with its associated parity. Here is a nice picture of how raid 3 works:

   http://www.acnc.com/04_01_03.html

Raid 3.5 is much like that, except for the way data blocks are striped across disks, which is designed to make it easy to implement in software.

On a single node, the main attraction of raid 3.5 is that it is much easier to implement than raid 4 or 5. On a cluster, however, raid 3.5 really comes into its own: decoupling the parity blocks from unrelated data blocks is a fundamental advantage. It makes per-block synchronization, which would be complex and costly, unnecessary. More importantly, there is no need to read other blocks of a parity set before updating, which would create a horrible new level of data caching and cluster synchronization issues.

The main theoretical disadvantage of raid 3.5 versus raid 5 is lower transaction throughput, because each filesystem transaction involves all disks of the array, whereas raid 5 can sometimes seek in parallel for parallel filesystem transactions. This problem disappears on a cluster, because you can fill each cluster data node with many disks and stripe them together, yielding the same transaction parallelization characteristics as raid 4 or 5.

Another disadvantage of raid 3.5 is the limited number of supported array configurations. Each 4K filesystem block can only be divided into eight pieces before hitting the 512 byte Linux sector addressing limit, which means that for practical purposes a raid 3.5 array can only have 3, 5 or 9 members[2]. Well, that is not a big problem when you consider that, until now, you could not do cluster level data redundancy at all, other than mirroring.
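To make the sub-block striping concrete, here is a minimal userspace sketch of the fragment layout described above. This is not the ddraid code, and the exact fragment ordering there may differ; it just shows how a 4K block splits into 2**k data fragments plus a bytewise XOR parity fragment for a 2**k + 1 member array, and why the 512 byte sector limit caps the member count at 9.

/*
 * Illustrative sketch only -- not ddraid itself.  A 4K filesystem block
 * is split into 2**k fragments of (4096 >> k) bytes; fragment i goes to
 * data member i and the bytewise XOR of all fragments goes to the parity
 * member.  Fragment ordering here is just one plausible choice.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* one filesystem block */
#define SECTOR_SIZE 512   /* Linux sector addressing limit */

static void encode_block(const uint8_t *block, unsigned k,
                         uint8_t frag[][BLOCK_SIZE], uint8_t *parity)
{
	unsigned nfrags = 1u << k;            /* number of data members */
	unsigned fragsize = BLOCK_SIZE >> k;  /* must stay >= SECTOR_SIZE, so k <= 3 */

	memset(parity, 0, fragsize);
	for (unsigned i = 0; i < nfrags; i++) {
		memcpy(frag[i], block + i * fragsize, fragsize);
		for (unsigned j = 0; j < fragsize; j++)
			parity[j] ^= frag[i][j];      /* bytewise XOR parity */
	}
}

int main(void)
{
	static uint8_t block[BLOCK_SIZE], frag[8][BLOCK_SIZE], parity[BLOCK_SIZE];
	unsigned k = 2;                       /* 2**2 + 1 = 5 member array */

	for (unsigned i = 0; i < BLOCK_SIZE; i++)
		block[i] = (uint8_t)i;
	encode_block(block, k, frag, parity);

	/* Any one lost fragment can be rebuilt by XORing the survivors with
	   the parity fragment, which is what a degraded read would do. */
	printf("members: %u, fragment size: %u bytes\n",
	       (1u << k) + 1, BLOCK_SIZE >> k);
	return 0;
}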
Now to put this in perspective. The availability of higher level cluster raid gives us a new type of cluster, a "distributed data" cluster, as opposed to what GFS currently offers, which I call a "shared data" cluster. With a distributed data cluster, you can unplug one of the data nodes and the cluster will continue running without any data loss and with negligible loss of performance. Ironically, many naive observers believe that this is exactly the way a cluster filesystem is supposed to work, and we must disappoint them by explaining that you actually need to share a disk. Well, now we have the means to cater to that common fantasy and avoid the disappointment.

Note that the number of data nodes in a distributed data cluster is independent of the total number of nodes. You could, for example, have a 100 node cluster with 5 data nodes. This is superior to running the cluster off a mirrored disk: storage redundancy is reduced from 50% to 20%, and maximum read throughput is increased from 2X to 4X. Unlike a mirror, raid 3.5 increases maximum write throughput as well.

It turned out that raid 3.5 was quite easy to graft on top of my cluster mirror prototype; that only took a couple of days. Except for one thing: handling multi-page bio transfers is really tedious. Going from single page bio transfers to multipage transfers has kept me busy for the better part of a week, and I am still putting the finishing touches on it. But the thing is, now a single bio can handle a massive amount of contiguous filesystem data. Even better, the parity calculations for all the eensy weensy little block fragments line up nicely and coalesce into contiguous runs inside each multipage bio. Though you would think raid 3.5 makes a real hash of transfer contiguity, actually it does not. In essence, I finesse the problem onto the scatter gather hardware[3]. I have tested up to 1 megabyte contiguous transfers, which work very well[4].

I have not yet had time to test throughput in a realistic environment, so I cannot report performance numbers today. Hopefully tomorrow. Code will be in cvs on Wednesday, I think.

This is a new project, tentatively called "ddraid", where "dd" means "distributed data". A project page will appear in the fullness of time, and of course there is that LCA paper, which I will present next month in Canberra.

The current code is production quality in a number of interesting respects, particularly the main read/write path. I have implemented read-with-verify on the read path, where each read transaction checks the parity fragments to ensure nothing has gone wrong in hardware or software. This works nicely, and I wonder why raid implementations do not typically provide such an option.
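For the curious, the read-with-verify check boils down to a simple XOR comparison. A minimal sketch (illustrative only, not the actual ddraid read path):

/*
 * Read-with-verify, in essence: recompute the XOR of the data fragments
 * returned by a read and compare it against the parity fragment read
 * alongside them.  A mismatch means something went wrong in hardware or
 * software somewhere between the write and the read.
 */
#include <stdint.h>
#include <stddef.h>

/* Returns 0 if the parity agrees with the data fragments, -1 on mismatch. */
int verify_read(const uint8_t *const frag[], unsigned nfrags,
                const uint8_t *parity, size_t fragsize)
{
	for (size_t j = 0; j < fragsize; j++) {
		uint8_t x = 0;
		for (unsigned i = 0; i < nfrags; i++)
			x ^= frag[i][j];      /* XOR the j'th byte of every data fragment */
		if (x != parity[j])
			return -1;            /* parity mismatch: fail the read */
	}
	return 0;
}

The same XOR identity is what degraded reads will rely on to rebuild a missing fragment from the survivors, once that path is finished (see the list below).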
I built the ddraid implementation using the same template (and code) as the cluster snapshot, which saved an immense amount of time and hopefully will save maintenance time as well. Anybody who is interested in how that works can find reasonably up to date documentation here:

   http://sourceware.org/cluster/csnap/cluster.snapshot.design.html
   http://sourceware.org/cluster/csnap/csnap.ps

Like the cluster snapshot, ddraid has a userspace synchronization server, and like any cluster server (userspace or not) this raises difficult memory deadlock issues. Hopefully, after LCA I can sit down and solve these issues definitively for these and other cluster server components.

Due to time pressure, I must leave some work hanging until after LCA:

  * Degraded read (reconstruct from parity)
  * Degraded write
  * Parity reconstruction (currently can only reconstruct raid 1)
  * Some aspects of failover

OK, thanks a lot for reading this far, time to get back to work. I will provide further details in a couple of days.

[1] To update one block of a raid 4 or 5 parity set, at least two parity set members must first be read (so that the new parity can be computed as old parity XOR old data XOR new data), and other nodes must be prevented from modifying these members before the update is completed.

[2] Two member raid 3.5 is equivalent to raid 1.

[3] It will be interesting to see how well the scatter gather hardware holds up under the load of tens of thousands of tiny transfer fragments per second.

[4] The only real limit to transfer size is the size of a region in the dirty log, which persistently records which parts of the data array need to be reconstructed if the system fails at any given instant. Typical dirty log region size runs from a few KB to a few MB.

Regards,

Daniel