Hi all,

OK, it took quite a bit longer to get basic throughput numbers than I had hoped, because I ran into a nasty performance bug and thought I'd better do something about it first. This is solved, and performance is now looking pretty good.

I don't have a suitable high-performance cluster at hand, so I am working with a scsi array on a local machine. However, the full cluster synchronization stack is used, including a cluster-aware persistent dirty log (which remembers which regions had in-flight writes in order to limit post-crash resyncing). Though I have five disks available, I used one for the persistent dirty log, so I am restricted to an order-one ddraid array of three disks at the moment. One of the disks is a dedicated parity disk, so the best I could hope for is a factor-of-two throughput increase versus a single raw disk.

For larger transfer sizes, throughput does in fact double compared to raw IO to one of the scsi disks. For small transfer sizes, the overhead of parity calculations, dirty logging and bio cloning becomes relatively large versus raw disk IO, so breakeven occurs at about a 64K transfer size. Below that, a single raw disk is faster; above it, the ddraid array is faster. With 1 MB transfers, the ddraid array is nearly twice as fast.

I tried various combinations of disabling the parity calculations and dirty logging to see how the write overheads break down:

- No persistent dirty log, no parity calculations: tie at 8K; almost twice as fast at 32K and above
- Parity calculations, no persistent dirty log: tie between 8K and 16K
- Persistent dirty log, no parity calculations: tie at 16K
- Parity calculations and persistent dirty log: tie at 64K

We see from this that dirty logging is the biggest overhead, which is no surprise. After that, the overheads of parity calculation and basic bookkeeping seem about the same. The parity calculations can easily be optimized; the bio bookkeeping overhead will be a little harder. There are probably a few tricks remaining to reduce the dirty log overhead.

The main point, though, is that even before a lot of optimization, performance looks good enough for production use. Some loads will be a little worse, some loads will perform dramatically better. Over time, I suspect various optimizations will reduce the per-transfer overhead considerably, so that the array is always faster. Whether or not the array is faster, it certainly is more redundant than a raw disk. This was my primary objective, and increased performance is just a nice fringe benefit. Of course, with no other cluster raid implementation to compete with, this is by default the fastest one :-)

Performance notes:

- Scatter/gather worries turned out to be unfounded. The scsi controller handles the thousands of sg regions I throw at it per second without measurable overhead. Just in case, I investigated the overhead of copying the IO into linear chunks via memcpy, which turned out to be small but noticeable. Gnbd and iscsi do network scatter/gather, which I haven't tested yet, but I would be surprised if there is a problem.
- Read transfers have no dirty log overhead.
- For reliability, I check parity on read. Later this will be an option, so read transfers don't necessarily have parity overhead either.
- I think I may be able to increase read throughput to N times single disk throughput, instead of N-1.
- I haven't determined where the bookkeeping overhead comes from, but I suspect most of it can be eliminated.
- The nasty performance bug: releasing a dirty region immediately after all writes complete is a bad idea that really hammers the performance of back-to-back writes. There is now a timer delay on release for each region, which cures this.
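
To make that last point concrete, here is a minimal userspace sketch of the idea, not the actual ddraid or dm dirty log code: a region is logged dirty before the first write lands in it, and instead of being logged clean the instant the last write completes, it is held until a hold-off deadline passes. The region count, hold-off time and function names are just placeholders for illustration.

    #include <stdbool.h>
    #include <time.h>

    #define REGIONS   1024          /* hypothetical region count */
    #define HOLD_SECS 2             /* hypothetical release delay */

    struct region {
            int    inflight;        /* writes currently outstanding */
            bool   dirty;           /* currently recorded dirty in the log */
            time_t release_at;      /* earliest time we may log it clean */
    };

    static struct region map[REGIONS];

    /* stand-in for persisting one log record to the log disk */
    static void log_persist(int r, bool dirty) { (void)r; (void)dirty; }

    /* called before a write to region r is submitted */
    void region_write_begin(int r)
    {
            struct region *rg = &map[r];

            if (!rg->dirty) {       /* first write: log the region dirty up front */
                    rg->dirty = true;
                    log_persist(r, true);
            }
            rg->inflight++;
    }

    /* called when a write to region r completes: do not log it clean yet,
     * just arm (or push back) the release deadline */
    void region_write_end(int r)
    {
            struct region *rg = &map[r];

            if (--rg->inflight == 0)
                    rg->release_at = time(NULL) + HOLD_SECS;
    }

    /* periodic tick (a kernel timer in the real thing): regions that have
     * stayed idle past their deadline are finally logged clean */
    void region_log_tick(void)
    {
            time_t now = time(NULL);
            int r;

            for (r = 0; r < REGIONS; r++) {
                    struct region *rg = &map[r];

                    if (rg->dirty && rg->inflight == 0 && now >= rg->release_at) {
                            rg->dirty = false;
                            log_persist(r, false);
                    }
            }
    }

A write that arrives during the hold-off finds the region already dirty and costs no extra log update, which is exactly the back-to-back case that was getting hammered.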
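
And for the parity overhead discussed above, a minimal sketch of the math on a 2+1 array (two data disks plus the dedicated parity disk), again not the ddraid code; the 4K chunk size is an assumption for illustration. The same XOR routine serves for generating parity, checking it on read, and rebuilding a lost chunk, and doing it a machine word at a time is the sort of easy optimization available here.

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_WORDS (4096 / sizeof(unsigned long))  /* hypothetical 4K chunks */

    /* out = a ^ b, a machine word at a time rather than byte by byte */
    static void xor_chunks(const unsigned long *a, const unsigned long *b,
                           unsigned long *out)
    {
            size_t i;

            for (i = 0; i < CHUNK_WORDS; i++)
                    out[i] = a[i] ^ b[i];
    }

    int main(void)
    {
            unsigned long d0[CHUNK_WORDS], d1[CHUNK_WORDS];
            unsigned long parity[CHUNK_WORDS], rebuilt[CHUNK_WORDS];
            size_t i;

            for (i = 0; i < CHUNK_WORDS; i++) {     /* arbitrary test data */
                    d0[i] = rand();
                    d1[i] = rand();
            }

            /* what the dedicated parity disk stores for this stripe */
            xor_chunks(d0, d1, parity);

            /* losing d0 is recoverable because XOR is its own inverse:
             * d0 = parity ^ d1 -- the same routine does the degraded-mode rebuild */
            xor_chunks(parity, d1, rebuilt);
            assert(memcmp(d0, rebuilt, sizeof d0) == 0);

            return 0;
    }

Checking parity on read, as mentioned in the notes, is the same XOR over the data chunks compared against what the parity disk returned.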

This code is also capable of running an n-way cluster mirror, but that feature is broken at the moment. I'll restore it next week and we can check mirror performance as well.

I need to add degraded mode IO and parity reconstruction before this is actually useful, which I must put off until after LCA. Code cleanup is in progress and should land in cvs Monday or Tuesday. More benchmark numbers and hopefully some pretty charts are on the way.

Regards,

Daniel