On 2013-05-31T10:58:35, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:

Hi Jon,

thanks for the response.

> What is your test set-up in which you are seeing the performance hit?

We saw it essentially as soon as we used "lvcreate -m1" on an LV; the
performance drop was about 70% compared to -m0. Exclusive activation
would bring it back to ~95%. (I wonder whether that could be related to
still using an older lvm2 user-space. Hm. Need to benchmark again with
the latest.) A rough outline of the comparison is sketched at the end
of this mail.

Actual concurrency via OCFS2 would of course have a worse impact.

I'm not quite sure where the split is; many use cases are "one active
process, but they want the LV to be visible everywhere and the election
to be automatic", but a fair share is "concurrent processes/OCFS2/GFS2,
replacing traditional expensive SAN mirroring (while not being ready
for ceph/gluster)" too.

The first part is reasonably trivial - enforce the single opening
process via an exclusive lock automatically, and avoid the entire rest
of the cluster overhead. Optimizing the latter case is harder.

> I have given some thought to making MD RAID1 cluster-aware. (RAID10
> would come for free, but RAID4/5/6 would be excluded.)

Yes, I guess that covers 99% of the interesting pieces anyway.

> would then make use of this code via the dm-raid.c wrapper. My idea
> for the new implementation would have been to keep a separate bitmap
> area for each machine. This way, there would be no locking and no
> need to keep the log state collectively in-sync during nominal
> operation.

Yes, I can see how this could work. (Similar to how OCFS2/GFS2 keep
per-node journals, too.)

It could be a bit tricky to get to the point where you could do full
read-balancing. And, of course, this whole complexity is only needed
for truly concurrent IO.

> When machines come, go or fail, their bitmaps would have
> to be merged and responsibility for recovery/initialization/scrubbing
> would have to be decided.

Right, but that's easy enough.

> Additionally, handling device failures is
> more tricky in MD RAID. This is because MD RAID (and by extension,
> the device-mapper targets that leverage it) simply marks a device as
> failed in the superblock and keeps working while DM "mirror" blocks
> I/O until the failed device is cleared.

I guess that could be enhanced, though.

> This makes a difference in the cluster because one machine may suffer
> a device failure due to connectivity and another machine may not.

Yes, understood. And we then want to degrade as gracefully as we can,
too.

(I sometimes wonder whether, depending on the interconnect in the
cluster, the "ship all writes to a central process, read locally, fail
that process over" approach isn't cheaper. I'm pretty sure that for
read-intensive workloads it probably is.)

> If the machine suffering the failure simply marks the failure in the
> superblock (which will also need to be coordinated) and proceeds, the
> other machine may then attempt a read from the device and grab a copy
> of data that is stale.

Right. The cheapest way is to mark the drive as failed on all nodes at
the same time; anything else really does require write-shipping.


Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
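
P.S.: For reference, a rough sketch of the kind of comparison described
above. The volume group name "vg0", LV names, sizes and the dd-based
measurement are placeholders rather than the actual test configuration;
it assumes vg0 spans at least two PVs and clustered locking is in use.

    # Baseline: unmirrored LV ("-m0" is the default)
    lvcreate -m0 -L 10G -n lv_plain vg0

    # Cluster-wide mirrored LV; this is where the ~70% drop showed up
    lvcreate -m1 -L 10G -n lv_mirror vg0

    # Exclusive activation on a single node brought it back to ~95%
    lvchange -an  vg0/lv_mirror     # deactivate on all nodes first
    lvchange -aey vg0/lv_mirror     # activate exclusively on this node

    # Simple sequential-write comparison, bypassing the page cache
    dd if=/dev/zero of=/dev/vg0/lv_plain  bs=1M count=4096 oflag=direct
    dd if=/dev/zero of=/dev/vg0/lv_mirror bs=1M count=4096 oflag=direct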