RBD image locking is on the roadmap, but it's tricky. Almost all of the pieces are in place for exclusive locking of the image header, which will let the user know when other nodes have the image mapped, and give them the option to break the lock and take over ownership.

The real challenge is fencing. Unlike more conventional options like SCSI, an RBD image is distributed across the entire cluster, so ensuring that the old guy doesn't still have IOs in flight that will stomp on the new owner means that potentially everyone needs to be informed that the old guy should be locked out. I think there are a few options:

1- The user has their own fencing or STONITH on top of rbd, informed by the rbd locking. Pull the plug, update your iptables, whatever. Not very friendly.

2- Extend the rados 'blacklist' functionality to let you ensure that every node in the cluster has received the updated osdmap+blacklist information, so that you can be sure no further IO from the old guy is possible.

3- Use the same approach that ceph-mds fencing uses, in which the old owner isn't known to be fenced away from a particular object until the new owner reads/touches that object.

My hope is that we can get away with #3, in which case all of the basic pieces are in place and the real remaining work is integration and testing. The logic goes something like this: file systems write to blocks on disk in a somewhat ordered fashion. After writing a bunch of data, they approach a 'consistency point' where their journal and/or superblocks must be flushed and things 'commit' to disk. At that point, if the IO fails or blocks, it won't continue to clobber other parts of the disk. When an fs is mounted, those same critical areas are read (superblock, journal, etc.). The existing client/osd interaction ensures that if the new guy knows that the old guy is fenced, the act of reading ensures that the relevant ceph-osds will find out too, and that particular object will be fenced.
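To make the lazy-fencing idea in #3 concrete, here is a toy sketch of the logic. This is not Ceph code: the names (OSDStub, handle_io) and the epoch/blacklist plumbing are made up, standing in for the real client/osd osdmap-sharing protocol. The point is just that every request carries the sender's map epoch, so the first touch from the new owner is what teaches an OSD about the fence.

```python
# Toy model of approach #3 (lazy, per-object fencing). Hypothetical
# names; the real mechanism is the client/osd osdmap exchange.

class OSDStub:
    """One OSD holding some objects; it learns about fences lazily."""
    def __init__(self):
        self.known_epoch = 0          # latest map epoch this OSD has seen
        self.fenced_clients = set()   # clients this OSD knows are fenced

    def handle_io(self, client, client_epoch, fences):
        # Every request carries the sender's map epoch; a newer epoch
        # brings the blacklist entries that came with that map along.
        if client_epoch > self.known_epoch:
            self.known_epoch = client_epoch
            self.fenced_clients |= fences
        if client in self.fenced_clients:
            return "EBLACKLISTED"     # old owner's IO is rejected
        return "OK"

# Old owner was writing at epoch 1; the new owner takes over at
# epoch 2, with the old owner blacklisted in that map.
osd = OSDStub()
assert osd.handle_io("old", 1, set()) == "OK"   # before takeover

# Until the new owner touches this OSD's objects, the old owner can
# still write here -- the fence hasn't been observed yet.
assert osd.handle_io("old", 1, set()) == "OK"

# The new owner reads the superblock/journal objects; the read carries
# epoch 2 and the blacklist, so this OSD now fences the old owner.
assert osd.handle_io("new", 2, {"old"}) == "OK"
assert osd.handle_io("old", 1, set()) == "EBLACKLISTED"
```

The contrast with #2 is that nothing here requires a cluster-wide round trip up front; each OSD is fenced exactly when the new owner first reads from it, which is why it only helps clients (file systems) that read their critical areas at mount time.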
The resulting conclusion is that if a file system (or an application on top of it doing direct IO) is sufficiently well-behaved that it will not corrupt itself when the disk reorders IOs (they are) and issues barrier/flush operations at the appropriate times (in modern kernels, they do), then it will work. I suppose it's roughly analogous to Schroedinger's cat: until the new owner reads a block, it may or may not still be modified/modifiable by the old guy, but as soon as it is observed, its state is known.

What do you guys think? If that doesn't work, I think we're stuck with #2, which is expensive but doable.

sage