RBD image locking is on the roadmap, but it's tricky. Almost all of the pieces are in place for exclusive locking of the image header, which will let the user know when other nodes have the image mapped, and give them the option to break the lock and take over ownership.

The real challenge is fencing. Unlike more conventional options like SCSI, an RBD image is distributed across the entire cluster, so ensuring that the old guy doesn't still have IOs in flight that will stomp on the new owner means that potentially everyone needs to be informed that the old guy should be locked out. I think there are a few options:

1- The user has their own fencing or STONITH on top of rbd, informed by the rbd locking. Pull the plug, update your iptables, whatever. Not very friendly.

2- Extend the rados 'blacklist' functionality to let you ensure that every node in the cluster has received the updated osdmap+blacklist information, so that you can be sure no further IO from the old guy is possible.

3- Use the same approach that ceph-mds fencing uses, in which the old owner isn't known to be fenced away from a particular object until the new owner reads/touches that object.

My hope is that we can get away with #3, in which case all of the basic pieces are in place and the real remaining work is integration and testing. The logic goes something like this: file systems write to blocks on disk in a somewhat ordered fashion. After writing a bunch of data, they approach a 'consistency point' where their journal and/or superblocks must be flushed and things 'commit' to disk. At that point, if the IO fails or blocks, it won't continue to clobber other parts of the disk. When an fs is mounted, those same critical areas are read (superblock, journal, etc.). The existing client/osd interaction ensures that if the new guy knows that the old guy is fenced, the act of reading ensures that the relevant ceph-osds will find out too, and that particular object will be fenced.
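To make the lazy-fencing idea in #3 concrete, here is a toy sketch of the logic. This is not Ceph code: the names (OSDStub, handle_io) and the epoch/blacklist plumbing are made up, standing in for the real client/osd osdmap-sharing protocol. The point is just that every request carries the sender's map epoch, so the first touch from the new owner is what teaches an OSD about the fence.

```python
# Toy model of approach #3 (lazy, per-object fencing). Hypothetical
# names; the real mechanism is the client/osd osdmap exchange.

class OSDStub:
    """One OSD holding some objects; it learns about fences lazily."""
    def __init__(self):
        self.known_epoch = 0          # latest map epoch this OSD has seen
        self.fenced_clients = set()   # clients this OSD knows are fenced

    def handle_io(self, client, client_epoch, fences):
        # Every request carries the sender's map epoch; a newer epoch
        # brings the blacklist entries that came with that map along.
        if client_epoch > self.known_epoch:
            self.known_epoch = client_epoch
            self.fenced_clients |= fences
        if client in self.fenced_clients:
            return "EBLACKLISTED"     # old owner's IO is rejected
        return "OK"

# Old owner was writing at epoch 1; the new owner takes over at
# epoch 2, with the old owner blacklisted in that map.
osd = OSDStub()
assert osd.handle_io("old", 1, set()) == "OK"   # before takeover

# Until the new owner touches this OSD's objects, the old owner can
# still write here -- the fence hasn't been observed yet.
assert osd.handle_io("old", 1, set()) == "OK"

# The new owner reads the superblock/journal objects; the read carries
# epoch 2 and the blacklist, so this OSD now fences the old owner.
assert osd.handle_io("new", 2, {"old"}) == "OK"
assert osd.handle_io("old", 1, set()) == "EBLACKLISTED"
```

The contrast with #2 is that nothing here requires a cluster-wide round trip up front; each OSD is fenced exactly when the new owner first reads from it, which is why it only helps clients (file systems) that read their critical areas at mount time.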
The resulting conclusion is that if a file system (or an application on top of it doing direct IO) is sufficiently well-behaved that it will not corrupt itself when the disk reorders IOs (they are) and issues barrier/flush operations at the appropriate times (in modern kernels, they do), then it will work. I suppose it's roughly analogous to Schroedinger's cat: until the new owner reads a block, it may or may not still be modified/modifiable by the old guy, but as soon as it is observed, its state is known.

What do you guys think? If that doesn't work, I think we're stuck with #2, which is expensive but doable.

sage