We've discussed some of the issues here a little bit before. See
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094 if
you're interested.

Josh, can you discuss the current status of the advisory locking?
-Greg

On Sun, Aug 12, 2012 at 8:44 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> RBD image locking is on the roadmap, but it's tricky. Almost all of the
> pieces are in place for exclusive locking of the image header, which will
> let the user know when other nodes have the image mapped, and give them
> the option to break their lock and take over ownership.
>
> The real challenge is fencing. Unlike more conventional options like
> SCSI, the RBD image is distributed across the entire cluster, so ensuring
> that the old guy doesn't still have IOs in flight that will stomp on the
> new owner means that potentially everyone needs to be informed that the
> old guy should be locked out.
>
> I think there are a few options:
>
> 1- The user has their own fencing or STONITH on top of rbd, informed by
>    the rbd locking. Pull the plug, update your iptables, whatever. Not
>    very friendly.
> 2- Extend the rados 'blacklist' functionality to let you ensure that every
>    node in the cluster has received the updated osdmap+blacklist
>    information, so that you can be sure no further IO from the old guy is
>    possible.
> 3- Use the same approach that ceph-mds fencing uses, in which the old
>    owner isn't known to be fenced away from a particular object until the
>    new owner reads/touches that object.
>
> My hope is that we can get away with #3, in which case all of the basic
> pieces are in place and the real remaining work is integration and
> testing. The logic goes something like this:
>
> File systems write to blocks on disk in a somewhat ordered fashion.
> After writing a bunch of data, they approach a 'consistency point' where
> their journal and/or superblocks must be flushed and things 'commit' to
> disk. At that point, if the IO fails or blocks, it won't continue to
> clobber other parts of the disk.
>
> When an fs is mounted, those same critical areas are read (superblock,
> journal, etc.). The existing client/osd interaction ensures that if
> the new guy knows that the old guy is fenced, the act of reading
> ensures that the relevant ceph-osds will find out too, and that
> particular object will be fenced.
>
> The resulting conclusion is that if a file system (or an application on
> top of it doing direct IO) is sufficiently well-behaved that it will not
> corrupt itself when the disk reorders IOs (they do), and it issues
> barrier/flush operations at the appropriate times (in modern kernels,
> they do), then it will work.
>
> I suppose it's roughly analogous to Schroedinger's cat: until the new
> owner reads a block, it may or may not still be modified/modifiable by
> the old guy, but as soon as it is observed, its state is known.
>
> What do you guys think? If that doesn't work, I think we're stuck with
> #2, which is expensive but doable.
>
> sage
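
For concreteness, a minimal sketch of the advisory-lock flow Sage describes
(lock the header, notice another holder, break the lock and take over), using
the Python rados/rbd bindings. The pool name, image name, and lock cookie are
made up for illustration, and the exact exceptions raised when the lock is
already held are an assumption:

    import rados
    import rbd

    # Connect to the cluster and open the pool that holds the image.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # pool name: illustrative

    image = rbd.Image(ioctx, 'myimage')          # image name: illustrative
    try:
        # Take the advisory exclusive lock on the image header.
        image.lock_exclusive('node-a')           # cookie naming this holder
    except (rbd.ImageBusy, rbd.ImageExists):
        # Another node has the image locked: see who, then optionally break
        # the lock and take over ownership.
        lockers = image.list_lockers()
        for client, cookie, addr in lockers['lockers']:
            print('lock held by %s at %s (cookie %r)' % (client, addr, cookie))
            image.break_lock(client, cookie)     # take over ownership
        image.lock_exclusive('node-a')

    # ... map/use the image ...

    image.unlock('node-a')
    image.close()
    ioctx.close()
    cluster.shutdown()

Breaking the lock this way only takes over ownership of the header; as Sage
notes, it does not by itself fence the old holder's in-flight IO.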
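
For option #2, the existing rados blacklist mechanism can be driven from the
same bindings. A rough sketch, where the client address would come from
list_lockers() above and the JSON fields mirror the 'ceph osd blacklist add'
CLI (treat the exact field names, address, and expiry as assumptions):

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors to blacklist the old owner's client address, so the
    # OSDs reject its IO once they have the new osdmap.  Address and expiry
    # here are illustrative.
    cmd = json.dumps({
        'prefix': 'osd blacklist',
        'blacklistop': 'add',
        'addr': '192.168.0.10:0/3012345',
        'expire': 3600.0,                        # seconds
    })
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    print(ret, outs)

    cluster.shutdown()

The gap Sage points out remains, though: this puts the blacklist entry into
the osdmap, but it does not by itself guarantee that every OSD has seen the
new map before the new owner starts writing.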