Re: No lock on RBD allows several mounts on different servers...

On 08/13/2012 09:55 AM, Gregory Farnum wrote:
We've discussed some of the issues here a little bit before. See
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094 if
you're interested.

Josh, can you discuss the current status of the advisory locking?
-Greg

Yehuda reworked it into a generic rados class so it can be used outside
of rbd. It hasn't been re-integrated with rbd yet, and I haven't looked
at it closely since the generalization. Yehuda could describe it in
more detail.

Josh
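
A minimal C sketch of what driving such a generic advisory lock from
librados could look like, assuming it surfaces through
rados_lock_exclusive()/rados_unlock(); the pool, object, lock and cookie
names below are made up:

/*
 * Sketch: grab/release an exclusive advisory lock on an image header
 * object through librados.  The pool ("rbd"), object ("foo.rbd"), lock
 * name ("rbd_lock") and cookie are all illustrative.
 */
#include <stdio.h>
#include <sys/time.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    int r;

    if (rados_create(&cluster, NULL) < 0)
        return 1;
    rados_conf_read_file(cluster, NULL);   /* default ceph.conf locations */
    if (rados_connect(cluster) < 0)
        return 1;
    if (rados_ioctx_create(cluster, "rbd", &io) < 0)
        return 1;

    /* Only one exclusive holder at a time; a second caller gets -EBUSY. */
    r = rados_lock_exclusive(io, "foo.rbd", "rbd_lock", "mycookie",
                             "mapped on nodeA", NULL /* no expiry */, 0);
    if (r < 0)
        fprintf(stderr, "lock failed: %d\n", r);
    else
        rados_unlock(io, "foo.rbd", "rbd_lock", "mycookie");

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}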

On Sun, Aug 12, 2012 at 8:44 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
RBD image locking is on roadmap, but it's tricky.  Almost all of the
pieces are in place for exclusive locking of the image header, which will
let the user know when other nodes have the image mapped, and give them
the option to break their lock and take over ownership.
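
A sketch of that "see who holds it, then break their lock" step under the
same assumptions, taking rados_list_lockers()/rados_break_lock() as the
entry points; the object and lock names are again illustrative:

/*
 * Sketch: list the current holder of the header lock, then break it to
 * take over.  Assumes a connected rados_ioctx_t (as set up in the
 * earlier sketch).
 */
#include <stdio.h>
#include <rados/librados.h>

static void break_old_owner(rados_ioctx_t io)
{
    int exclusive = 0;
    char tag[128], clients[1024], cookies[1024], addrs[1024];
    size_t tag_len = sizeof(tag), clients_len = sizeof(clients);
    size_t cookies_len = sizeof(cookies), addrs_len = sizeof(addrs);

    ssize_t n = rados_list_lockers(io, "foo.rbd", "rbd_lock", &exclusive,
                                   tag, &tag_len, clients, &clients_len,
                                   cookies, &cookies_len, addrs, &addrs_len);
    if (n <= 0)
        return;                        /* nobody holds it, or an error */

    /* Buffers hold '\0'-separated entries; look at the first locker. */
    printf("held by %s (cookie %s) at %s\n", clients, cookies, addrs);

    /* The "break their lock and take over ownership" step. */
    rados_break_lock(io, "foo.rbd", "rbd_lock", clients, cookies);
}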

The real challenge is fencing.  Unlike more conventional options like
SCSI, the RBD image is distributed across the entire cluster, so ensuring
that the old guy doesn't still have IOs in flight that will stomp on the
new owner means that potentially everyone needs to be informed that the
bad guy should be locked out.

I think there are a few options:

1- The user has their own fencing or STONITH on top of rbd, informed by
    the rbd locking.  Pull the plug, update your iptables, whatever.  Not
    very friendly.
2- Extend the rados 'blacklist' functionality to let you ensure that every
    node in the cluster has received the updated osdmap+blacklist
    information, so that you can be sure no further IO from the old guy is
    possible (sketched below, after this list).
3- Use the same approach that ceph-mds fencing uses, in which the old
    owner isn't known to be fenced away from a particular object until the
    new owner reads/touches that object.
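
For option #2, a minimal sketch of the blacklisting piece, taking librados'
rados_blacklist_add() as the entry point.  It does not cover the harder
part described above, i.e. confirming that every node has actually received
the updated osdmap+blacklist.  The client address is made up and would in
practice come from the lock or watch info on the header:

/*
 * Sketch for option 2: blacklist the old client so OSDs reject whatever
 * IO it still has in flight.  Assumes a connected cluster handle; the
 * address is the old client's ip:port/nonce.
 */
#include <stdio.h>
#include <rados/librados.h>

static int fence_old_owner(rados_t cluster)
{
    char addr[] = "1.2.3.4:0/12345";
    int r = rados_blacklist_add(cluster, addr, 0 /* 0 = default expiry */);
    if (r < 0)
        fprintf(stderr, "blacklist failed: %d\n", r);
    return r;
}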

My hope is that we can get away with #3, in which case all of the basic
pieces are in place and the real remaining work is integration and
testing.  The logic goes something like this:

File systems write to blocks on disk in a somewhat ordered fashion.
After writing a bunch of data, they approach a 'consistency point' where
their journal and/or superblocks must be flushed and things 'commit' to
disk.  At that point, if the IO fails or blocks, it won't continue to
clobber other parts of the disk.
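
A rough user-space analogue of that consistency point (the path and record
are made up): the data write can sit in caches and be reordered, but
nothing past the commit point proceeds until the flush is acknowledged.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int append_record(const char *path, const char *rec)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, rec, strlen(rec));          /* freely reorderable */
    int r = (n == (ssize_t)strlen(rec)) ? fsync(fd) : -1;  /* commit point */

    close(fd);
    return r;
}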

When an fs is mounted, those same critical areas are read (superblock,
journal, etc.).  The existing client/osd interaction ensures that if
the new guy knows that the old guy is fenced, the act of reading
ensures that the relevant ceph-osds will find out too and that
particular object will be fenced.

The resulting conclusion is that if a file system (or application on top
of it doing direct io) is sufficiently well-behaved that it will not
corrupt itself when the disk reorders IOs (they do) and issues
barrier/flush operations at the appropriate time (in modern kernels, they
do), then it will work.

I suppose it's roughly analogous to Schroedinger's cat: until the new
owner reads a block, it may or may not still be modified/modifiable by the
old guy, but as soon as it is observed, its state is known.

What do you guys think?  If that doesn't work, I think we're stuck with
#2, which is expensive but doable.

sage