Re: RBD Exclusive lock to shared lock

On Fri, Mar 25, 2022 at 4:11 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Thu, Mar 24, 2022 at 2:04 PM Budai Laszlo <laszlo.budai@xxxxxxxxx> wrote:
> >
> > Hi Ilya,
> >
> > Thank you for your answer!
> >
> > On 3/24/22 14:09, Ilya Dryomov wrote:
> >
> >
> > How can we see whether a lock is exclusive or shared? The "rbd lock ls" command output looks identical in the two cases.
> >
> > You can't.  The way --exclusive is implemented is that the client simply
> > refuses to release the lock when it gets the request to do so.  This
> > isn't tracked on the OSD side in any way, so "rbd lock ls" doesn't have
> > that information.
> >
> >
> > If I understand correctly, the lock itself is an OSD "flag", but whether it is treated as shared or exclusive is a local decision of the client. Is this correct?
>
> Hi Laszlo,
>
> Not entirely.  There are two orthogonal concepts: shared vs exclusive
> and managed vs unmanaged.
>
> The distinction between shared and exclusive is what you would expect:
> a shared lock can be held by multiple clients at the same time (as long
> as they all use the same lock tag -- a free-form string).  An exclusive
> lock can only be held by a single client at a time.
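>
> As a sketch, taking a shared (unmanaged) lock from two clients might
> look like this -- the pool/image name "rbd/test", the tag "my-tag" and
> the lock IDs are just placeholders:
>
>   # client A
>   rbd lock add --shared my-tag rbd/test lock-a
>
>   # client B -- succeeds because the tag matches
>   rbd lock add --shared my-tag rbd/test lock-b
>
>   # both lockers show up in
>   rbd lock ls rbd/test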
>
> Managed vs unmanaged refers to whether librbd is involved.  For the
> managed case, if an image is opened in read-write mode, librbd ensures
> that a lock is taken before proceeding with any write (and in certain
> cases before proceeding with any read as well).  If the lock is owned
> by another client at that time, it is transparently requested and,
> unless the other client is in the poorly named --exclusive mode, the
> lock is eventually transitioned behind the scenes.  A managed lock
> doesn't prevent two clients from writing to the same image: its sole
> purpose is to prevent them from doing that at _exactly_ the same
> moment in time.  The use case is protecting the RBD image's internal
> metadata, such as the object map, from concurrent modifications.
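>
> As a minimal sketch, this is the mode you get simply by enabling the
> image feature (the image name below is a placeholder):
>
>   rbd feature enable rbd/test exclusive-lock
>   rbd feature enable rbd/test object-map   # object-map requires exclusive-lock
>
> From then on librbd acquires and transitions the lock behind the
> scenes; no extra commands are needed.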
>
> For the unmanaged case, everything is up to the user.  It is completely
> external to librbd, meaning that librbd would happily scribble over the
> image if the user doesn't check on the lock before mapping the image or
> starting some operation.  The use case is providing a building block
> for users building their own orchestration on top of RBD.
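>
> A minimal sketch of such external orchestration, with placeholder
> image, lock and device names:
>
>   # before mapping, make sure nobody else holds the lock
>   rbd lock ls rbd/test
>
>   # take the lock ourselves, then map
>   rbd lock add rbd/test node-a
>   rbd map rbd/test
>
>   # when done: unmap and release (the locker id, e.g. "client.1234",
>   # comes from "rbd lock ls")
>   rbd unmap /dev/rbd0
>   rbd lock rm rbd/test node-a client.1234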
>
> The matrix is as follows:
>
> - unmanaged/exclusive           "rbd lock add"
>
> - unmanaged/shared              "rbd lock add --shared"
>
> - managed/exclusive with        exclusive-lock image feature
>   automatic transitions
>
> - managed/exclusive without     exclusive-lock image feature
>   automatic transitions         with --exclusive mapping option
>
> - managed/shared                technically possible but not
>                                 surfaced to the user
>
> >
> > If my previous understanding is correct, then I assume it would not be impossible to modify the client code so that how it handles lock release requests can be configured on the fly.
>
> Not impossible, but pretty hard...
>
> >
> > My use case would be an HA cluster where a VM maps an RBD image and then encounters some network issue. Another node of the HA cluster could start the VM and map the image again, but once the networking on the first VM is fixed, that VM would keep using the already mapped image. If I could instruct my second VM to treat the lock as exclusive after an automatic failover, then I'm protected against data corruption when the networking of the initial VM is fixed. But I assume that a STONITH kind of fencing could also do the job (if it can be implemented).
>
> I would suggest using unmanaged locks here -- this is exactly what
> they are for.
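>
> A rough sketch of a failover with unmanaged locks, with placeholder
> image, lock and client values (the exact fencing policy is up to your
> cluster manager):
>
>   # on the surviving node: see who holds the lock
>   rbd lock ls rbd/test
>
>   # fence the old client at the Ceph level so it can no longer write
>   # (older releases call this "ceph osd blacklist add")
>   ceph osd blocklist add <old-client-addr>
>
>   # break the stale lock, take our own and map
>   rbd lock rm rbd/test <old-lock-id> <old-locker>
>   rbd lock add rbd/test node-b
>   rbd map rbd/test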

Actually, if the above is all you need and STONITH at the Ceph level is
sufficient, you could use --exclusive as it is today.  If you always map
with --exclusive on all nodes, then for the scenario you are describing,
the mapping on the second node would automatically fence the mapping on
the first node (or, if the first node is alive, the mapping would fail).
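
A sketch of that setup, with a placeholder image name:

  # on every node of the HA cluster
  rbd map --exclusive rbd/test

With that in place, whichever node maps last either fails (peer still
alive) or fences the peer's mapping (peer dead or cut off).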

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


