Re: Snapshots of consistency groups

Those are all internal classes -- the cls types are already
dependencies within the librbd internals. Feel free to add the
necessary include and use it directly from within librbd.

On Fri, Sep 9, 2016 at 6:41 PM, Victor Denisov <vdenisov@xxxxxxxxxxxx> wrote:
> I have a question about where the SnapshotNamespace type should be placed.
> I placed it in cls/rbd/cls_rbd_types.h because both the cls client and the
> cls backend components need access to this type.
> The type is also required in librbd/Operations.cc, because we want to
> specify in which namespace Operations::snap_create should create
> snapshots.
> However, Operations.cc doesn't include cls_rbd_types.h right now. If the
> question were about the public interface of librbd/librbd.cc, I would
> create a duplicate of the SnapshotNamespace type in the librbd layer
> without hesitation.
> But these functions are internal, so my question is whether it's really
> worth creating another SnapshotNamespace type in the librbd layer.
>
>
> On Mon, Aug 29, 2016 at 2:10 PM, Victor Denisov <vdenisov@xxxxxxxxxxxx> wrote:
>> Right, I forgot about snapshot "namespaces". I'll add this part.
>> I guess it makes sense to discuss the whole thing on the next CDM.
>>
>> On Sun, Aug 28, 2016 at 5:37 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>> I think the first step is to implement the concept of snapshot "namespaces".
>>>
>>> This could be implemented as an optional variant structure associated
>>> with each snapshot at creation (see the ImageWatcher RPC messages or
>>> the journaling event type encoding for examples of this). For
>>> consistency group snapshots, this structure would identify the snapshot
>>> as belonging to the consistency group and carry a unique id linking
>>> back to the specific group snapshot.
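>>>
>>> For illustration, a rough standalone sketch of the shape of such a
>>> structure (all names and fields below are made up for the example, not
>>> the actual cls_rbd_types encoding):
>>>
>>> ```
>>> #include <cstdint>
>>> #include <iostream>
>>> #include <string>
>>> #include <variant>
>>> #include <vector>
>>>
>>> // Plain user-created snapshot: no extra metadata needed.
>>> struct UserSnapshotNamespace {};
>>>
>>> // Snapshot created as part of a consistency group: carries a link back
>>> // to the specific group snapshot record.
>>> struct GroupSnapshotNamespace {
>>>   int64_t group_pool;
>>>   std::string group_id;
>>>   std::string group_snapshot_id;
>>> };
>>>
>>> using SnapshotNamespace =
>>>     std::variant<UserSnapshotNamespace, GroupSnapshotNamespace>;
>>>
>>> struct SnapInfo {
>>>   std::string name;
>>>   SnapshotNamespace snap_namespace;
>>> };
>>>
>>> int main() {
>>>   std::vector<SnapInfo> snaps = {
>>>       {"nightly", UserSnapshotNamespace{}},
>>>       {"grp-snap-1", GroupSnapshotNamespace{2, "group-abc", "gsnap-0001"}},
>>>   };
>>>
>>>   for (const auto &s : snaps) {
>>>     if (const auto *g =
>>>             std::get_if<GroupSnapshotNamespace>(&s.snap_namespace)) {
>>>       std::cout << s.name << ": group " << g->group_id
>>>                 << ", group snapshot " << g->group_snapshot_id << "\n";
>>>     } else {
>>>       std::cout << s.name << ": user snapshot\n";
>>>     }
>>>   }
>>>   return 0;
>>> }
>>> ```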
>>>
>>> When creating a snapshot, the state machine would (1) create the group
>>> snapshot record, (2) set the state of the group to "creating snapshot"
>>> (to prevent images from being added to or removed from the group while
>>> the op is in progress), (3) acquire the exclusive lock for all images
>>> in the group, (4) create the individual image snapshots with the
>>> linkage back to the group snapshot record (this can be performed in
>>> parallel up to the max concurrent ops), (5) release the exclusive
>>> locks, and (6) reset the group status to "ready".
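>>>
>>> As a rough standalone sketch of that sequence (purely illustrative; the
>>> real code would be an asynchronous state machine driven by Context
>>> callbacks, and every name here is made up):
>>>
>>> ```
>>> #include <iostream>
>>> #include <string>
>>> #include <vector>
>>>
>>> enum class GroupState { READY, CREATING_SNAPSHOT };
>>>
>>> struct Image {
>>>   std::string name;
>>>   bool locked = false;
>>>   bool snapped = false;
>>> };
>>>
>>> // Walks the six steps in order; a real implementation would branch to a
>>> // cleanup path on any failure (remove partial snapshots, reset state).
>>> void create_group_snapshot(std::vector<Image> &images, GroupState &state,
>>>                            const std::string &group_snap_id) {
>>>   std::cout << "(1) create group snapshot record " << group_snap_id << "\n";
>>>
>>>   state = GroupState::CREATING_SNAPSHOT;  // (2) block membership changes
>>>
>>>   for (auto &img : images) {              // (3) acquire exclusive locks
>>>     img.locked = true;
>>>   }
>>>
>>>   for (auto &img : images) {              // (4) per-image snapshots linked
>>>     img.snapped = true;                   //     back to group_snap_id
>>>     std::cout << "(4) snapshot of " << img.name << " -> " << group_snap_id
>>>               << "\n";
>>>   }
>>>
>>>   for (auto &img : images) {              // (5) release exclusive locks
>>>     img.locked = false;
>>>   }
>>>
>>>   state = GroupState::READY;              // (6) group is ready again
>>> }
>>>
>>> int main() {
>>>   std::vector<Image> images = {{"img0"}, {"img1"}, {"img2"}};
>>>   GroupState state = GroupState::READY;
>>>   create_group_snapshot(images, state, "gsnap-0001");
>>>   std::cout << "group state = "
>>>             << (state == GroupState::READY ? "READY" : "CREATING_SNAPSHOT")
>>>             << "\n";
>>>   return 0;
>>> }
>>> ```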
>>>
>>> If you have a hard crash/failure anywhere, a "snap remove" operation
>>> should be designed to get the group back into a consistent state (i.e.
>>> remove any snapshots linked to the group and reset the group state
>>> back to "ready").
>>>
>>> On Fri, Aug 26, 2016 at 5:05 PM, Victor Denisov <vdenisov@xxxxxxxxxxxx> wrote:
>>>> Guys,
>>>>
>>>> I updated the Snapshots section of this document:
>>>> http://pad.ceph.com/p/consistency_groups, in accordance with my
>>>> improved understanding of how it should be implemented.
>>>> Please take a look and provide your comments. I highlighted some of my
>>>> concerns regarding the implementation in bold.
>>>>
>>>> Looking forward to your valuable remarks.
>>>> Thanks in advance.
>>>> V.
>>>>
>>>>
>>>> On Sat, Aug 20, 2016 at 9:27 AM, Mykola Golub <mgolub@xxxxxxxxxxxx> wrote:
>>>>> On Fri, Aug 19, 2016 at 05:36:56PM -0700, Victor Denisov wrote:
>>>>>> What if I'm holding this lock and somebody else is trying to reacquire the lock?
>>>>>> How do I get notified about it?
>>>>>
>>>>> The image watcher is notified, which triggers its handler:
>>>>>
>>>>>  ImageWatcher<I>::handle_payload(const RequestLockPayload &payload, C_NotifyAck *ack_ctx)
>>>>>
>>>>> The handler calls the current lock policy's `lock_requested()` method,
>>>>> which decides what to do with the lock request. The StandardPolicy is
>>>>> to release the lock, so it may ping-pong between the clients. You may
>>>>> define a different policy -- rbd-mirror is an example of where that is
>>>>> done.
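>>>>>
>>>>> A tiny standalone sketch of the policy idea (this is not the actual
>>>>> librbd policy interface, just the shape of it):
>>>>>
>>>>> ```
>>>>> #include <iostream>
>>>>>
>>>>> // The RequestLockPayload handler consults the policy; the standard
>>>>> // behaviour is to release the lock, but a policy may refuse instead.
>>>>> struct LockPolicy {
>>>>>   virtual ~LockPolicy() = default;
>>>>>   virtual bool lock_requested() = 0;  // true => release the lock
>>>>> };
>>>>>
>>>>> struct StandardPolicy : LockPolicy {
>>>>>   bool lock_requested() override { return true; }   // cooperative release
>>>>> };
>>>>>
>>>>> struct RefusePolicy : LockPolicy {
>>>>>   bool lock_requested() override { return false; }  // keep ownership
>>>>> };
>>>>>
>>>>> void handle_request_lock(LockPolicy &policy) {
>>>>>   if (policy.lock_requested()) {
>>>>>     std::cout << "releasing exclusive lock to requesting client\n";
>>>>>   } else {
>>>>>     std::cout << "lock request ignored; we remain the owner\n";
>>>>>   }
>>>>> }
>>>>>
>>>>> int main() {
>>>>>   StandardPolicy standard;
>>>>>   RefusePolicy refuse;
>>>>>   handle_request_lock(standard);  // may ping-pong the lock between clients
>>>>>   handle_request_lock(refuse);    // rbd-mirror-style: holds on to the lock
>>>>>   return 0;
>>>>> }
>>>>> ```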
>>>>>
>>>>> Everywhere an operation needs the exclusive lock, the code first checks
>>>>> whether we are currently the lock owner, i.e.:
>>>>>
>>>>>  ictx->exclusive_lock->is_lock_owner()
>>>>>
>>>>> and if it is false, the exclusive lock is requested. Before this check
>>>>> you need to acquire ictx->owner_lock, and until you release owner_lock
>>>>> you can be sure your exclusive lock will not leak to another
>>>>> client. After releasing owner_lock, you will need to repeat the check
>>>>> when you need it again.
>>>>>
>>>>> --
>>>>> Mykola Golub
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 5:48 AM, Mykola Golub <mgolub@xxxxxxxxxxxx> wrote:
>>>>>> > On Thu, Aug 18, 2016 at 09:20:02PM -0700, Victor Denisov wrote:
>>>>>> >> Could you please point me to the place in the source code where a
>>>>>> >> writer acquires an exclusive lock on the image?
>>>>>> >
>>>>>> > Grep for 'exclusive_lock->request_lock'. Basically, what you need
>>>>>> > (after opening the image) is:
>>>>>> >
>>>>>> > ```
>>>>>> >   C_SaferCond lock_ctx;
>>>>>> >   {
>>>>>> >     RWLock::WLocker l(ictx->owner_lock);
>>>>>> >
>>>>>> >     if (ictx->exclusive_lock == nullptr) {
>>>>>> >       // exclusive-lock feature is not enabled
>>>>>> >       return -EINVAL;
>>>>>> >     }
>>>>>> >
>>>>>> >     // Request the lock. If it is currently owned by another client,
>>>>>> >     // an RPC message will be sent to that client asking it to release the lock.
>>>>>> >     ictx->exclusive_lock->request_lock(&lock_ctx);
>>>>>> >   } // release owner_lock before waiting to avoid potential deadlock
>>>>>> >
>>>>>> >   int r = lock_ctx.wait();
>>>>>> >   if (r < 0) {
>>>>>> >     return r;
>>>>>> >   }
>>>>>> >
>>>>>> >   RWLock::RLocker l(ictx->owner_lock);
>>>>>> >   if (ictx->exclusive_lock == nullptr || !ictx->exclusive_lock->is_lock_owner()) {
>>>>>> >        // failed to acquire exclusive lock
>>>>>> >        return -EROFS;
>>>>>> >   }
>>>>>> >
>>>>>> >   // At this point lock is acquired
>>>>>> >   ...
>>>>>> >
>>>>>> > ```
>>>>>> >
>>>>>> > You might want to look at this PR
>>>>>> >
>>>>>> >  https://github.com/ceph/ceph/pull/9592
>>>>>> >
>>>>>> > where we discuss adding API methods to directly acquire and release
>>>>>> > the exclusive lock. You don't need the API itself, but you will find
>>>>>> > examples in the patch, as well as useful comments from Jason.
>>>>>> >
>>>>>> > --
>>>>>> > Mykola Golub
>>>>>> >
>>>>>> >> I presume you were talking about the exclusive_lock/shared_lock
>>>>>> >> feature, which can be used from the command line with the lock list
>>>>>> >> and lock break commands.
>>>>>> >>
>>>>>> >> On Thu, Aug 18, 2016 at 5:47 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>> >> > There is already a "request lock" RPC message, and it is handled
>>>>>> >> > transparently within librbd when you attempt to acquire the lock while
>>>>>> >> > another client owns it.
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Thursday, August 18, 2016, Victor Denisov <vdenisov@xxxxxxxxxxxx> wrote:
>>>>>> >> >>
>>>>>> >> >> If an image already has a writer who owns the lock,
>>>>>> >> >> should I implement a notification that asks the writer to release the
>>>>>> >> >> lock, or is there already a standard way to take over the exclusive
>>>>>> >> >> lock?
>>>>>> >> >>
>>>>>> >> >> On Tue, Aug 16, 2016 at 6:29 AM, Jason Dillaman <jdillama@xxxxxxxxxx>
>>>>>> >> >> wrote:
>>>>>> >> >> > ... one more thing:
>>>>>> >> >> >
>>>>>> >> >> > I was also thinking that we need a new RBD feature bit to indicate
>>>>>> >> >> > that an image is part of a consistency group, to prevent older
>>>>>> >> >> > librbd clients from removing the image or group snapshots.  This could
>>>>>> >> >> > be a RBD_FEATURES_RW_INCOMPATIBLE feature bit so older clients can
>>>>>> >> >> > still open the image R/O while it's part of a group.
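>>>>>> >> >> >
>>>>>> >> >> > A standalone sketch of the intended behaviour (made-up bit values and
>>>>>> >> >> > a deliberately simplified check, not the actual librbd feature logic):
>>>>>> >> >> >
>>>>>> >> >> > ```
>>>>>> >> >> > #include <cstdint>
>>>>>> >> >> > #include <iostream>
>>>>>> >> >> >
>>>>>> >> >> > // Feature bits an older client knows about; the values are illustrative.
>>>>>> >> >> > constexpr uint64_t FEATURE_LAYERING       = 1ULL << 0;
>>>>>> >> >> > constexpr uint64_t FEATURE_EXCLUSIVE_LOCK = 1ULL << 2;
>>>>>> >> >> > constexpr uint64_t CLIENT_SUPPORTED =
>>>>>> >> >> >     FEATURE_LAYERING | FEATURE_EXCLUSIVE_LOCK;
>>>>>> >> >> >
>>>>>> >> >> > // Hypothetical new bit set on images that belong to a consistency group.
>>>>>> >> >> > constexpr uint64_t FEATURE_GROUP_MEMBER = 1ULL << 10;
>>>>>> >> >> >
>>>>>> >> >> > // RW-incompatible semantics: an unknown feature bit blocks R/W opens
>>>>>> >> >> > // but still allows R/O opens.
>>>>>> >> >> > bool can_open(uint64_t image_features, bool read_only) {
>>>>>> >> >> >   uint64_t unknown = image_features & ~CLIENT_SUPPORTED;
>>>>>> >> >> >   if (unknown == 0) {
>>>>>> >> >> >     return true;      // the client understands every feature on the image
>>>>>> >> >> >   }
>>>>>> >> >> >   return read_only;   // unknown bits: refuse R/W, allow R/O
>>>>>> >> >> > }
>>>>>> >> >> >
>>>>>> >> >> > int main() {
>>>>>> >> >> >   uint64_t features = FEATURE_EXCLUSIVE_LOCK | FEATURE_GROUP_MEMBER;
>>>>>> >> >> >   std::cout << "R/W open allowed: " << can_open(features, false) << "\n";
>>>>>> >> >> >   std::cout << "R/O open allowed: " << can_open(features, true) << "\n";
>>>>>> >> >> >   return 0;
>>>>>> >> >> > }
>>>>>> >> >> > ```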
>>>>>> >> >> >
>>>>>> >> >> > On Tue, Aug 16, 2016 at 9:26 AM, Jason Dillaman <jdillama@xxxxxxxxxx>
>>>>>> >> >> > wrote:
>>>>>> >> >> >> Way back in April when we had the CDM, I was originally thinking we
>>>>>> >> >> >> should implement option 3. Essentially, you have a prepare group
>>>>>> >> >> >> snapshot RPC message that extends a "paused IO" lease to the caller.
>>>>>> >> >> >> When that lease expires, IO would automatically be resumed even if the
>>>>>> >> >> >> group snapshot hasn't been created yet.  This would also require
>>>>>> >> >> >> commit/abort group snapshot RPC messages.
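>>>>>> >> >> >>
>>>>>> >> >> >> For illustration, a small standalone sketch of the lease idea (all
>>>>>> >> >> >> names and numbers are made up): the image client records a quiesce
>>>>>> >> >> >> deadline and simply resumes IO once the lease expires without a
>>>>>> >> >> >> commit or abort ever arriving.
>>>>>> >> >> >>
>>>>>> >> >> >> ```
>>>>>> >> >> >> #include <chrono>
>>>>>> >> >> >> #include <iostream>
>>>>>> >> >> >> #include <thread>
>>>>>> >> >> >>
>>>>>> >> >> >> using Clock = std::chrono::steady_clock;
>>>>>> >> >> >>
>>>>>> >> >> >> // Per-image-client state for a "paused IO" lease granted to the caller.
>>>>>> >> >> >> struct QuiesceLease {
>>>>>> >> >> >>   Clock::time_point expires_at;
>>>>>> >> >> >>   bool committed = false;
>>>>>> >> >> >>   bool aborted = false;
>>>>>> >> >> >>
>>>>>> >> >> >>   bool io_paused() const {
>>>>>> >> >> >>     // IO stays paused only while the lease is live and unresolved.
>>>>>> >> >> >>     return !committed && !aborted && Clock::now() < expires_at;
>>>>>> >> >> >>   }
>>>>>> >> >> >> };
>>>>>> >> >> >>
>>>>>> >> >> >> int main() {
>>>>>> >> >> >>   // "prepare group snapshot" grants a short lease (2 seconds here).
>>>>>> >> >> >>   QuiesceLease lease{Clock::now() + std::chrono::seconds(2)};
>>>>>> >> >> >>   std::cout << "IO paused: " << lease.io_paused() << "\n";  // 1
>>>>>> >> >> >>
>>>>>> >> >> >>   // The group client dies without sending commit/abort: the lease
>>>>>> >> >> >>   // simply runs out and IO resumes on its own.
>>>>>> >> >> >>   std::this_thread::sleep_for(std::chrono::seconds(3));
>>>>>> >> >> >>   std::cout << "IO paused: " << lease.io_paused() << "\n";  // 0
>>>>>> >> >> >>   return 0;
>>>>>> >> >> >> }
>>>>>> >> >> >> ```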
>>>>>> >> >> >>
>>>>>> >> >> >> However, thinking about this last night, here is another potential
>>>>>> >> >> >> option:
>>>>>> >> >> >>
>>>>>> >> >> >> Option 4 - require images to have the exclusive-lock feature before
>>>>>> >> >> >> they can be added to a consistency group (and prevent disabling of
>>>>>> >> >> >> exclusive-lock while they are part of a group). Then librbd, via the
>>>>>> >> >> >> rbd CLI (or a client application using the rbd consistency group snap
>>>>>> >> >> >> create API), can co-operatively acquire the lock from all active image
>>>>>> >> >> >> clients within the group (i.e. all IO has been flushed and paused) and
>>>>>> >> >> >> can proceed with snapshot creation. If the rbd CLI dies, the normal
>>>>>> >> >> >> exclusive lock handling process will automatically take care of
>>>>>> >> >> >> re-acquiring the lock from the dead client and resuming IO.
>>>>>> >> >> >>
>>>>>> >> >> >> This option not only re-uses existing code, it would also eliminate
>>>>>> >> >> >> the need to add/update the RPC messages for prepare/commit/abort
>>>>>> >> >> >> snapshot creation to support group snapshots (since it could all be
>>>>>> >> >> >> handled internally).
>>>>>> >> >> >>
>>>>>> >> >> >> On Mon, Aug 15, 2016 at 7:46 PM, Victor Denisov <vdenisov@xxxxxxxxxxxx>
>>>>>> >> >> >> wrote:
>>>>>> >> >> >>> Gentlemen,
>>>>>> >> >> >>>
>>>>>> >> >> >>> I'm writing to you to ask for your opinion regarding quiescing writes.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Here is the situation. In order to take snapshots of all images in a
>>>>>> >> >> >>> consistency group, we first need to quiesce all the image writers in
>>>>>> >> >> >>> the group. Let me call a "group client" a client which asks a
>>>>>> >> >> >>> consistency group to take a snapshot, and an "image client" a client
>>>>>> >> >> >>> that writes to an image.
>>>>>> >> >> >>> Let's say the group client starts sending notify_quiesce to all image
>>>>>> >> >> >>> clients that write to images in the group. After quiescing half of
>>>>>> >> >> >>> the image clients, the group client could die.
>>>>>> >> >> >>>
>>>>>> >> >> >>> This presents us with a dilemma: what should we do with those quiesced
>>>>>> >> >> >>> image clients?
>>>>>> >> >> >>>
>>>>>> >> >> >>> Option 1 - wait until someone manually runs recovery for that
>>>>>> >> >> >>> consistency group.
>>>>>> >> >> >>> We can show a warning next to those unfinished groups when the user
>>>>>> >> >> >>> runs the group list command.
>>>>>> >> >> >>> There would be a command like group recover, which allows users to
>>>>>> >> >> >>> roll back unsuccessful snapshots or continue them using the create
>>>>>> >> >> >>> snapshot command.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Option 2 - establish heartbeats between the group client and the
>>>>>> >> >> >>> image clients. If the group client fails to heartbeat, the image
>>>>>> >> >> >>> client unquiesces itself and continues normal operation.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Option 3 - have a timeout on each image client. If the group client
>>>>>> >> >> >>> fails to complete the group snapshot within this timeout, the image
>>>>>> >> >> >>> client resumes normal operation and informs the group client of the
>>>>>> >> >> >>> fact.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Which of these options do you prefer? There are probably other options
>>>>>> >> >> >>> that I have missed.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Thanks,
>>>>>> >> >> >>> Victor.
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> --
>>>>>> >> >> >> Jason
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > Jason
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> > Jason
>>>>>> >> >
>>>
>>>
>>>
>>> --
>>> Jason



-- 
Jason


