On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >
> > On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> > > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > This is V1 of the journaling feature in kernel rbd, which makes
> > > > mirroring in kubernetes possible. It passed
> > > > /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change,
> > > > as below:
> > > >
> > > > ```
> > > > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > index e019de5..9d00d3e 100755
> > > > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > @@ -854,9 +854,9 @@ write_image()
> > > >
> > > >      test -n "${size}" || size=4096
> > > >
> > > > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > > > -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > > > -        --io-pattern rand
> > > > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > > > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --eta-newline 1
> > > > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> > > >  }
> > > >
> > > >  stress_write_image()
> > > > ```
> > > >
> > > > Changelog from RFC:
> > > >   1. error out if there is an unsupported event type in replaying
> > >
> > > So the journal is still replayed in the kernel. Was there a design
> > > discussion about this?
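[Editor's aside: the test hunk above hardcodes /dev/rbd0, which breaks as
soon as any other image is mapped, and maps by image name but unmaps the
same way rather than by device. A minimal sketch of the same helper that
captures the device path printed by "rbd map" instead (variable names
follow rbd_mirror_helpers.sh; this is not the committed change):]

```shell
# Sketch of the patched write_image() helper, assuming "rbd map" prints
# the newly created device node (e.g. /dev/rbd1) on stdout. Capturing it
# avoids the hardcoded /dev/rbd0 in the original hunk.
write_image()
{
    local cluster=$1
    local pool=$2
    local image=$3
    local count=$4
    local size=$5
    local dev

    test -n "${size}" || size=4096

    dev=$(rbd --cluster "${cluster}" -p "${pool}" map "${image}")
    fio --name=test --rw=randwrite --bs="${size}" --runtime=60 \
        --ioengine=libaio --iodepth=1 --numjobs=1 --filename="${dev}" \
        --direct=1 --group_reporting --size=$((size * count)) \
        --eta-newline=1
    # Unmap by device path, so the right mapping goes away even if the
    # image is mapped more than once.
    rbd --cluster "${cluster}" -p "${pool}" unmap "${dev}"
}
```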
> > >
> > > Like I said in one of my replies to the RFC, I think we should avoid
> > > replaying the journal in the kernel and try to come up with a design
> > > where it's done by librbd.
> >
> > +1 to this. If "rbd [device] map" first replays the journal (by just
> > acquiring the exclusive lock via the API), then krbd would only need
> > to check that there are no events to replay. If there are one or more
> > events that it needs to replay for some reason, it implies that it
> > lost the exclusive lock to another client that changed the image *and*
> > failed to commit the entries to the journal. It seems reasonable to
> > just move the volume to R/O in that case, since something odd was
> > occurring.
>
> "rbd map" is easy -- we can fail it with a nice error message. The
> real issue is replay on reacquire. Quoting myself:
>
>     The fundamental problem with replaying the journal in the kernel,
>     and therefore supporting only a couple of event types, is that the
>     journal has to be replayed not only at "rbd map" time, but also
>     every time the exclusive lock is reacquired. Whereas we can safely
>     error out at "rbd map" time, I don't see a sane way to handle a
>     case where an unsupported event type is encountered after
>     reacquiring the lock when the image is already mapped.
>
> Consider the following scenario: the kernel gives up the lock for
> creating a snapshot, librbd writes the SNAP_CREATE event to the journal
> and crashes. Its watch times out and the kernel reacquires the lock,
> but there is an unsupported entry in the journal. What should we do?
>
> Marking the device read-only doesn't feel right. I wouldn't want my
> storage to freeze just because a maintenance pod went down. I think we
> need to look into making it so that the kernel can ask librbd to replay
> the journal on its behalf. This will solve the general case, take care
> of "rbd map" and also avoid introducing a hard dependency on the tool
> for mapping images.
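[Editor's aside: for the "check that there are no events to replay" part,
the journal is already inspectable from user space with the existing
`rbd journal` subcommands. A hedged sketch wrapping them (pool/image are
placeholders, and whether these checks are sufficient for the krbd case
is an open question, not something this thread settles):]

```shell
# Sketch only: dump an image's journal state so an operator can see
# whether any client has uncommitted entries pending replay. This is a
# debugging aid, not a krbd interface.
journal_state()
{
    local pool=$1
    local image=$2

    # Commit positions of all registered journal clients, including the
    # image's own "master" client.
    rbd journal status --pool "${pool}" --image "${image}"

    # Individual journal entries, for manual inspection.
    rbd journal inspect --pool "${pool}" --image "${image}" --verbose
}
```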
What would be the best mechanism for doing such things? Have a daemon
waiting in user space, listening for a netlink notification from krbd?
A udev notification with a user-space trigger? Or ...?

> Thanks,
>
>                 Ilya

-- 
Jason
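[Editor's aside: for the udev variant mentioned above, the wiring might
look roughly like the following. Everything here is hypothetical -- krbd
emits no such uevent property today, and rbd-journal-replay-helper is an
imagined binary that would open the image through librbd so that librbd
performs the replay while briefly holding the exclusive lock:]

```shell
# Hypothetical udev rule, printed via a heredoc for readability.
# RBD_NEED_REPLAY is an assumed uevent property; krbd does not set it.
cat <<'EOF'
# /etc/udev/rules.d/99-rbd-journal-replay.rules
ACTION=="change", SUBSYSTEM=="block", KERNEL=="rbd[0-9]*", \
    ENV{RBD_NEED_REPLAY}=="1", \
    RUN+="/usr/local/bin/rbd-journal-replay-helper %k"
EOF
```

Since udev already receives kernel uevents over netlink, this variant
avoids a dedicated daemon entirely: the user-space side reduces to one
rule plus a short-lived RUN helper.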