Re: [PATCH 00/11 V1] rbd journaling feature

On 12/05/2018 01:16 AM, Ilya Dryomov wrote:
On Tue, Dec 4, 2018 at 5:01 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
Hi all,
    This is V1 of the series implementing the journaling feature in kernel rbd, which makes mirroring in Kubernetes possible.
It passed /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change to the helpers, shown below:

```
[root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
index e019de5..9d00d3e 100755
--- a/qa/workunits/rbd/rbd_mirror_helpers.sh
+++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
@@ -854,9 +854,9 @@ write_image()

      test -n "${size}" || size=4096

-    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
-       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
-       --io-pattern rand
+    rbd --cluster ${cluster} -p ${pool} map ${image}
+    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --eta-newline 1
+    rbd --cluster ${cluster} -p ${pool} unmap ${image}
  }

  stress_write_image()
```

Changelog from RFC:
         1. error out if an unsupported event type is encountered during replay
So the journal is still replayed in the kernel.  Was there a design
discussion about this?

Like I said in one of my replies to the RFC, I think we should avoid
replaying the journal in the kernel and try to come up with a design
where it's done by librbd.
+1 to this. If "rbd [device] map" first replays the journal (by just
acquiring the exclusive lock via the API), then krbd would only need
to check that there are no events to replay. If there are one or more
events that it needs to replay for some reason, it implies that it
lost the exclusive lock to another client that changed the image *and*
failed to commit the entries to the journal. It seems reasonable to
just move the volume to R/O in that case, since something odd was
occurring.
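Jason's proposed "rbd map" flow can be modeled as a small decision function. This is a toy sketch, not the real krbd/librbd API; all names below are illustrative:

```python
# Toy model of the proposed "rbd map" flow: userspace first replays the
# journal by acquiring the exclusive lock through librbd, so the kernel
# only has to verify that no uncommitted events remain before mapping
# the image read-write.  Names are invented for illustration.

def map_image(journal_events, userspace_replay_ok):
    """Decide the access mode for "rbd map" under the proposed scheme."""
    if userspace_replay_ok:
        # librbd acquired the exclusive lock and replayed/committed
        # every outstanding event on krbd's behalf.
        journal_events = []
    if journal_events:
        # Events are still pending: a previous lock owner modified the
        # image but failed to commit its journal entries.  Mapping
        # read-only is the safe fallback suggested above.
        return "read-only"
    return "read-write"

print(map_image([], userspace_replay_ok=True))  # read-write
print(map_image([{"type": "SNAP_CREATE"}], userspace_replay_ok=False))  # read-only
```

The point of the model: with userspace replay done first, the kernel's check degenerates to "is the journal empty?", and a non-empty journal is itself evidence of a misbehaving previous lock owner.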
"rbd map" is easy -- we can fail it with a nice error message.  The
real issue is replay on reacquire.  Quoting myself:

   The fundamental problem with replaying the journal in the kernel and
   therefore supporting only a couple of event types is that the journal
   has to be replayed not only at "rbd map" time, but also every time the
   exclusive lock is reacquired.  Whereas we can safely error out at "rbd
   map" time, I don't see a sane way to handle a case where an unsupported
   event type is encountered after reacquiring the lock when the image is
   already mapped.

Consider the following scenario: the kernel gives up the lock for
creating a snapshot, librbd writes SNAP_CREATE event to the journal and
crashes.  Its watch times out, the kernel reacquires the lock but there
is an unsupported entry in the journal.  What should we do?
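The asymmetry between map time and reacquire time can be sketched as follows. Event names here are illustrative stand-ins, not the actual librbd journal event types:

```python
# Sketch of the failure mode described above: a kernel-side replayer that
# understands only the data-path events can fail cleanly at "rbd map" time,
# but the same unsupported entry found when reacquiring the lock on an
# already-mapped image leaves no good option.

KERNEL_REPLAYABLE = {"AIO_WRITE", "AIO_DISCARD", "AIO_FLUSH"}

class UnsupportedEvent(Exception):
    pass

def replay_in_kernel(journal):
    """Apply journal entries in order; raise on any event krbd cannot replay."""
    applied = []
    for event in journal:
        if event not in KERNEL_REPLAYABLE:
            raise UnsupportedEvent(event)
        applied.append(event)
    return applied

# librbd crashed right after writing SNAP_CREATE to the journal:
journal = ["AIO_WRITE", "SNAP_CREATE"]

# At "rbd map" time we can simply fail the map with a clear error ...
try:
    replay_in_kernel(journal)
except UnsupportedEvent as e:
    print(f"rbd: map failed: unsupported journal event {e}")

# ... but on reacquire the image is already mapped and in use, so neither
# erroring out nor silently degrading is a clean answer -- hence the idea
# of handing replay off to librbd instead.
```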

Marking the device read-only doesn't feel right.  I wouldn't want my
storage to freeze just because a maintenance pod went down.  I think we
need to look into making it so that the kernel can ask librbd to replay
the journal on its behalf.  This will solve the general case, take care
of "rbd map" and also avoid introducing a hard dependency on the tool
for mapping images.
What would be the best mechanism for this? A daemon waiting in user
space for a netlink notification from krbd, a udev notification with
a user-space trigger, or something else?
As for the communication mechanism, the existing watch-notify on either
the header object or the journal object seems like the natural choice.

As for the execution context, I'm not sure...  If the only reason to
take the performance hit of journaling is to have mirroring and thus
a running rbd-mirror daemon for sending out journal entries, could we
perhaps piggyback on it?
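The watch-notify handshake being floated could look roughly like the toy in-process model below. The message names and classes are invented for illustration; in reality this would be a rados notify on the image header or journal object, with rbd-mirror (or another daemon) as the watcher:

```python
# Toy model of krbd asking a userspace daemon to replay the journal on
# its behalf via watch-notify.  All names are hypothetical.

class HeaderObject:
    """Stand-in for the rados object both sides watch and notify on."""
    def __init__(self):
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def notify(self, message):
        # A rados notify collects an ack payload from every watcher.
        return [cb(message) for cb in self.watchers]

def rbd_mirror_watcher(message):
    if message == "REQUEST_JOURNAL_REPLAY":
        # The daemon replays and commits the journal through librbd,
        # then acks so krbd knows it is safe to proceed.
        return "REPLAY_COMPLETE"
    return None

header = HeaderObject()
header.watch(rbd_mirror_watcher)

# krbd, on reacquiring the lock with uncommitted events present, asks
# userspace to replay instead of attempting it itself.
print(header.notify("REQUEST_JOURNAL_REPLAY"))  # ['REPLAY_COMPLETE']
```

One design consequence: if the only watcher is rbd-mirror, a krbd image with journaling but no running mirror daemon would get no ack, and krbd would still need a fallback policy for that case.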

Thanx Ilya and Jason,
      Using librbd for the replay sounds like a good idea to me, but I am not
sure it is good to use the rbd-mirror daemon for it. Or should we introduce a
new, generic daemon to handle image requests from the kernel (or other
clients) in a watch-notify way?
      In my use case we always have an rbd-mirror daemon running when we need
journaling in krbd, so rbd-mirror sounds not bad to me. What's your opinion,
Jason? Do you think it is proper to do this work in the rbd-mirror daemon,
from the point of view of the design of rbd mirroring?

Thanx a lot.

Thanks,

                 Ilya





