Re: [RFC PATCH 0/4] rbd journaling feature

Ilya Dryomov <idryomov@xxxxxxxxx> · Thu, 16 Aug 2018 15:45:32 +0200

On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> Hi all,
>    This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>
> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
> as below:
>
> ```
> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> index e019de5..9d00d3e 100755
> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> @@ -854,9 +854,9 @@ write_image()
>
>      test -n "${size}" || size=4096
>
> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> -       --io-pattern rand
> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>  }
>
>  stress_write_image()
> ```
>
> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
> of them are about performance improvement.
>
> So I think it's a good timing to ask for comments from all of you guys.

Hi Dongsheng,

First of all, thanks for taking this on!

I'll skip over the minor technical issues and style nits and focus on
the three high-level things that immediately jumped at me:

1. Have you thought about the consequences of supporting just WRITE and
   DISCARD journal entries in rbd_journal_replay()?  The fact that the
   kernel client would only ever emit those two types of entries doesn't
   prevent someone from attempting to map an image with a bunch of
   SNAP_CREATEs or other unsupported entries in the journal.

   I think we should avoid replaying the journal in the kernel and try
   to come up with a design where that is done by librbd in userspace.

2. You mention some performance-related TODOs, but I don't see anything
   about extra data copies.  rbd_journal_append_write_event() copies
   the data payload four (!) times before handing it to the messenger:

   - in rbd_journal_append_write_event(), from @bio to @data
   - in rbd_journal_append_write_event(), from @data to @buf
   - in ceph_journaler_object_append(), from @buf to another @buf
     (this time vmalloc'ed?)
   - in ceph_journaler_obj_write_sync(), from @buf to @pages

   Since the data is required to be durable in the journal before any
   writes to the image objects are issued, not even reference counting
   is necessary, let alone copying.  The messenger should be getting it
   straight from the bio.

3. There is a mutex held across network calls in journaler.c.  This is
   the case with some of the existing code as well (exclusive lock and
   refresh) and it is a huge pain.  Cancelling is almost impossible and
   as I know you know this is what is blocking "rbd unmap -o full-force"
   and rbd-level timeouts.  No new such code will be accepted upstream.

Thanks,

                Ilya