On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> > On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >> Hi all,
> >>     This patchset implements the journaling feature in kernel rbd,
> >> which makes mirroring in Kubernetes possible.
> >>
> >> This is an RFC patchset, and it passed
> >> /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change,
> >> shown below:
> >>
> >> ```
> >> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> index e019de5..9d00d3e 100755
> >> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> @@ -854,9 +854,9 @@ write_image()
> >>
> >>      test -n "${size}" || size=4096
> >>
> >> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> >> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> >> -        --io-pattern rand
> >> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> >> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> >> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> >>  }
> >>
> >>  stress_write_image()
> >> ```
> >>
> >> That means this patchset is working well for mirroring data. There are
> >> some TODOs in comments, but most of them are about performance
> >> improvement.
> >>
> >> So I think it's a good time to ask all of you for comments.
> > Hi Dongsheng,
> >
> > First of all, thanks for taking this on!
> >
> > I'll skip over the minor technical issues and style nits and focus on
> > the three high-level things that immediately jumped at me:
>
> Great, that's what I want from the RFC round of review :)
>
> > 1. Have you thought about the consequences of supporting just WRITE and
> >    DISCARD journal entries in rbd_journal_replay()?  The fact that the
> >    kernel client would only ever emit those two types of entries doesn't
> >    prevent someone from attempting to map an image with a bunch of
> >    SNAP_CREATEs or other unsupported entries in the journal.
> >
> >    I think we should avoid replaying the journal in the kernel and try
> >    to come up with a design where that is done by librbd in userspace.
>
> In my opinion, we should prevent the user from mapping an image with
> unsupported events in its journal. Maybe return an error from
> start_replay() and fail the map operation. But start_replay() is void
> now, so I think we need an int return value for it.
>
> And at the same time, we can output a message suggesting the user do
> something before mapping, such as rbd journal reset.

"rbd journal reset" just discards all journal entries, doesn't it?
Suggesting something like that would be pretty bad...

> > 2. You mention some performance-related TODOs, but I don't see anything
> >    about extra data copies.  rbd_journal_append_write_event() copies
> >    the data payload four (!) times before handing it to the messenger:
> >
> >    - in rbd_journal_append_write_event(), from @bio to @data
> >    - in rbd_journal_append_write_event(), from @data to @buf
> >    - in ceph_journaler_object_append(), from @buf to another @buf
> >      (this time vmalloc'ed?)
> >    - in ceph_journaler_obj_write_sync(), from @buf to @pages
>
> Yes, there are several copies, and I really need to put this on my
> TODO list. Thanks for your suggestion.
>
> >    Since the data is required to be durable in the journal before any
> >    writes to the image objects are issued, not even reference counting
> >    is necessary, let alone copying.  The messenger should be getting it
> >    straight from the bio.
>
> Let me think about how to avoid these copies.
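
To make that concrete, the write path I have in mind looks roughly like
the sketch below.  ceph_journaler_append_bio() and the rbd_* helpers are
just stand-ins for whatever the patchset ends up calling these (and
rbd_dev->journal is assumed from the RFC code); the only point is the
shape of the data path: the bio goes to the journaler untouched, and the
image write is submitted from the journal-append completion.

```c
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/types.h>

/* struct rbd_device, struct rbd_img_request and the rbd_img_request_*
 * and ceph_journaler_* calls below are assumed from the patchset. */

struct rbd_journal_write_ctx {
        struct rbd_img_request *img_req;        /* pending image write */
};

/* Journal append completed: the entry is durable, issue the image write. */
static void rbd_journal_append_done(int result, void *data)
{
        struct rbd_journal_write_ctx *ctx = data;

        if (result)
                rbd_img_request_fail(ctx->img_req, result);     /* stand-in */
        else
                rbd_img_request_submit(ctx->img_req);           /* stand-in */
        kfree(ctx);
}

static int rbd_journal_append_write(struct rbd_device *rbd_dev,
                                    struct rbd_img_request *img_req,
                                    struct bio *bio, u64 off, u64 len)
{
        struct rbd_journal_write_ctx *ctx;

        ctx = kmalloc(sizeof(*ctx), GFP_NOIO);
        if (!ctx)
                return -ENOMEM;
        ctx->img_req = img_req;

        /*
         * Hand the bio itself to the journaler -- no bounce buffers.
         * bio_endio() can only run after the image write, and the image
         * write is only submitted from rbd_journal_append_done(), so the
         * bio stays valid for the whole journal append without any extra
         * copying or reference counting.
         */
        return ceph_journaler_append_bio(rbd_dev->journal, off, len, bio,
                                         rbd_journal_append_done, ctx);
}
```

If the journal append fails, the image objects never see the write, which
is exactly the ordering the mirroring design requires.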

> > 3. There is a mutex held across network calls in journaler.c.  This is
> >    the case with some of the existing code as well (exclusive lock and
> >    refresh) and it is a huge pain.  Cancelling is almost impossible and
> >    as I know you know this is what is blocking "rbd unmap -o full-force"
> >    and rbd-level timeouts.  No new such code will be accepted upstream.
>
> Ah, yes, that's what I was concerned about. BTW, Ilya, do you already
> have a timeline for the rbd-level timeout feature?

I made some headway on refactoring exclusive lock code, but it's nowhere
close to completion yet.

Thanks,

                Ilya
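
P.S.  For reference, the int-returning start_replay() you describe under
point 1 could look something like the sketch below -- all of the type and
constant names here are invented for illustration, they are not from the
patchset:

```c
#include <linux/errno.h>
#include <linux/list.h>

/* All names below are illustrative only, not from the RFC patchset. */
enum rbd_journal_event_type {
        RBD_JOURNAL_EVENT_WRITE,
        RBD_JOURNAL_EVENT_DISCARD,
        RBD_JOURNAL_EVENT_SNAP_CREATE,
        /* ... everything librbd can emit ... */
};

struct rbd_journal_entry {
        struct list_head node;
        enum rbd_journal_event_type event_type;
        /* payload omitted */
};

struct rbd_journal {
        struct list_head entries;       /* entries already fetched */
};

static int rbd_journal_check_entry(const struct rbd_journal_entry *entry)
{
        switch (entry->event_type) {
        case RBD_JOURNAL_EVENT_WRITE:
        case RBD_JOURNAL_EVENT_DISCARD:
                return 0;
        default:
                /* e.g. SNAP_CREATE -- the kernel client can't replay it */
                return -EOPNOTSUPP;
        }
}

/*
 * start_replay() returning int instead of void: refuse to replay (and
 * therefore fail the map) if the journal holds any entry type the kernel
 * client doesn't understand.
 */
static int rbd_journal_start_replay(struct rbd_journal *journal)
{
        struct rbd_journal_entry *entry;
        int ret;

        list_for_each_entry(entry, &journal->entries, node) {
                ret = rbd_journal_check_entry(entry);
                if (ret)
                        return ret;     /* propagate, fail the map */
        }

        /* ... replay WRITE/DISCARD entries as before ... */
        return 0;
}
```

Whether the kernel should be replaying at all is a separate question (see
above), but at least a map against such a journal would fail cleanly
instead of proceeding with entries it can't handle.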