On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> > On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >> Hi all,
> >>     This patchset implements the journaling feature in kernel rbd,
> >> which makes mirroring in Kubernetes possible.
> >>
> >> This is an RFC patchset, and it passed
> >> /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change,
> >> shown below:
> >>
> >> ```
> >> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> index e019de5..9d00d3e 100755
> >> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> @@ -854,9 +854,9 @@ write_image()
> >>
> >>      test -n "${size}" || size=4096
> >>
> >> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> >> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> >> -        --io-pattern rand
> >> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> >> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> >> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> >>  }
> >>
> >>  stress_write_image()
> >> ```
> >>
> >> That means this patchset is working well for mirroring data. There are
> >> some TODOs in comments, but most of them are about performance
> >> improvement.
> >>
> >> So I think it's a good time to ask all of you for comments.
> > Hi Dongsheng,
> >
> > First of all, thanks for taking this on!
> >
> > I'll skip over the minor technical issues and style nits and focus on
> > the three high-level things that immediately jumped at me:
>
> Great, that's what I want from the RFC round of review :)
>
> > 1. Have you thought about the consequences of supporting just WRITE and
> >    DISCARD journal entries in rbd_journal_replay()?  The fact that the
> >    kernel client would only ever emit those two types of entries doesn't
> >    prevent someone from attempting to map an image with a bunch of
> >    SNAP_CREATEs or other unsupported entries in the journal.
> >
> >    I think we should avoid replaying the journal in the kernel and try
> >    to come up with a design where that is done by librbd in userspace.
>
> In my opinion, we should prevent the user from mapping an image with
> unsupported events in its journal. Maybe return an error from
> start_replay() and fail the map operation. But start_replay() is void
> now, so I think we need an int return value for it.
>
> And at the same time, we can output a message suggesting the user do
> something before mapping, such as rbd journal reset.

"rbd journal reset" just discards all journal entries, doesn't it?
Suggesting something like that would be pretty bad...

> > 2. You mention some performance-related TODOs, but I don't see anything
> >    about extra data copies.  rbd_journal_append_write_event() copies
> >    the data payload four (!) times before handing it to the messenger:
> >
> >    - in rbd_journal_append_write_event(), from @bio to @data
> >    - in rbd_journal_append_write_event(), from @data to @buf
> >    - in ceph_journaler_object_append(), from @buf to another @buf
> >      (this time vmalloc'ed?)
> >    - in ceph_journaler_obj_write_sync(), from @buf to @pages
>
> Yes, there are several copies, and I really need to put this on my
> TODO list. Thanks for your suggestion.
>
> >    Since the data is required to be durable in the journal before any
> >    writes to the image objects are issued, not even reference counting
> >    is necessary, let alone copying.  The messenger should be getting it
> >    straight from the bio.
>
> Let me think about how to avoid these copies.
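
To make that concrete, the write path I have in mind looks roughly like
the sketch below.  ceph_journaler_append_bio() and the rbd_* helpers are
just stand-ins for whatever the patchset ends up calling these (and
rbd_dev->journal is assumed from the RFC code); the only point is the
shape of the data path: the bio goes to the journaler untouched, and the
image write is submitted from the journal-append completion.

```c
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/types.h>

/* struct rbd_device, struct rbd_img_request and the rbd_img_request_*
 * and ceph_journaler_* calls below are assumed from the patchset. */

struct rbd_journal_write_ctx {
        struct rbd_img_request *img_req;        /* pending image write */
};

/* Journal append completed: the entry is durable, issue the image write. */
static void rbd_journal_append_done(int result, void *data)
{
        struct rbd_journal_write_ctx *ctx = data;

        if (result)
                rbd_img_request_fail(ctx->img_req, result);     /* stand-in */
        else
                rbd_img_request_submit(ctx->img_req);           /* stand-in */
        kfree(ctx);
}

static int rbd_journal_append_write(struct rbd_device *rbd_dev,
                                    struct rbd_img_request *img_req,
                                    struct bio *bio, u64 off, u64 len)
{
        struct rbd_journal_write_ctx *ctx;

        ctx = kmalloc(sizeof(*ctx), GFP_NOIO);
        if (!ctx)
                return -ENOMEM;
        ctx->img_req = img_req;

        /*
         * Hand the bio itself to the journaler -- no bounce buffers.
         * bio_endio() can only run after the image write, and the image
         * write is only submitted from rbd_journal_append_done(), so the
         * bio stays valid for the whole journal append without any extra
         * copying or reference counting.
         */
        return ceph_journaler_append_bio(rbd_dev->journal, off, len, bio,
                                         rbd_journal_append_done, ctx);
}
```

If the journal append fails, the image objects never see the write, which
is exactly the ordering the mirroring design requires.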

> > 3. There is a mutex held across network calls in journaler.c.  This is
> >    the case with some of the existing code as well (exclusive lock and
> >    refresh) and it is a huge pain.  Cancelling is almost impossible and
> >    as I know you know this is what is blocking "rbd unmap -o full-force"
> >    and rbd-level timeouts.  No new such code will be accepted upstream.
>
> Ah, yes, that's what I was concerned about. BTW, Ilya, do you already
> have a timeline for the rbd-level timeout feature?

I made some headway on refactoring exclusive lock code, but it's nowhere
close to completion yet.

Thanks,

                Ilya
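
P.S.  For reference, the int-returning start_replay() you describe under
point 1 could look something like the sketch below -- all of the type and
constant names here are invented for illustration, they are not from the
patchset:

```c
#include <linux/errno.h>
#include <linux/list.h>

/* All names below are illustrative only, not from the RFC patchset. */
enum rbd_journal_event_type {
        RBD_JOURNAL_EVENT_WRITE,
        RBD_JOURNAL_EVENT_DISCARD,
        RBD_JOURNAL_EVENT_SNAP_CREATE,
        /* ... everything librbd can emit ... */
};

struct rbd_journal_entry {
        struct list_head node;
        enum rbd_journal_event_type event_type;
        /* payload omitted */
};

struct rbd_journal {
        struct list_head entries;       /* entries already fetched */
};

static int rbd_journal_check_entry(const struct rbd_journal_entry *entry)
{
        switch (entry->event_type) {
        case RBD_JOURNAL_EVENT_WRITE:
        case RBD_JOURNAL_EVENT_DISCARD:
                return 0;
        default:
                /* e.g. SNAP_CREATE -- the kernel client can't replay it */
                return -EOPNOTSUPP;
        }
}

/*
 * start_replay() returning int instead of void: refuse to replay (and
 * therefore fail the map) if the journal holds any entry type the kernel
 * client doesn't understand.
 */
static int rbd_journal_start_replay(struct rbd_journal *journal)
{
        struct rbd_journal_entry *entry;
        int ret;

        list_for_each_entry(entry, &journal->entries, node) {
                ret = rbd_journal_check_entry(entry);
                if (ret)
                        return ret;     /* propagate, fail the map */
        }

        /* ... replay WRITE/DISCARD entries as before ... */
        return 0;
}
```

Whether the kernel should be replaying at all is a separate question (see
above), but at least a map against such a journal would fail cleanly
instead of proceeding with entries it can't handle.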