On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >
> > On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> > > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > This is V1 of the journaling feature in kernel rbd, which makes
> > > > mirroring in kubernetes possible. It passed
> > > > /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change,
> > > > as below:
> > > >
> > > > ```
> > > > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > index e019de5..9d00d3e 100755
> > > > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > @@ -854,9 +854,9 @@ write_image()
> > > >
> > > >      test -n "${size}" || size=4096
> > > >
> > > > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > > > -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > > > -        --io-pattern rand
> > > > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > > > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --eta-newline 1
> > > > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> > > >  }
> > > >
> > > >  stress_write_image()
> > > > ```
> > > >
> > > > Changelog from RFC:
> > > >   1. error out if there is an unsupported event type in replaying
> > >
> > > So the journal is still replayed in the kernel. Was there a design
> > > discussion about this?
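[Editor's aside: the test hunk above hardcodes /dev/rbd0, which breaks as
soon as any other image is mapped, and maps by image name but unmaps the
same way rather than by device. A minimal sketch of the same helper that
captures the device path printed by "rbd map" instead (variable names
follow rbd_mirror_helpers.sh; this is not the committed change):]

```shell
# Sketch of the patched write_image() helper, assuming "rbd map" prints
# the newly created device node (e.g. /dev/rbd1) on stdout. Capturing it
# avoids the hardcoded /dev/rbd0 in the original hunk.
write_image()
{
    local cluster=$1
    local pool=$2
    local image=$3
    local count=$4
    local size=$5
    local dev

    test -n "${size}" || size=4096

    dev=$(rbd --cluster "${cluster}" -p "${pool}" map "${image}")
    fio --name=test --rw=randwrite --bs="${size}" --runtime=60 \
        --ioengine=libaio --iodepth=1 --numjobs=1 --filename="${dev}" \
        --direct=1 --group_reporting --size=$((size * count)) \
        --eta-newline=1
    # Unmap by device path, so the right mapping goes away even if the
    # image is mapped more than once.
    rbd --cluster "${cluster}" -p "${pool}" unmap "${dev}"
}
```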
> > >
> > > Like I said in one of my replies to the RFC, I think we should avoid
> > > replaying the journal in the kernel and try to come up with a design
> > > where it's done by librbd.
> >
> > +1 to this. If "rbd [device] map" first replays the journal (by just
> > acquiring the exclusive lock via the API), then krbd would only need
> > to check that there are no events to replay. If there are one or more
> > events that it needs to replay for some reason, it implies that it
> > lost the exclusive lock to another client that changed the image *and*
> > failed to commit the entries to the journal. It seems reasonable to
> > just move the volume to R/O in that case, since something odd was
> > occurring.
>
> "rbd map" is easy -- we can fail it with a nice error message. The
> real issue is replay on reacquire. Quoting myself:
>
>     The fundamental problem with replaying the journal in the kernel,
>     and therefore supporting only a couple of event types, is that the
>     journal has to be replayed not only at "rbd map" time, but also
>     every time the exclusive lock is reacquired. Whereas we can safely
>     error out at "rbd map" time, I don't see a sane way to handle a
>     case where an unsupported event type is encountered after
>     reacquiring the lock when the image is already mapped.
>
> Consider the following scenario: the kernel gives up the lock for
> creating a snapshot, librbd writes the SNAP_CREATE event to the journal
> and crashes. Its watch times out and the kernel reacquires the lock,
> but there is an unsupported entry in the journal. What should we do?
>
> Marking the device read-only doesn't feel right. I wouldn't want my
> storage to freeze just because a maintenance pod went down. I think we
> need to look into making it so that the kernel can ask librbd to replay
> the journal on its behalf. This will solve the general case, take care
> of "rbd map" and also avoid introducing a hard dependency on the tool
> for mapping images.
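[Editor's aside: for the "check that there are no events to replay" part,
the journal is already inspectable from user space with the existing
`rbd journal` subcommands. A hedged sketch wrapping them (pool/image are
placeholders, and whether these checks are sufficient for the krbd case
is an open question, not something this thread settles):]

```shell
# Sketch only: dump an image's journal state so an operator can see
# whether any client has uncommitted entries pending replay. This is a
# debugging aid, not a krbd interface.
journal_state()
{
    local pool=$1
    local image=$2

    # Commit positions of all registered journal clients, including the
    # image's own "master" client.
    rbd journal status --pool "${pool}" --image "${image}"

    # Individual journal entries, for manual inspection.
    rbd journal inspect --pool "${pool}" --image "${image}" --verbose
}
```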
What would be the best mechanism for doing such things? Have a daemon
waiting in user space, listening for a netlink notification from krbd?
A udev notification with a user-space trigger? Or ...?

> Thanks,
>
>                 Ilya

-- 
Jason
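[Editor's aside: for the udev variant mentioned above, the wiring might
look roughly like the following. Everything here is hypothetical -- krbd
emits no such uevent property today, and rbd-journal-replay-helper is an
imagined binary that would open the image through librbd so that librbd
performs the replay while briefly holding the exclusive lock:]

```shell
# Hypothetical udev rule, printed via a heredoc for readability.
# RBD_NEED_REPLAY is an assumed uevent property; krbd does not set it.
cat <<'EOF'
# /etc/udev/rules.d/99-rbd-journal-replay.rules
ACTION=="change", SUBSYSTEM=="block", KERNEL=="rbd[0-9]*", \
    ENV{RBD_NEED_REPLAY}=="1", \
    RUN+="/usr/local/bin/rbd-journal-replay-helper %k"
EOF
```

Since udev already receives kernel uevents over netlink, this variant
avoids a dedicated daemon entirely: the user-space side reduces to one
rule plus a short-lived RUN helper.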