Re: Ceph RBD object-map and discard in VM

Vaibhav Bhembre <vaibhav@xxxxxxxxxxxxxxxx> · Mon, 18 Jul 2016 10:41:58 -0400

Updated the issue with zipped copies of raw LTTng files. Thanks for 
taking a look!

I will also look at fixing the linking issue on librados/ceph-osd side 
and send a PR up.

On 07/18, Jason Dillaman wrote:
Any chance you can zip up the raw LTTng-UST files and attach them to
the ticket? It appears that the rbd-replay-prep tool doesn't record or
translate discard events.

The change sounds good to me -- but it would also need to be made in
librados and ceph-osd since I'm sure they would have the same issue.

On Sat, Jul 16, 2016 at 8:48 PM, Vaibhav Bhembre
<vaibhav@xxxxxxxxxxxxxxxx> wrote:
I was finally able to complete the trace. Along with enabling
*rbd_tracing = true* as you advised, I had to symlink *librbd_tp.so* to
point to *librbd_tp.so.1*. Since the SONAME of the library includes the
version number, I think we might need to update the name in the place
where librbd references it.

https://github.com/ceph/ceph/blob/master/src/librbd/librbd.cc#L58
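
The symlink workaround described above suggests that the loader asks only for the unversioned name. As an illustration of the alternative, here is a minimal Python sketch that tries the versioned SONAME first and falls back to the bare name. It is not the librbd code (that is C++), and the function name `load_first_available` and its injectable `loader` argument are hypothetical, added so the fallback logic can be exercised without the library installed.

```python
import ctypes


def load_first_available(candidates, loader=ctypes.CDLL):
    """Return (name, handle) for the first library name that loads.

    `candidates` is an ordered list of names to try, for example the
    versioned SONAME followed by the unversioned development symlink.
    `loader` is hypothetical plumbing for testing; by default it is
    ctypes.CDLL, which raises OSError when the library cannot be loaded.
    """
    errors = []
    for name in candidates:
        try:
            return name, loader(name)
        except OSError as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise OSError("none of the candidates could be loaded: " + "; ".join(errors))
```

Calling `load_first_available(["librbd_tp.so.1", "librbd_tp.so"])` on the hypervisor would then succeed whether or not the unversioned symlink exists, which is the behavior the proposed fix is after.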

I have uploaded the traces onto the tracker. Please let me know if there
is anything more I can provide.

Meanwhile, I can also push a fix for the issue with empty traces on
Ubuntu/Debian if you think that change should be fine.

Thanks!

On 07/15, Vaibhav Bhembre wrote:
I enabled rbd_tracing on the HV and restarted the guest so that it would pick
up the new configuration. The change in the value of *rbd_tracing* was
confirmed from the admin socket. I am still unable to see any trace.

lsof -p <vm-process-id> does not show *librbd_tp.so* loaded despite multiple
restarts.  Only *librbd.so* seems to be loaded.
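
The same check can be scripted against /proc/<pid>/maps on Linux, which lists the files mapped into a process. The following is a rough Python sketch; the helper names `mapped_libraries` and `tracing_module_loaded` are made up for this example, and the parsing assumes the usual six-column layout of the maps file.

```python
import os


def mapped_libraries(maps_text):
    """Return the set of shared-object basenames found in /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            base = os.path.basename(fields[5].strip())
            if ".so" in base:
                libs.add(base)
    return libs


def tracing_module_loaded(maps_text, prefix="librbd_tp.so"):
    """True if any mapped library name starts with the tracing module name."""
    return any(name.startswith(prefix) for name in mapped_libraries(maps_text))
```

For a running guest this could be used as `tracing_module_loaded(open(f"/proc/{pid}/maps").read())`, where `pid` is the process id of the VM.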

No oddities in kern.log are observed.

Let me know if I can provide any other information. Thanks!

On 07/15, Jason Dillaman wrote:
>There appears to be a hole in the documentation.  You now have to set
>a configuration option to enable tracing:
>
>rbd_tracing = true
>
>This will cause librbd.so to dynamically load the tracing module
>librbd_tp.so (which has linkage to LTTng-UST).
>
>On Fri, Jul 15, 2016 at 1:47 PM, Vaibhav Bhembre
><vaibhav@xxxxxxxxxxxxxxxx> wrote:
>>I followed the steps mentioned in [1] but somehow I am unable to see any
>>traces to continue with its step 2. There are no errors seen when performing
>>operations mentioned in step 1. In my setup I am running lttng commands on
>>the HV where my VM has the RBD device attached.
>>
>>My lttng version is as follows:
>>
>>$ lttng --version
>>lttng (LTTng Trace Control) 2.4.0 - Époque Opaque
>>$ lttng-sessiond --version
>>2.4.0
>>
>>My uname -a output is as follows:
>>Linux infra1node71 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>>The kern.log is clear of any apparmor denials as well.
>>
>>Would I need to have my librbd linked with lttng-ust by any chance? I don't
>>see it linked as seen below:
>>
>>$ ldd /usr/lib/librbd.so.1.0.0 | grep lttng
>>$
>>
>>Any idea what I might be missing here to get lttng running successfully?
>>
>>[1] http://docs.ceph.com/docs/master/rbd/rbd-replay/
>>
>>
>>On 07/14, Jason Dillaman wrote:
>>>
>>>I would probably be able to resolve the issue fairly quickly if it
>>>would be possible for you to provide an RBD replay trace from a slow
>>>and a fast mkfs.xfs test run and attach it to the tracker ticket I just
>>>opened for this issue [1]. You can follow the instructions here [2]
>>>but would only need to perform steps 1 and 2 (attaching the output from
>>>step 2 to the ticket).
>>>
>>>Thanks,
>>>
>>>[1] http://tracker.ceph.com/issues/16689
>>>[2] http://docs.ceph.com/docs/master/rbd/rbd-replay/
>>>
>>>On Thu, Jul 14, 2016 at 2:55 PM, Vaibhav Bhembre
>>><vaibhav@xxxxxxxxxxxxxxxx> wrote:
>>>>
>>>>We have been observing similar behavior. It usually happens when we
>>>>create a new rbd image, expose it to the guest and perform any
>>>>operation that issues discard to the device.
>>>>
>>>>A typical command that's first run on a given device is mkfs, usually
>>>>with discard on.
>>>>
>>>># time mkfs.xfs -s size=4096 -f /dev/sda
>>>>meta-data=/dev/sda               isize=256    agcount=4, agsize=6553600 blks
>>>>         =                       sectsz=4096  attr=2, projid32bit=0
>>>>data     =                       bsize=4096   blocks=26214400, imaxpct=25
>>>>         =                       sunit=0      swidth=0 blks
>>>>naming   =version 2              bsize=4096   ascii-ci=0
>>>>log      =internal log           bsize=4096   blocks=12800, version=2
>>>>         =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>>>realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>>
>>>>real 9m10.882s
>>>>user 0m0.000s
>>>>sys 0m0.012s
>>>>
>>>>When we issue the same command with the object-map feature disabled on
>>>>the image, it completes much faster.
>>>>
>>>># time mkfs.xfs -s size=4096 -f /dev/sda
>>>>meta-data=/dev/sda               isize=256    agcount=4, agsize=6553600 blks
>>>>         =                       sectsz=4096  attr=2, projid32bit=0
>>>>data     =                       bsize=4096   blocks=26214400, imaxpct=25
>>>>         =                       sunit=0      swidth=0 blks
>>>>naming   =version 2              bsize=4096   ascii-ci=0
>>>>log      =internal log           bsize=4096   blocks=12800, version=2
>>>>         =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>>>realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>>
>>>>real 0m1.780s
>>>>user 0m0.000s
>>>>sys 0m0.012s
>>>>
>>>>Also, from what I am seeing, the slowness seems to be proportional to
>>>>the size of the image rather than the amount of data written into it.
>>>>Issuing mkfs without discard doesn't reproduce this issue. The above
>>>>values were for a 100G rbd image. A 250G image takes about 2.5 times
>>>>as long as the 100G one (22m58s versus 9m10s).
>>>>
>>>># time mkfs.xfs -s size=4096 -f /dev/sda
>>>>meta-data=/dev/sda               isize=256    agcount=4, agsize=16384000 blks
>>>>         =                       sectsz=4096  attr=2, projid32bit=0
>>>>data     =                       bsize=4096   blocks=65536000, imaxpct=25
>>>>         =                       sunit=0      swidth=0 blks
>>>>naming   =version 2              bsize=4096   ascii-ci=0
>>>>log      =internal log           bsize=4096   blocks=32000, version=2
>>>>         =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>>>realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>>
>>>>real 22m58.076s
>>>>user 0m0.000s
>>>>sys 0m0.024s
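
The proportionality can be checked from the two "real" values quoted above. Below is a small Python sketch of that arithmetic. It assumes the sizes are 100 GiB and 250 GiB and that the images use 4 MiB objects (order 22), as in the original report further down; neither is stated for these particular images, and the helper names are invented for the example.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '9m10.882s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Image size in GiB -> "real" time of mkfs.xfs with object-map enabled.
RUNS = {100: "9m10.882s", 250: "22m58.076s"}

# Assumption: 4 MiB objects, as in the original report in this thread.
OBJECT_SIZE_MIB = 4


def seconds_per_object(size_gib, real):
    """Elapsed time divided by the number of objects backing the image."""
    objects = size_gib * 1024 // OBJECT_SIZE_MIB
    return parse_real_time(real) / objects


for size_gib, real in RUNS.items():
    print(f"{size_gib}G: {parse_real_time(real):.3f} s total, "
          f"{seconds_per_object(size_gib, real) * 1000:.1f} ms per object")

time_ratio = parse_real_time(RUNS[250]) / parse_real_time(RUNS[100])
print(f"time ratio {time_ratio:.2f} vs size ratio {250 / 100:.2f}")
```

Under those assumptions both runs work out to roughly 21.5 ms per object, which is consistent with the cost scaling with image size rather than with the data written.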
>>>>
>>>>Let me know if you need any more information regarding this. We would
>>>>like to enable object-map (and fast-diff) on our images once this gets
>>>>resolved.
>>>>
>>>>
>>>>On Wed, Jun 22, 2016 at 5:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx>
>>>>wrote:
>>>>>
>>>>>
>>>>>I'm not sure why I never received the original list email, so I
>>>>>apologize for the delay. Is /dev/sda1, from your example, fresh with
>>>>>no data to actually discard or does it actually have lots of data to
>>>>>discard?
>>>>>
>>>>>Thanks,
>>>>>
>>>>>On Wed, Jun 22, 2016 at 1:56 PM, Brian Andrus <bandrus@xxxxxxxxxx>
>>>>>wrote:
>>>>>> I've created a downstream bug for this same issue.
>>>>>>
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1349116
>>>>>>
>>>>>> On Wed, Jun 15, 2016 at 6:23 AM, <list@xxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hello guys,
>>>>>>>
>>>>>>> We are currently testing Ceph Jewel with object-map feature enabled:
>>>>>>>
>>>>>>> rbd image 'disk-22920':
>>>>>>>         size 102400 MB in 25600 objects
>>>>>>>         order 22 (4096 kB objects)
>>>>>>>         block_name_prefix: rbd_data.7cfa2238e1f29
>>>>>>>         format: 2
>>>>>>>         features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>>>>>>>         flags:
>>>>>>>
>>>>>>> We use this RBD as the disk for a KVM virtual machine with virtio-scsi
>>>>>>> and discard=unmap. We noticed the following parameters in /sys/block:
>>>>>>>
>>>>>>> # cat /sys/block/sda/queue/discard_*
>>>>>>> 4096
>>>>>>> 1073741824
>>>>>>> 0 <- discard_zeroes_data
>>>>>>>
>>>>>>> While running mkfs.ext4 on the disk in the VM we noticed low
>>>>>>> performance when using discard.
>>>>>>>
>>>>>>> mkfs.ext4 -E nodiscard /dev/sda1 - took 5 seconds to complete
>>>>>>> mkfs.ext4 -E discard /dev/sda1 - took around 3 minutes
>>>>>>>
>>>>>>> With the object-map disabled, the mkfs with discard took just 5
>>>>>>> seconds.
>>>>>>>
>>>>>>> Do you have any idea what might cause this issue?
>>>>>>>
>>>>>>> Kernel: 4.2.0-35-generic #40~14.04.1-Ubuntu
>>>>>>> Ceph: 10.2.0
>>>>>>> Libvirt: 1.3.1
>>>>>>> QEMU: 2.5.0
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Jonas
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Brian Andrus
>>>>>> Red Hat, Inc.
>>>>>> Storage Consultant, Global Storage Practice
>>>>>> Mobile +1 (530) 903-8487
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Jason
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>--
>>>Jason
>>
>>
>>--
>>Vaibhav Bhembre
>
>
>
>--
>Jason

--
Vaibhav Bhembre

--
Jason

--
Vaibhav Bhembre