On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>>> Hi Ilya,
>>>>>>>
>>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>>
>>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>>
>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>>>>> 819200 KB
>>>>>>>>>
>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>>>>> 782336 KB
>>>>>>>>
>>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>>> There are three cases:
>>>>>>>>
>>>>>>>>     discard range    | command
>>>>>>>>     -----------------------------------------
>>>>>>>>     whole object     | delete
>>>>>>>>     object's tail    | truncate
>>>>>>>>     object's head    | zero
>>>>>>>>
>>>>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>>> of the (first) object.
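As a rough illustration of the three cases in the table above, the following Python sketch splits a discard range at object boundaries and labels each piece. It is not taken from the RBD code: the function name `classify_discard` is hypothetical, the 4 MiB object size is the default mentioned above, and treating a range in the middle of an object as "zero" is an assumption, since the table only lists three cases.

```python
OBJECT_SIZE = 4 << 20  # 4 MiB, the default RADOS object size mentioned in the thread


def classify_discard(offset, length, object_size=OBJECT_SIZE):
    """Return a list of (object_index, action) for a discard of `length` bytes at `offset`.

    Hypothetical helper for illustration only; actions follow the table above:
    whole object -> delete, object's tail -> truncate, object's head -> zero.
    A range strictly inside an object is also labelled "zero" (an assumption).
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // object_size
    last = (end - 1) // object_size
    actions = []
    for idx in range(first, last + 1):
        obj_start = idx * object_size
        obj_end = obj_start + object_size
        lo = max(offset, obj_start)  # part of the discard that falls in this object
        hi = min(end, obj_end)
        if lo == obj_start and hi == obj_end:
            action = "delete"
        elif hi == obj_end:
            action = "truncate"
        else:
            action = "zero"
        actions.append((idx, action))
    return actions


if __name__ == "__main__":
    # The two blkdiscard ranges from the thread, then the one-sector tail discard.
    print(classify_discard(0, 4096000))
    print(classify_discard(0, 40960000))
    print(classify_discard((4 << 20) - 512, 512))
```

Under these assumptions, the first range touches only the head of object 0, while the second covers objects 0 through 8 entirely plus the head of object 9. That would be consistent with the drop from 819200 KB to 782336 KB in the rbd diff totals, which is 36864 KB, or 9 x 4096 KB.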
>>>>>>>>
>>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>>
>>>>>>>> Offset    Length   Type
>>>>>>>> 0         4194304  data
>>>>>>>>
>>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>>
>>>>>>>> Offset    Length   Type
>>>>>>>> 0         4193792  data
>>>>>>>
>>>>>>> Looks like ESXi is sending in each discard/unmap with the fixed
>>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There
>>>>>>> is a slight reduction in size via rbd diff method, but now I
>>>>>>> understand that actual truncate only takes effect when the discard
>>>>>>> happens to clip the tail of an image.
>>>>>>>
>>>>>>> So far looking at
>>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>>
>>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>>> and not their size. Which means that most of the ESXi discard
>>>>>>> commands will be disregarded by Ceph.
>>>>>>>
>>>>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>>>>>
>>>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>>> 1342099456, nr_sects 8192)
>>>>>>
>>>>>> Yes, correct. However, to make sure that VMware is not (erroneously) enforced to do this, you need to perform one more check.
>>>>>>
>>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct granularity and alignment (4M, I guess?)
>>>>>
>>>>> This seems to reflect the granularity (4194304), which matches the
>>>>> 8192 pages (8192 x 512 = 4194304). However, there is no alignment
>>>>> value.
>>>>>
>>>>> Can discard_alignment be specified with RBD?
>>>>
>>>> It's exported as a read-only sysfs attribute, just like
>>>> discard_granularity:
>>>>
>>>> # cat /sys/block/rbd0/discard_alignment
>>>> 4194304
>>>
>>> Ah thanks Ilya, it is indeed there.
>>> Vlad, your email says to look for
>>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>>> /sys/block/<device> - could this be the source of the issue?
>>
>> No. As you can see below, the alignment reported correctly. So, this must be VMware
>> issue, because it is ignoring the alignment parameter. You can try to align your VMware
>> partition on 4M boundary, it might help.
>
> Is this not a mismatch:
>
> - From sg_inq: Unmap granularity alignment: 8192
>
> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>
> I am compiling the latest SCST trunk now.

Scratch that, please. I just ran a test showing that the 4 MB value is
calculated correctly in sectors, so the two numbers are not a mismatch:
sg_inq reports 512-byte blocks, sysfs reports bytes.

- On iSCSI client node:

dd if=/dev/urandom of=/dev/sdf bs=1M count=800
blkdiscard -o 0 -l 4194304 /dev/sdf

- On iSCSI server node:

Aug 3 10:50:57 e1 kernel: [ 893.444538] [1381]:
vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)

(8192 * 512 = 4194304)

Now proceeding to test discard again with the latest SCST trunk build.

>
> Thanks,
> Alex
>
>>
>>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>>
>>> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
>>> VPD INQUIRY: Block limits page (SBC)
>>>   Maximum compare and write length: 255 blocks
>>>   Optimal transfer length granularity: 8 blocks
>>>   Maximum transfer length: 16384 blocks
>>>   Optimal transfer length: 1024 blocks
>>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>>   Maximum unmap LBA count: 8192
>>>   Maximum unmap block descriptor count: 4294967295
>>>   Optimal unmap granularity: 8192
>>>   Unmap granularity alignment valid: 1
>>>   Unmap granularity alignment: 8192
>>
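The unit comparison in this message (bytes in sysfs versus 512-byte blocks in sg_inq) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `bytes_to_sectors` and `is_aligned` are hypothetical, a 512-byte sector is assumed as in the 8192 * 512 arithmetic above, and the check mirrors neither SCST nor RBD code.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed by the 8192 * 512 = 4194304 arithmetic


def bytes_to_sectors(nbytes, sector_size=SECTOR_SIZE):
    """Convert a byte count (e.g. a sysfs discard_alignment value) to sectors."""
    if nbytes % sector_size:
        raise ValueError("byte count is not a whole number of sectors")
    return nbytes // sector_size


def is_aligned(start_sector, nr_sects, granularity_sectors):
    """True if a discard starts on, and spans a multiple of, the granularity."""
    return (start_sector % granularity_sectors == 0
            and nr_sects % granularity_sectors == 0)


if __name__ == "__main__":
    granularity = bytes_to_sectors(4194304)  # sysfs value quoted in the thread
    print(granularity)                       # same unit as sg_inq's "Unmap granularity alignment"
    # The blkdiscard test from this message: start_sector 0, nr_sects 8192.
    print(is_aligned(0, 8192, granularity))
    # The ESXi-originated discard from the earlier debug line.
    print(is_aligned(1342099456, 8192, granularity))
```

With these inputs the sysfs value converts to 8192 sectors, matching the sg_inq figure. The blkdiscard test range is aligned, while start_sector 1342099456 from the earlier ESXi debug line leaves a remainder of 4096 sectors when divided by 8192, so that request would straddle two 4M-aligned regions.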