On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>> Hi Ilya,
>>>>>>
>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>
>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>
>>>>>> <snip>
>>>>>>>>
>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>> 819200 KB
>>>>>>>>
>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>> 782336 KB
>>>>>>>
>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>> There are three cases:
>>>>>>>
>>>>>>>   discard range  | command
>>>>>>>   -------------------------
>>>>>>>   whole object   | delete
>>>>>>>   object's tail  | truncate
>>>>>>>   object's head  | zero
>>>>>>>
>>>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>> of the (first) object.
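The three-case table above can be sketched as follows — a rough illustration assuming the default 4M object size, not the actual krbd/librbd code:

```python
# Rough sketch: how a discard range maps onto per-object RADOS ops
# (delete / truncate / zero), per the table above. Illustration only;
# assumes the default 4 MiB object size and no striping.

OBJ = 4 << 20  # default RADOS object size: 4 MiB


def classify(offset, length):
    """Yield (object_index, op) for each object the discard range touches."""
    end = offset + length
    first, last = offset // OBJ, (end - 1) // OBJ
    for i in range(first, last + 1):
        obj_start, obj_end = i * OBJ, (i + 1) * OBJ
        lo, hi = max(offset, obj_start), min(end, obj_end)
        if lo == obj_start and hi == obj_end:
            yield i, "delete"    # whole object: frees space
        elif hi == obj_end:
            yield i, "truncate"  # object's tail: frees space
        else:
            yield i, "zero"      # head (or middle): space NOT freed
```

For the 40960000-byte discard in the example above, this yields delete for objects 0-8 (9 x 4 MiB = 36864 KB, matching the drop from 819200 KB to 782336 KB) and zero for the head of object 9; the 4096000-byte discard touches only the head of object 0, so nothing is freed.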
>>>>>>>
>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>
>>>>>>>   Offset  Length   Type
>>>>>>>   0       4194304  data
>>>>>>>
>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>
>>>>>>>   Offset  Length   Type
>>>>>>>   0       4193792  data
>>>>>>
>>>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There
>>>>>> is a slight reduction in size via the rbd diff method, but now I
>>>>>> understand that an actual truncate only takes effect when the discard
>>>>>> happens to clip the tail of an object.
>>>>>>
>>>>>> So far, looking at
>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>
>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>> and not their size. Which means that most of the ESXi discard
>>>>>> commands will be disregarded by Ceph.
>>>>>>
>>>>>> Vlad, is the 8192 sectors coming from ESXi, as in the debug:
>>>>>>
>>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>> 1342099456, nr_sects 8192)
>>>>>
>>>>> Yes, correct. However, to make sure that VMware is not (erroneously)
>>>>> forced to do this, you need to perform one more check.
>>>>>
>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
>>>>> correct granularity and alignment here (4M, I guess?)
>>>>
>>>> This seems to reflect the granularity (4194304), which matches the
>>>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>>>> value.
>>>>
>>>> Can discard_alignment be specified with RBD?
>>>
>>> It's exported as a read-only sysfs attribute, just like
>>> discard_granularity:
>>>
>>> # cat /sys/block/rbd0/discard_alignment
>>> 4194304
>>
>> Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>> /sys/block/<device> - could this be the source of the issue?
>
> No.
> As you can see below, the alignment is reported correctly. So this must be a VMware
> issue, because it is ignoring the alignment parameter. You can try to align your
> VMware partition on a 4M boundary; it might help.

Is this not a mismatch:

- From sg_inq: Unmap granularity alignment: 8192
- From "cat /sys/block/rbd0/discard_alignment": 4194304

I am compiling the latest SCST trunk now.

Thanks,
Alex

>
>> Here is what I get querying the iSCSI-exported RBD device on Linux:
>>
>> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
>> VPD INQUIRY: Block limits page (SBC)
>>   Maximum compare and write length: 255 blocks
>>   Optimal transfer length granularity: 8 blocks
>>   Maximum transfer length: 16384 blocks
>>   Optimal transfer length: 1024 blocks
>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>   Maximum unmap LBA count: 8192
>>   Maximum unmap block descriptor count: 4294967295
>>   Optimal unmap granularity: 8192
>>   Unmap granularity alignment valid: 1
>>   Unmap granularity alignment: 8192

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
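On the apparent mismatch above, a quick unit check (a sketch, assuming the usual 512-byte logical block size) suggests the two values may actually agree: the sg_inq VPD page 0xB0 unmap fields are reported in logical blocks, whereas the sysfs discard_* attributes are in bytes:

```python
# Unit check on the two "alignment" numbers above. Assumption: the
# device uses 512-byte logical blocks. sg_inq (SCSI Block Limits VPD,
# page 0xB0) reports unmap granularity/alignment in logical blocks;
# /sys/block/<dev>/discard_alignment is in bytes.
BLOCK = 512                      # logical block size in bytes (assumed)
sg_inq_alignment_blocks = 8192   # "Unmap granularity alignment"
sysfs_alignment_bytes = 4194304  # cat /sys/block/rbd0/discard_alignment
assert sg_inq_alignment_blocks * BLOCK == sysfs_alignment_bytes
print("both report 4 MiB")
```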