On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>> Hi Ilya,
>>>>
>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>
>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>> 819200 KB
>>>>>>
>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>> 782336 KB
>>>>>
>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>> There are three cases:
>>>>>
>>>>> discard range  | command
>>>>> -----------------------------------------
>>>>> whole object   | delete
>>>>> object's tail  | truncate
>>>>> object's head  | zero
>>>>>
>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>> examples except the last one, you are attempting to discard the head
>>>>> of the (first) object.
>>>>>
>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>
>>>>> Offset  Length   Type
>>>>> 0       4194304  data
>>>>>
>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>
>>>>> Offset  Length   Type
>>>>> 0       4193792  data
>>>>
>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>> granularity of 8192 sectors, which is passed verbatim by SCST.
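Ilya's three-case table can be illustrated with a little shell arithmetic. This is only a sketch that mimics the mapping, not the actual krbd code; the object size, offset and length are the values from his blkdiscard tail example:

```shell
#!/bin/sh
# Sketch: classify how a discard range maps onto 4M RADOS objects.
# whole object -> delete, object's tail -> truncate, head/middle -> zero.
OBJ_SIZE=$((4 << 20))             # default RBD object size: 4 MiB
OFFSET=$(( (4 << 20) - 512 ))     # last 512 bytes of the first object
LENGTH=512
END=$((OFFSET + LENGTH))

obj=$((OFFSET / OBJ_SIZE))
while [ $((obj * OBJ_SIZE)) -lt "$END" ]; do
    ostart=$((obj * OBJ_SIZE))
    oend=$((ostart + OBJ_SIZE))
    # clamp the discard range to this object's extent
    s=$OFFSET; if [ "$s" -lt "$ostart" ]; then s=$ostart; fi
    e=$END;    if [ "$e" -gt "$oend" ];  then e=$oend;  fi
    if [ "$s" -eq "$ostart" ] && [ "$e" -eq "$oend" ]; then
        echo "object $obj: delete (frees space)"
    elif [ "$e" -eq "$oend" ]; then
        echo "object $obj: truncate (frees space)"
    else
        echo "object $obj: zero (no space freed)"
    fi
    obj=$((obj + 1))
done
```

With these values the range covers only the tail of object 0, so it prints a single "truncate" line; an ESXi-style 8192-sector discard that starts mid-object would hit the "zero" branch instead.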
>>>> There is a slight reduction in size via the rbd diff method, but now I
>>>> understand that an actual truncate only takes effect when the discard
>>>> happens to clip the tail of an object.
>>>>
>>>> So far, looking at
>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>
>>>> ...the only variable we can control is the count of 8192-sector chunks,
>>>> not their size. Which means that most of the ESXi discard
>>>> commands will be disregarded by Ceph.
>>>>
>>>> Vlad, is the 8192 sectors coming from ESXi, as in the debug output:
>>>>
>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>> 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously)
>>> being forced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
>>> correct granularity and alignment here (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
discard_alignment in /sys/block/<device>/queue, but for RBD it's in
/sys/block/<device> - could this be the source of the issue?
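The split location of the two attributes is easy to trip over, so it can help to check both paths in one go. A sketch; the helper name and the sys/dev parameters are made up for illustration (on a live system you would pass /sys and your mapped device name):

```shell
# Sketch: print the discard attributes for an RBD device.
# Note the split: discard_granularity lives under queue/, while
# discard_alignment sits directly under <sys>/block/<dev>.
show_discard_attrs() {
    sys=${1:-/sys}
    dev=${2:-rbd28}
    for attr in "$sys/block/$dev/queue/discard_granularity" \
                "$sys/block/$dev/discard_alignment"; do
        if [ -r "$attr" ]; then
            echo "$attr: $(cat "$attr")"
        fi
    done
}
# Usage on a live system: show_discard_attrs /sys rbd28
# Both values should read 4194304 (4 MiB) for a default-order RBD image.
```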
Here is what I get querying the iSCSI-exported RBD device on Linux:

root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 255 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 16384 blocks
  Optimal transfer length: 1024 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192

>
> Thanks,
>
> Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
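For reference, converting the block-limits values from the sg_inq output above out of 512-byte logical blocks into bytes shows they all line up with the 4 MiB RADOS object size. A quick sketch (the 512-byte block size is an assumption; check it against the device's logical block size):

```shell
#!/bin/sh
# Sketch: convert VPD block-limits values (in 512-byte blocks) to bytes.
# Values copied from the sg_inq output in the thread.
BLOCK=512
OPT_UNMAP_GRAN=8192     # "Optimal unmap granularity"
UNMAP_ALIGN=8192        # "Unmap granularity alignment"
MAX_UNMAP_LBAS=8192     # "Maximum unmap LBA count"
echo "optimal unmap granularity: $((OPT_UNMAP_GRAN * BLOCK)) bytes"
echo "unmap alignment:           $((UNMAP_ALIGN * BLOCK)) bytes"
echo "max bytes per UNMAP:       $((MAX_UNMAP_LBAS * BLOCK)) bytes"
```

All three work out to 4194304 bytes. Note in particular the maximum unmap LBA count: a single UNMAP command can cover at most 4 MiB here, i.e. at most one RADOS object, which is consistent with the 8192-sector chunks seen in the SCST debug output.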