On Wed, Aug 3, 2016 at 10:54 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>>>> Hi Ilya,
>>>>>>>>
>>>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>>>
>>>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>>> 819200 KB
>>>>>>>>>>
>>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>>> 782336 KB
>>>>>>>>>
>>>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>>>> There are three cases:
>>>>>>>>>
>>>>>>>>>   discard range   | command
>>>>>>>>>   -----------------------------------------
>>>>>>>>>   whole object    | delete
>>>>>>>>>   object's tail   | truncate
>>>>>>>>>   object's head   | zero
>>>>>>>>>
>>>>>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>>>> of the (first) object.
>>>>>>>>>
>>>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>>>
>>>>>>>>>   Offset  Length   Type
>>>>>>>>>   0       4194304  data
>>>>>>>>>
>>>>>>>>>   # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>>>
>>>>>>>>>   Offset  Length   Type
>>>>>>>>>   0       4193792  data
>>>>>>>>
>>>>>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There
>>>>>>>> is a slight reduction in size via the rbd diff method, but now I
>>>>>>>> understand that an actual truncate only takes effect when the discard
>>>>>>>> happens to clip the tail of an object.
>>>>>>>>
>>>>>>>> So far, looking at
>>>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>>>
>>>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>>>> and not their size, which means that most of the ESXi discard
>>>>>>>> commands will be disregarded by Ceph.
>>>>>>>>
>>>>>>>> Vlad, is the 8192-sector size coming from ESXi, as in this debug line:
>>>>>>>>
>>>>>>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>>>> 1342099456, nr_sects 8192)
>>>>>>>
>>>>>>> Yes, correct. However, to make sure that VMware is not (erroneously) being forced to do this, you need to perform one more check.
>>>>>>>
>>>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct granularity and alignment here (4M, I guess?)
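(For anyone repeating that check, a one-liner along these lines should dump the queue attributes together with the top-level discard_alignment in one shot; rbd28 is just my device name here, and the exact set of discard_* files can vary by kernel version:)

# grep . /sys/block/rbd28/discard_alignment /sys/block/rbd28/queue/discard_*

grep . simply prints each attribute's file name next to its value.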
>>>>>>
>>>>>> This seems to reflect the granularity (4194304), which matches the
>>>>>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>>>>>> value.
>>>>>>
>>>>>> Can discard_alignment be specified with RBD?
>>>>>
>>>>> It's exported as a read-only sysfs attribute, just like
>>>>> discard_granularity:
>>>>>
>>>>> # cat /sys/block/rbd0/discard_alignment
>>>>> 4194304
>>>>
>>>> Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
>>>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>>>> /sys/block/<device> - could this be the source of the issue?
>>>
>>> No. As you can see below, the alignment is reported correctly. So this must be a VMware
>>> issue, because it is ignoring the alignment parameter. You can try to align your VMware
>>> partition on a 4M boundary; it might help.
>>
>> Is this not a mismatch:
>>
>> - From sg_inq: Unmap granularity alignment: 8192
>>
>> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>>
>> I am compiling the latest SCST trunk now.
>
> Scratch that, please - I just did a test that shows the correct calculation
> of 4MB in sectors.
>
> - On the iSCSI client node:
>
> dd if=/dev/urandom of=/dev/sdf bs=1M count=800
> blkdiscard -o 0 -l 4194304 /dev/sdf
>
> - On the iSCSI server node:
>
> Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
> vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)
>
> (8192 * 512 = 4194304)
>
> Now proceeding to test discard again with the latest SCST trunk build.

I ran the ESXi unmap again with the latest trunk build of SCST and am
still observing the same behavior: although the discards do appear to be
aligned on 8192 sectors (4M) and to discard 8192 sectors at a time,
rbd diff is not showing any released space.

The VMFS (standard VMFS5) partition is aligned on 1M:

Number  Start (sector)    End (sector)  Size    Code  Name
   1            2048      2147483614    2047M   0700

Is the problem that the 4M discards are offset by 1M, so none of them
hits the tail of any object?

>
>
>>
>> Thanks,
>> Alex
>>
>>>
>>>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>>>
>>>> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
>>>> VPD INQUIRY: Block limits page (SBC)
>>>>   Maximum compare and write length: 255 blocks
>>>>   Optimal transfer length granularity: 8 blocks
>>>>   Maximum transfer length: 16384 blocks
>>>>   Optimal transfer length: 1024 blocks
>>>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>>>   Maximum unmap LBA count: 8192
>>>>   Maximum unmap block descriptor count: 4294967295
>>>>   Optimal unmap granularity: 8192
>>>>   Unmap granularity alignment valid: 1
>>>>   Unmap granularity alignment: 8192
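P.S. To check the 1M-offset theory directly on the RBD device, this is a rough
sketch of what I plan to try next. It reuses the rbd28 / spin1/testdis names
from my earlier tests (on a freshly created scratch image) and assumes the
default 4M object size; blkdiscard takes byte offsets and lengths here:

# fill the first two 4M objects with data (testdis is a scratch image)
dd if=/dev/urandom of=/dev/rbd28 bs=1M count=8 oflag=direct
# discard 4M starting 1M into the device, mimicking an ESXi UNMAP shifted by the VMFS partition start
blkdiscard -o $((1 << 20)) -l $((4 << 20)) /dev/rbd28
# see how much space RADOS still accounts for
rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'

If I am reading Ilya's table right, the [1M, 4M) part of that range is the tail
of the first object and should come back as a truncate (the diff dropping from
8192 KB to about 5120 KB), while the [4M, 5M) part is only the head of the
second object and should stay allocated.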