Re: [Scst-devel] Thin Provisioning and Ceph RBD's

On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> RBD illustration showing RBD ignoring discard until a certain
>>>> threshold - why is that?  This behavior is unfortunately incompatible
>>>> with ESXi discard (UNMAP) behavior.
>>>>
>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>
>> <snip>
>>>>
>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>> print SUM/1024 " KB" }'
>>>> 819200 KB
>>>>
>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>> print SUM/1024 " KB" }'
>>>> 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>>     discard range       | command
>>>     -----------------------------------------
>>>     whole object        | delete
>>>     object's tail       | truncate
>>>     object's head       | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> Offset    Length  Type
>>> 0         4194304 data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> Offset    Length  Type
>>> 0         4193792 data
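
To make the mapping concrete, here is a minimal bash sketch (not Ceph
code) that classifies each object touched by a discard into the
delete/truncate/zero cases from Ilya's table above.  The 4M object size
is the default; byte offsets/lengths and an object-aligned image start
are assumptions:

#!/bin/bash
OBJ=$((4 << 20))    # default RADOS object size: 4M
classify() {
    local off=$1 len=$2
    local end=$((off + len)) o
    for (( o = off / OBJ; o <= (end - 1) / OBJ; o++ )); do
        local ostart=$((o * OBJ))
        local oend=$((ostart + OBJ))
        if (( off <= ostart && end >= oend )); then
            echo "object $o: whole object -> delete (space freed)"
        elif (( end >= oend )); then
            echo "object $o: tail -> truncate (space freed)"
        else
            echo "object $o: head/middle -> zero (no space freed)"
        fi
    done
}
classify 0 4096000     # earlier example: head of object 0 -> zero
classify 0 40960000    # objects 0-8 -> delete, head of object 9 -> zero
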
>>
>> Looks like ESXi is sending each discard/unmap with a fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via the rbd diff method, but now I
>> understand that an actual truncate only takes effect when the discard
>> happens to clip the tail of an object.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>
>> ...the only variable we can control is the count of 8192-sector
>> chunks and not their size, which means that most of the ESXi discard
>> commands will free no space in Ceph.
>>
>> Vlad, is the 8192 sectors coming from ESXi, as in this debug output:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>> 1342099456, nr_sects 8192)
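
As a quick arithmetic check on that logged request (assuming 512-byte
sectors):

# 8192 sectors x 512 = 4194304 bytes, i.e. exactly one 4M object
echo $((1342099456 % 8192))    # -> 4096, so this request starts 2 MiB
                               # past an object boundary and clips the
                               # tail of one object plus the head of the
                               # next, rather than deleting a whole object
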
>
> Yes, correct. However, to make sure that VMware is not (erroneously) being forced to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct granularity and alignment here (4M, I guess?)

This seems to reflect the granularity (4194304), which matches the
8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
value.

Can discard_alignment be specified with RBD?
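
One place worth checking, assuming the usual sysfs layout: the
discard_alignment attribute is exposed one level up from queue/, so a
queue/discard* glob would not show it:

grep . /sys/block/rbd28/queue/discard_granularity \
       /sys/block/rbd28/queue/discard_max_bytes
cat /sys/block/rbd28/discard_alignment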

>
> 2. Connect to this iSCSI device from a Linux box and run sg_inq -p 0xB0 /dev/<device>
>
> SCST should correctly report those values for unmap parameters (in blocks).
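
If sg3_utils is available, sg_vpd can decode that page instead of
dumping raw hex; a sketch, with /dev/sdX standing in for the imported
LUN:

# Block Limits VPD page (0xB0), decoded
sg_vpd --page=bl /dev/sdX
# For a 512-byte-block LUN backed by this RBD, "Optimal unmap
# granularity" should read 8192 blocks, and "Unmap granularity
# alignment" should match the device's discard alignment
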
>
> If in both cases you see the same correct values, then this is a VMware issue, because it is ignoring what it is told to do (generate appropriately sized and aligned UNMAP requests). If either Ceph or SCST doesn't show the correct numbers, then the broken party should be fixed.
>
> Vlad