Re: [Scst-devel] Thin Provisioning and Ceph RBD's

Ilya Dryomov <idryomov@xxxxxxxxx> · Tue, 2 Aug 2016 10:36:36 +0200

On Tue, Aug 2, 2016 at 1:05 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> Hi Ilya,
>
> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>> Here is an illustration of RBD ignoring discard until a certain
>>> threshold - why is that?  This behavior is unfortunately incompatible
>>> with ESXi discard (UNMAP) behavior.
>>>
>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>
> <snip>
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>> 819200 KB
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>> 782336 KB
>>
>> Think about it in terms of underlying RADOS objects (4M by default).
>> There are three cases:
>>
>>     discard range       | command
>>     -----------------------------------------
>>     whole object        | delete
>>     object's tail       | truncate
>>     object's head       | zero
>>
>> Obviously, only delete and truncate free up space.  In all of your
>> examples, except the last one, you are attempting to discard the head
>> of the (first) object.
>>
>> You can free up as little as a sector, as long as it's the tail:
>>
>> Offset    Length  Type
>> 0         4194304 data
>>
>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>
>> Offset    Length  Type
>> 0         4193792 data
>
> Looks like ESXi is sending each discard/unmap with a fixed
> granularity of 8192 sectors, which SCST passes through verbatim.  The
> rbd diff method shows a slight reduction in size, but now I
> understand that an actual truncate only takes effect when the discard
> happens to clip the tail of an image.

... the tail of the *object*.  And again, with "filestore punch hole
= true", page-sized discards anywhere within the image would free up
space, but "rbd diff" won't reflect that.
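
To make the three cases concrete, here is a rough sketch (not the actual
krbd code) that maps a discard range onto per-object operations.  The
function name classify_discard is made up for illustration, the 4M
object size is the default mentioned above, and treating a range in the
middle of an object the same as a head ("zero") is an assumption on my
part, as is ignoring the end of the image.

```python
OBJECT_SIZE = 4 << 20  # default RBD object size (4M)


def classify_discard(offset, length, object_size=OBJECT_SIZE):
    """Return a list of (object index, operation) for a discard range.

    Hypothetical helper for illustration only; offsets are in bytes.
    """
    ops = []
    pos = offset
    end = offset + length
    while pos < end:
        idx = pos // object_size
        obj_start = idx * object_size
        obj_end = obj_start + object_size
        chunk_end = min(end, obj_end)
        if pos == obj_start and chunk_end == obj_end:
            op = "delete"      # whole object
        elif chunk_end == obj_end:
            op = "truncate"    # object's tail
        else:
            op = "zero"        # object's head (assumed: middle ranges too)
        ops.append((idx, op))
        pos = chunk_end
    return ops


# The blkdiscard invocations quoted above (offset, length in bytes).
for off, length in [(0, 4096000), (0, 40960000), ((4 << 20) - 512, 512)]:
    print(off, length, classify_discard(off, length))
```

Under these assumptions the 4096000-byte discard is a single "zero", the
40960000-byte one is nine deletes plus a "zero" on the head of the tenth
object, and the 512-byte one is a "truncate".  Nine deleted 4M objects
are 36864 KB, which is the drop from 819200 KB to 782336 KB in the
output quoted above.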

>
> So far looking at
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>
> ...the only variable we can control is the count of 8192-sector chunks
> and not their size.  Which means that most of the ESXi discard
> commands will be disregarded by Ceph.
>
> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>
> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
> 1342099456, nr_sects 8192)

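They won't be disregarded, but it would definitely work better if they
were aligned.  1342099456 isn't 4M-aligned: assuming 512-byte sectors,
it falls 2M into an object, so this discard covers the tail of one
object and the head of the next.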
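
For reference, the alignment arithmetic can be checked with a few lines
of Python.  This is only a sketch: it assumes the start_sector and
nr_sects values in the debug line are 512-byte sectors and that the
image uses the default 4M object size; split_by_object is a made-up
helper name.

```python
SECTOR_SIZE = 512      # assumption: the log counts 512-byte sectors
OBJECT_SIZE = 4 << 20  # default RBD object size (4M)


def split_by_object(start_sector, nr_sects, object_size=OBJECT_SIZE):
    """Split a sector range into (object index, offset in object, length).

    Hypothetical helper for illustration; offset and length are in bytes.
    """
    pos = start_sector * SECTOR_SIZE
    end = pos + nr_sects * SECTOR_SIZE
    pieces = []
    while pos < end:
        idx, off = divmod(pos, object_size)
        length = min(end - pos, object_size - off)
        pieces.append((idx, off, length))
        pos += length
    return pieces


# Values taken from the debug line quoted above.
for idx, off, length in split_by_object(1342099456, 8192):
    print(idx, off, length)
```

With those assumptions the range splits into two 2M pieces: one starting
2M into object 163830 and running to its end (a tail), and one covering
the first 2M of object 163831 (a head).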

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com