On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>> Hi Ilya,
>>>>>>
>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>
>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>
>>>>>> <snip>
>>>>>>>>
>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>> 819200 KB
>>>>>>>>
>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>> 782336 KB
>>>>>>>
>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>> There are three cases:
>>>>>>>
>>>>>>>   discard range  | command
>>>>>>>   -------------------------
>>>>>>>   whole object   | delete
>>>>>>>   object's tail  | truncate
>>>>>>>   object's head  | zero
>>>>>>>
>>>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>> of the (first) object.
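The three-case table above can be sketched as follows — a rough illustration assuming the default 4M object size, not the actual krbd/librbd code:

```python
# Rough sketch: how a discard range maps onto per-object RADOS ops
# (delete / truncate / zero), per the table above. Illustration only;
# assumes the default 4 MiB object size and no striping.

OBJ = 4 << 20  # default RADOS object size: 4 MiB


def classify(offset, length):
    """Yield (object_index, op) for each object the discard range touches."""
    end = offset + length
    first, last = offset // OBJ, (end - 1) // OBJ
    for i in range(first, last + 1):
        obj_start, obj_end = i * OBJ, (i + 1) * OBJ
        lo, hi = max(offset, obj_start), min(end, obj_end)
        if lo == obj_start and hi == obj_end:
            yield i, "delete"    # whole object: frees space
        elif hi == obj_end:
            yield i, "truncate"  # object's tail: frees space
        else:
            yield i, "zero"      # head (or middle): space NOT freed
```

For the 40960000-byte discard in the example above, this yields delete for objects 0-8 (9 x 4 MiB = 36864 KB, matching the drop from 819200 KB to 782336 KB) and zero for the head of object 9; the 4096000-byte discard touches only the head of object 0, so nothing is freed.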
>>>>>>>
>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>
>>>>>>>   Offset  Length   Type
>>>>>>>   0       4194304  data
>>>>>>>
>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>
>>>>>>>   Offset  Length   Type
>>>>>>>   0       4193792  data
>>>>>>
>>>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There
>>>>>> is a slight reduction in size via the rbd diff method, but now I
>>>>>> understand that an actual truncate only takes effect when the discard
>>>>>> happens to clip the tail of an object.
>>>>>>
>>>>>> So far, looking at
>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>
>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>> and not their size. Which means that most of the ESXi discard
>>>>>> commands will be disregarded by Ceph.
>>>>>>
>>>>>> Vlad, is the 8192 sectors coming from ESXi, as in the debug:
>>>>>>
>>>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>> 1342099456, nr_sects 8192)
>>>>>
>>>>> Yes, correct. However, to make sure that VMware is not (erroneously)
>>>>> forced to do this, you need to perform one more check.
>>>>>
>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
>>>>> correct granularity and alignment here (4M, I guess?)
>>>>
>>>> This seems to reflect the granularity (4194304), which matches the
>>>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>>>> value.
>>>>
>>>> Can discard_alignment be specified with RBD?
>>>
>>> It's exported as a read-only sysfs attribute, just like
>>> discard_granularity:
>>>
>>> # cat /sys/block/rbd0/discard_alignment
>>> 4194304
>>
>> Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>> /sys/block/<device> - could this be the source of the issue?
>
> No.
> As you can see below, the alignment is reported correctly. So this must be a VMware
> issue, because it is ignoring the alignment parameter. You can try to align your
> VMware partition on a 4M boundary; it might help.

Is this not a mismatch:

- From sg_inq: Unmap granularity alignment: 8192
- From "cat /sys/block/rbd0/discard_alignment": 4194304

I am compiling the latest SCST trunk now.

Thanks,
Alex

>
>> Here is what I get querying the iSCSI-exported RBD device on Linux:
>>
>> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
>> VPD INQUIRY: Block limits page (SBC)
>>   Maximum compare and write length: 255 blocks
>>   Optimal transfer length granularity: 8 blocks
>>   Maximum transfer length: 16384 blocks
>>   Optimal transfer length: 1024 blocks
>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>   Maximum unmap LBA count: 8192
>>   Maximum unmap block descriptor count: 4294967295
>>   Optimal unmap granularity: 8192
>>   Unmap granularity alignment valid: 1
>>   Unmap granularity alignment: 8192

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
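On the apparent mismatch above, a quick unit check (a sketch, assuming the usual 512-byte logical block size) suggests the two values may actually agree: the sg_inq VPD page 0xB0 unmap fields are reported in logical blocks, whereas the sysfs discard_* attributes are in bytes:

```python
# Unit check on the two "alignment" numbers above. Assumption: the
# device uses 512-byte logical blocks. sg_inq (SCSI Block Limits VPD,
# page 0xB0) reports unmap granularity/alignment in logical blocks;
# /sys/block/<dev>/discard_alignment is in bytes.
BLOCK = 512                      # logical block size in bytes (assumed)
sg_inq_alignment_blocks = 8192   # "Unmap granularity alignment"
sysfs_alignment_bytes = 4194304  # cat /sys/block/rbd0/discard_alignment
assert sg_inq_alignment_blocks * BLOCK == sysfs_alignment_bytes
print("both report 4 MiB")
```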