Re: [Scst-devel] Thin Provisioning and Ceph RBD's

On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>>> Hi Ilya,
>>>>>>>
>>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>>> threshold - why is that?  This behavior is unfortunately incompatible
>>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>>
>>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>>
>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>> 819200 KB
>>>>>>>>>
>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>> 782336 KB
>>>>>>>>
>>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>>> There are three cases:
>>>>>>>>
>>>>>>>>     discard range       | command
>>>>>>>>     -----------------------------------------
>>>>>>>>     whole object        | delete
>>>>>>>>     object's tail       | truncate
>>>>>>>>     object's head       | zero
>>>>>>>>
>>>>>>>> Obviously, only delete and truncate free up space.  In all of your
>>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>>> of the (first) object.
>>>>>>>>
>>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>>
>>>>>>>> Offset    Length  Type
>>>>>>>> 0         4194304 data
>>>>>>>>
>>>>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>>
>>>>>>>> Offset    Length  Type
>>>>>>>> 0         4193792 data
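A quick sketch to exercise all three cases, reusing the names from earlier in the thread (image spin1/testdis with default 4M objects, fully written and mapped as /dev/rbd28); the freed sizes in the comments follow from the table above and are illustrative, not measured:

OBJ=$((4 << 20))     # 4194304 bytes = one default RADOS object
usage() { rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'; }

usage                                                        # baseline
blkdiscard -o $OBJ -l $OBJ /dev/rbd28                        # whole object  -> delete, usage drops by 4096 KB
usage
blkdiscard -o $((2*OBJ + OBJ/2)) -l $((OBJ/2)) /dev/rbd28    # object's tail -> truncate, usage drops by 2048 KB
usage
blkdiscard -o $((3*OBJ)) -l $((OBJ/2)) /dev/rbd28            # object's head -> zero, usage unchanged
usage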
>>>>>>>
>>>>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>>>>> granularity of 8192 sectors, which is passed through verbatim by SCST.
>>>>>>> There is a slight reduction in size per the rbd diff method, but now I
>>>>>>> understand that an actual truncate only takes effect when the discard
>>>>>>> happens to clip the tail of an object.
>>>>>>>
>>>>>>> So far looking at
>>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>>
>>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>>> and not their size.  Which means that most of the ESXi discard
>>>>>>> commands will be disregarded by Ceph.
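For reference, the arithmetic behind that, assuming the usual 512-byte sectors:

echo $((8192 * 512))    # 4194304 bytes = exactly one default 4M RADOS object

So an 8192-sector UNMAP can only turn into a whole-object delete when it also starts on an 8192-sector (4 MiB) boundary; otherwise it straddles two objects, and only the tail portion of the first one gets truncated.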
>>>>>>>
>>>>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>>>>>
>>>>>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>>> 1342099456, nr_sects 8192)
>>>>>>
>>>>>> Yes, correct. However, to make sure that VMware is not (erroneously) being forced to do this, you need to perform one more check.
>>>>>>
>>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct granularity and alignment here (4M, I guess?)
>>>>>
>>>>> This seems to reflect the granularity (4194304), which matches the
>>>>> 8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
>>>>> value.
>>>>>
>>>>> Can discard_alignment be specified with RBD?
>>>>
>>>> It's exported as a read-only sysfs attribute, just like
>>>> discard_granularity:
>>>>
>>>> # cat /sys/block/rbd0/discard_alignment
>>>> 4194304
>>>
>>> Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
>>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>>> /sys/block/<device> - could this be the source of the issue?
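For anyone checking the same thing, the attributes live in two different places (a sketch; rbd0 stands in for whichever device is mapped):

cat /sys/block/rbd0/queue/discard_granularity   # bytes
cat /sys/block/rbd0/queue/discard_max_bytes     # bytes
cat /sys/block/rbd0/discard_alignment           # bytes; note this one is not under queue/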
>>
>> No. As you can see below, the alignment is reported correctly. So this must be a VMware
>> issue, because it is ignoring the alignment parameter. You can try to align your VMware
>> partition on a 4M boundary; it might help.
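One way to check that from a Linux node, if the same LUN is visible there (a sketch; the device and partition names are assumptions):

cat /sys/block/sdf/sdf1/start    # partition start in 512-byte sectors; divisible by 8192 means 4 MiB aligned
parted /dev/sdf unit s print     # same information, with the full partition table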
>
> Is this not a mismatch:
>
> - From sg_inq: Unmap granularity alignment: 8192
>
> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>
> I am compiling the latest SCST trunk now.

Scratch that, please; I just did a test that shows the correct calculation
of 4 MB in sectors.

- On iSCSI client node:

dd if=/dev/urandom of=/dev/sdf bs=1M count=800
blkdiscard -o 0 -l 4194304 /dev/sdf

- On iSCSI server node:

Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)

(8192 * 512 = 4194304)

Now proceeding to test discard again with the latest SCST trunk build.
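To confirm the effect on the Ceph side as well (assuming the exported volume is still backed by spin1/testdis):

rbd diff spin1/testdis | head -1    # first allocated extent should now start at 4194304 rather than 0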


>
> Thanks,
> Alex
>
>>
>>> Here is what I get querying the iSCSI-exported RBD device on Linux:
>>>
>>> root@kio1:/sys/block/sdf#  sg_inq -p 0xB0 /dev/sdf
>>> VPD INQUIRY: Block limits page (SBC)
>>>   Maximum compare and write length: 255 blocks
>>>   Optimal transfer length granularity: 8 blocks
>>>   Maximum transfer length: 16384 blocks
>>>   Optimal transfer length: 1024 blocks
>>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>>   Maximum unmap LBA count: 8192
>>>   Maximum unmap block descriptor count: 4294967295
>>>   Optimal unmap granularity: 8192
>>>   Unmap granularity alignment valid: 1
>>>   Unmap granularity alignment: 8192
>>


