On Wed, Aug 3, 2016 at 10:54 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>>>>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>>>>>> Hi Ilya,
>>>>>>>>
>>>>>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>>>>>
>>>>>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>>> 819200 KB
>>>>>>>>>>
>>>>>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>>>>>>> print SUM/1024 " KB" }'
>>>>>>>>>> 782336 KB
>>>>>>>>>
>>>>>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>>>>>> There are three cases:
>>>>>>>>>
>>>>>>>>>   discard range   | command
>>>>>>>>>   -----------------------------------------
>>>>>>>>>   whole object    | delete
>>>>>>>>>   object's tail   | truncate
>>>>>>>>>   object's head   | zero
>>>>>>>>>
>>>>>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>>>>>> examples, except the last one, you are attempting to discard the head
>>>>>>>>> of the (first) object.
>>>>>>>>>
>>>>>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>>>>>
>>>>>>>>>   Offset  Length   Type
>>>>>>>>>   0       4194304  data
>>>>>>>>>
>>>>>>>>>   # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>>>>>
>>>>>>>>>   Offset  Length   Type
>>>>>>>>>   0       4193792  data
>>>>>>>>
>>>>>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>>>>>> granularity of 8192 sectors, which is passed verbatim by SCST. There
>>>>>>>> is a slight reduction in size via the rbd diff method, but now I
>>>>>>>> understand that an actual truncate only takes effect when the discard
>>>>>>>> happens to clip the tail of an object.
>>>>>>>>
>>>>>>>> So far, looking at
>>>>>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>>>>>
>>>>>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>>>>>> and not their size, which means that most of the ESXi discard
>>>>>>>> commands will be disregarded by Ceph.
>>>>>>>>
>>>>>>>> Vlad, is the 8192-sector size coming from ESXi, as in this debug line:
>>>>>>>>
>>>>>>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>>>>>> 1342099456, nr_sects 8192)
>>>>>>>
>>>>>>> Yes, correct. However, to make sure that VMware is not (erroneously) being forced to do this, you need to perform one more check.
>>>>>>>
>>>>>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct granularity and alignment here (4M, I guess?)
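(For anyone repeating that check, a one-liner along these lines should dump the queue attributes together with the top-level discard_alignment in one shot; rbd28 is just my device name here, and the exact set of discard_* files can vary by kernel version:)

# grep . /sys/block/rbd28/discard_alignment /sys/block/rbd28/queue/discard_*

grep . simply prints each attribute's file name next to its value.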
>>>>>>
>>>>>> This seems to reflect the granularity (4194304), which matches the
>>>>>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>>>>>> value.
>>>>>>
>>>>>> Can discard_alignment be specified with RBD?
>>>>>
>>>>> It's exported as a read-only sysfs attribute, just like
>>>>> discard_granularity:
>>>>>
>>>>> # cat /sys/block/rbd0/discard_alignment
>>>>> 4194304
>>>>
>>>> Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
>>>> discard_alignment in /sys/block/<device>/queue, but for RBD it's in
>>>> /sys/block/<device> - could this be the source of the issue?
>>>
>>> No. As you can see below, the alignment is reported correctly. So this must be a VMware
>>> issue, because it is ignoring the alignment parameter. You can try to align your VMware
>>> partition on a 4M boundary; it might help.
>>
>> Is this not a mismatch:
>>
>> - From sg_inq: Unmap granularity alignment: 8192
>>
>> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>>
>> I am compiling the latest SCST trunk now.
>
> Scratch that, please - I just did a test that shows the correct calculation
> of 4MB in sectors.
>
> - On the iSCSI client node:
>
> dd if=/dev/urandom of=/dev/sdf bs=1M count=800
> blkdiscard -o 0 -l 4194304 /dev/sdf
>
> - On the iSCSI server node:
>
> Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
> vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)
>
> (8192 * 512 = 4194304)
>
> Now proceeding to test discard again with the latest SCST trunk build.

I ran the ESXi unmap again with the latest trunk build of SCST and am
still observing the same behavior: although the discards do appear to be
aligned on 8192 sectors (4M) and to discard 8192 sectors at a time,
rbd diff is not showing any released space.

The VMFS (standard VMFS5) partition is aligned on 1M:

Number  Start (sector)    End (sector)  Size    Code  Name
   1            2048      2147483614    2047M   0700

Is the problem that the 4M discards are offset by 1M, so none of them
hits the tail of any object?

>
>
>>
>> Thanks,
>> Alex
>>
>>>
>>>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>>>
>>>> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
>>>> VPD INQUIRY: Block limits page (SBC)
>>>>   Maximum compare and write length: 255 blocks
>>>>   Optimal transfer length granularity: 8 blocks
>>>>   Maximum transfer length: 16384 blocks
>>>>   Optimal transfer length: 1024 blocks
>>>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>>>   Maximum unmap LBA count: 8192
>>>>   Maximum unmap block descriptor count: 4294967295
>>>>   Optimal unmap granularity: 8192
>>>>   Unmap granularity alignment valid: 1
>>>>   Unmap granularity alignment: 8192
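P.S. To check the 1M-offset theory directly on the RBD device, this is a rough
sketch of what I plan to try next. It reuses the rbd28 / spin1/testdis names
from my earlier tests (on a freshly created scratch image) and assumes the
default 4M object size; blkdiscard takes byte offsets and lengths here:

# fill the first two 4M objects with data (testdis is a scratch image)
dd if=/dev/urandom of=/dev/rbd28 bs=1M count=8 oflag=direct
# discard 4M starting 1M into the device, mimicking an ESXi UNMAP shifted by the VMFS partition start
blkdiscard -o $((1 << 20)) -l $((4 << 20)) /dev/rbd28
# see how much space RADOS still accounts for
rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'

If I am reading Ilya's table right, the [1M, 4M) part of that range is the tail
of the first object and should come back as a truncate (the diff dropping from
8192 KB to about 5120 KB), while the [4M, 5M) part is only the head of the
second object and should stay allocated.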