On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>>> Hi Ilya,
>>>>
>>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>>> threshold - why is that? This behavior is unfortunately incompatible
>>>>>> with ESXi discard (UNMAP) behavior.
>>>>>>
>>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>> 819200 KB
>>>>>>
>>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>>> 782336 KB
>>>>>
>>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>>> There are three cases:
>>>>>
>>>>> discard range  | command
>>>>> -----------------------------------------
>>>>> whole object   | delete
>>>>> object's tail  | truncate
>>>>> object's head  | zero
>>>>>
>>>>> Obviously, only delete and truncate free up space. In all of your
>>>>> examples except the last one, you are attempting to discard the head
>>>>> of the (first) object.
>>>>>
>>>>> You can free up as little as a sector, as long as it's the tail:
>>>>>
>>>>> Offset  Length   Type
>>>>> 0       4194304  data
>>>>>
>>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>>
>>>>> Offset  Length   Type
>>>>> 0       4193792  data
>>>>
>>>> Looks like ESXi is sending each discard/unmap with a fixed
>>>> granularity of 8192 sectors, which is passed verbatim by SCST.
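Ilya's three-case table can be illustrated with a little shell arithmetic. This is only a sketch that mimics the mapping, not the actual krbd code; the object size, offset and length are the values from his blkdiscard tail example:

```shell
#!/bin/sh
# Sketch: classify how a discard range maps onto 4M RADOS objects.
# whole object -> delete, object's tail -> truncate, head/middle -> zero.
OBJ_SIZE=$((4 << 20))             # default RBD object size: 4 MiB
OFFSET=$(( (4 << 20) - 512 ))     # last 512 bytes of the first object
LENGTH=512
END=$((OFFSET + LENGTH))

obj=$((OFFSET / OBJ_SIZE))
while [ $((obj * OBJ_SIZE)) -lt "$END" ]; do
    ostart=$((obj * OBJ_SIZE))
    oend=$((ostart + OBJ_SIZE))
    # clamp the discard range to this object's extent
    s=$OFFSET; if [ "$s" -lt "$ostart" ]; then s=$ostart; fi
    e=$END;    if [ "$e" -gt "$oend" ];  then e=$oend;  fi
    if [ "$s" -eq "$ostart" ] && [ "$e" -eq "$oend" ]; then
        echo "object $obj: delete (frees space)"
    elif [ "$e" -eq "$oend" ]; then
        echo "object $obj: truncate (frees space)"
    else
        echo "object $obj: zero (no space freed)"
    fi
    obj=$((obj + 1))
done
```

With these values the range covers only the tail of object 0, so it prints a single "truncate" line; an ESXi-style 8192-sector discard that starts mid-object would hit the "zero" branch instead.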
>>>> There is a slight reduction in size via the rbd diff method, but now I
>>>> understand that an actual truncate only takes effect when the discard
>>>> happens to clip the tail of an object.
>>>>
>>>> So far, looking at
>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>>
>>>> ...the only variable we can control is the count of 8192-sector chunks,
>>>> not their size. Which means that most of the ESXi discard
>>>> commands will be disregarded by Ceph.
>>>>
>>>> Vlad, is the 8192 sectors coming from ESXi, as in the debug output:
>>>>
>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>> 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously)
>>> being forced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
>>> correct granularity and alignment here (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Ah, thanks Ilya, it is indeed there. Vlad, your email says to look for
discard_alignment in /sys/block/<device>/queue, but for RBD it's in
/sys/block/<device> - could this be the source of the issue?
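The split location of the two attributes is easy to trip over, so it can help to check both paths in one go. A sketch; the helper name and the sys/dev parameters are made up for illustration (on a live system you would pass /sys and your mapped device name):

```shell
# Sketch: print the discard attributes for an RBD device.
# Note the split: discard_granularity lives under queue/, while
# discard_alignment sits directly under <sys>/block/<dev>.
show_discard_attrs() {
    sys=${1:-/sys}
    dev=${2:-rbd28}
    for attr in "$sys/block/$dev/queue/discard_granularity" \
                "$sys/block/$dev/discard_alignment"; do
        if [ -r "$attr" ]; then
            echo "$attr: $(cat "$attr")"
        fi
    done
}
# Usage on a live system: show_discard_attrs /sys rbd28
# Both values should read 4194304 (4 MiB) for a default-order RBD image.
```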
Here is what I get querying the iSCSI-exported RBD device on Linux:

root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 255 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 16384 blocks
  Optimal transfer length: 1024 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192

>
> Thanks,
>
> Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
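For reference, converting the block-limits values from the sg_inq output above out of 512-byte logical blocks into bytes shows they all line up with the 4 MiB RADOS object size. A quick sketch (the 512-byte block size is an assumption; check it against the device's logical block size):

```shell
#!/bin/sh
# Sketch: convert VPD block-limits values (in 512-byte blocks) to bytes.
# Values copied from the sg_inq output in the thread.
BLOCK=512
OPT_UNMAP_GRAN=8192     # "Optimal unmap granularity"
UNMAP_ALIGN=8192        # "Unmap granularity alignment"
MAX_UNMAP_LBAS=8192     # "Maximum unmap LBA count"
echo "optimal unmap granularity: $((OPT_UNMAP_GRAN * BLOCK)) bytes"
echo "unmap alignment:           $((UNMAP_ALIGN * BLOCK)) bytes"
echo "max bytes per UNMAP:       $((MAX_UNMAP_LBAS * BLOCK)) bytes"
```

All three work out to 4194304 bytes. Note in particular the maximum unmap LBA count: a single UNMAP command can cover at most 4 MiB here, i.e. at most one RADOS object, which is consistent with the 8192-sector chunks seen in the SCST debug output.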