Re: [Scst-devel] Thin Provisioning and Ceph RBD's

On Tuesday, August 2, 2016, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> An RBD illustration showing RBD ignoring discards below a certain
>>>>> size threshold - why is that?  This behavior is unfortunately
>>>>> incompatible with ESXi discard (UNMAP) behavior.
>>>>>
>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>
>>> <snip>
>>>>>
>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>> print SUM/1024 " KB" }'
>>>>> 819200 KB
>>>>>
>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>>>> print SUM/1024 " KB" }'
>>>>> 782336 KB
>>>>
>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>> There are three cases:
>>>>
>>>>     discard range       | command
>>>>     -----------------------------------------
>>>>     whole object        | delete
>>>>     object's tail       | truncate
>>>>     object's head       | zero
>>>>
>>>> Obviously, only delete and truncate free up space.  In all of your
>>>> examples, except the last one, you are attempting to discard the head
>>>> of the (first) object.
>>>>
>>>> You can free up as little as a sector, as long as it's the tail:
>>>>
>>>> Offset    Length  Type
>>>> 0         4194304 data
>>>>
>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>
>>>> Offset    Length  Type
>>>> 0         4193792 data
>>>
>>> Looks like ESXi is sending each discard/unmap with a fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via the rbd diff method, but now I
>>> understand that an actual truncate only takes effect when the discard
>>> happens to clip the tail of an object.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>> 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously) being forced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct granularity and alignment here (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 sectors (8192 x 512 bytes = 4194304).  However, there is no
> alignment value.
>
> Can discard_alignment be specified with RBD?

It's exported as a read-only sysfs attribute, just like
discard_granularity:

# cat /sys/block/rbd0/discard_alignment
4194304
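
(For reference, all of the discard-related attributes can be read in one
go; these are the standard block-layer sysfs files, and the values of
course depend on the kernel and rbd in use:)

# grep . /sys/block/rbd28/discard_alignment \
         /sys/block/rbd28/queue/discard_granularity \
         /sys/block/rbd28/queue/discard_max_bytes \
         /sys/block/rbd28/queue/discard_zeroes_data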

Is there perhaps a way to increase the discard granularity?  The way I see it, based on the discussion so far, here is why discard/unmap is failing to work with VMware:

- RBD provides space in 4MB objects, which must be discarded entirely, or at least have their tails clipped, for any space to be freed.

- SCST communicates to ESXi that the discard alignment is 4MB and the discard granularity is also 4MB.

- ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free anything (see the sketch below).
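
To make the arithmetic concrete, here is a rough shell sketch (the 4MB
object size, the 1MB start offset and the 8192-sector length are just
assumptions for illustration, and it ignores any splitting or trimming
the block layer itself may do based on discard_alignment/granularity)
of how a single ESXi-style UNMAP maps onto RADOS objects, per Ilya's
table above:

OBJ=$((4 << 20))        # rbd object size: 4MB (default, assumed)
OFF=$((1 << 20))        # hypothetical UNMAP start: 1MB into the image
LEN=$((8192 * 512))     # 8192 sectors x 512 bytes = 4MB, as ESXi sends them

end=$((OFF + LEN)); o=$OFF
while [ "$o" -lt "$end" ]; do
    obj_start=$(( o / OBJ * OBJ )); obj_end=$(( obj_start + OBJ ))
    seg_end=$(( end < obj_end ? end : obj_end ))
    if [ "$o" -eq "$obj_start" ] && [ "$seg_end" -eq "$obj_end" ]; then
        op=delete       # whole object: space is freed
    elif [ "$seg_end" -eq "$obj_end" ]; then
        op=truncate     # object's tail: space is freed
    else
        op=zero         # object's head (or interior): nothing is freed
    fi
    echo "object $(( o / OBJ )): bytes $(( o - obj_start ))..$(( seg_end - obj_start )) -> $op"
    o=$seg_end
done

With these numbers it prints a truncate for the tail of object 0 and a
zero for the head of object 1; a whole-object delete can only happen
when the UNMAP start lands exactly on a 4MB object boundary, which
1MB-aligned VMFS offsets mostly do not.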

What if it were possible to set a 6MB discard granularity?

Thank you,
Alex


Thanks,

                Ilya


--
Alex Gorbachev
Storcium

