On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> I'm confused. How can a 4M discard not free anything? It's either
>>>>> going to hit an entire object or two adjacent objects, truncating the
>>>>> tail of one and zeroing the head of another. Using rbd diff:
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 4194304 data
>>>>> 29360128 4194304 data
>>>>>
>>>>> # a 4M discard at 1M into a RADOS object
>>>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 1048576 data
>>>>> 29360128 4194304 data
>>>>
>>>> I have tested this on a small RBD device with such offsets and indeed,
>>>> the discard works as you describe, Ilya.
>>>>
>>>> Looking more into why ESXi's discard is not working. I found this
>>>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
>>>> unmap_alignment 0:
>>>>
>>>> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
>>>> provisioning for device /dev/rbd/spin1/unmap1t
>>>> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
>>>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
>>>> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
>>>> target virtual disk p_iSCSILun_sclun945
>>>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
>>>> nblocks=838860800, cyln=409600)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]:
>>>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
>>>> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
>>>> scst_alloc_set_UA:12711:Queuing new UA ffff8810251f3a90 (6:29:0,
>>>> d_sense 0) to tgt_dev ffff88102583ad00 (dev p_iSCSILun_sclun945,
>>>> initiator copy_manager_sess)
>>>>
>>>> even though:
>>>>
>>>> root@e1:/sys/block/rbd29# cat discard_alignment
>>>> 4194304
>>>>
>>>> So somehow the discard_alignment is not making it into the LUN. Could
>>>> this be the issue?
>>>
>>> No, if you are not seeing *any* effect, the alignment is pretty much
>>> irrelevant. Can you do the following on a small test image?
>>>
>>> - capture "rbd diff" output
>>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
>>> - issue a few discards with blkdiscard
>>> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
>>> - capture "rbd diff" output again
>>>
>>> and attach all of the above? (You might need to install a blktrace
>>> package.)
>>>
>>
>> Latest results from VMware validation tests:
>>
>> Each test creates and deletes a virtual disk, then calls ESXi unmap
>> for what ESXi maps to that volume:
>>
>> Test 1: 10 GB reclaim, rbd diff size: 3 GB, discards: 4829
>>
>> Test 2: 100 GB reclaim, rbd diff size: 50 GB, discards: 197837
>>
>> Test 3: 175 GB reclaim, rbd diff size: 47 GB, discards: 197824
>>
>> Test 4: 250 GB reclaim, rbd diff size: 125 GB, discards: 197837
>>
>> Test 5: 250 GB reclaim, rbd diff size: 80 GB, discards: 197837
>>
>> At the end, the compounded used size via rbd diff is 608 GB from 775 GB
>> of data. So we release only about 20% via discards in the end.
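
(The per-test "rbd diff size" figures above can be cross-checked by adding up the extent lengths that rbd diff reports. A minimal sketch, assuming the plain three-column output shown earlier in the thread and using spin1/unmap1t purely as an example image name:

# sum the lengths of the allocated ("data") extents and print the total in GB
$ rbd diff spin1/unmap1t | awk '$3 == "data" { used += $2 } END { printf "%.1f GB\n", used/1024/1024/1024 }'
)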
>
> Ilya has analyzed the discard pattern, and indeed the problem is that
> ESXi appears to disregard the discard alignment attribute. Therefore,
> discards are shifted by 1M, and are not hitting the tail of objects.
>
> Discards work much better on the EagerZeroedThick volumes, likely due
> to contiguous data.
>
> I will proceed with the rest of testing, and will post any tips or
> best practice results as they become available.
>
> Thank you for everyone's help and advice!

Testing is complete, and the discards definitely follow the alignment pattern:

- 4 MB objects with VMFS5: only some discards take effect, since the discards (shifted by 1 MB) do not often hit the tail of an object
- 1 MB objects: practically 100% space reclaim

I have not tried shifting the VMFS5 filesystem, as the test will not work with that, and I am not sure how to incorporate such a shift properly into routine VMware operation.

So, as a best practice: if you want efficient ESXi space reclaim with RBD and VMFS5, use a 1 MB object size in Ceph (see the example command at the end of this message).

Best regards,
--
Alex Gorbachev
Storcium
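
For reference on the object size recommendation, a minimal sketch of creating such an image, assuming a recent rbd CLI (the pool/image name below is only a placeholder, and the size is given in MB as in the log above):

# newer rbd accepts --object-size directly (1 MB objects instead of the 4 MB default)
$ rbd create --size 409600 --object-size 1M spin1/unmap-1m

# older releases take --order instead; order 20 means 2^20 bytes = 1 MB objects
$ rbd create --size 409600 --order 20 spin1/unmap-1m

Note that the object size is fixed when an image is created, so existing 4 MB-object images would have to be recreated (and their data copied over) to benefit.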