On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>> I'm confused. How can a 4M discard not free anything? It's either
>>>>> going to hit an entire object or two adjacent objects, truncating the
>>>>> tail of one and zeroing the head of another. Using rbd diff:
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 4194304 data
>>>>> 29360128 4194304 data
>>>>>
>>>>> # a 4M discard at 1M into a RADOS object
>>>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 1048576 data
>>>>> 29360128 4194304 data
>>>>
>>>> I have tested this on a small RBD device with such offsets and indeed,
>>>> the discard works as you describe, Ilya.
>>>>
>>>> Looking more into why ESXi's discard is not working. I found this
>>>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
>>>> unmap_alignment 0:
>>>>
>>>> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
>>>> provisioning for device /dev/rbd/spin1/unmap1t
>>>> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
>>>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
>>>> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
>>>> target virtual disk p_iSCSILun_sclun945
>>>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
>>>> nblocks=838860800, cyln=409600)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]:
>>>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
>>>> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
>>>> scst_alloc_set_UA:12711:Queuing new UA ffff8810251f3a90 (6:29:0,
>>>> d_sense 0) to tgt_dev ffff88102583ad00 (dev p_iSCSILun_sclun945,
>>>> initiator copy_manager_sess)
>>>>
>>>> even though:
>>>>
>>>> root@e1:/sys/block/rbd29# cat discard_alignment
>>>> 4194304
>>>>
>>>> So somehow the discard_alignment is not making it into the LUN. Could
>>>> this be the issue?
>>>
>>> No, if you are not seeing *any* effect, the alignment is pretty much
>>> irrelevant. Can you do the following on a small test image?
>>>
>>> - capture "rbd diff" output
>>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
>>> - issue a few discards with blkdiscard
>>> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
>>> - capture "rbd diff" output again
>>>
>>> and attach all of the above? (You might need to install a blktrace
>>> package.)
>>>
>>
>> Latest results from VMware validation tests:
>>
>> Each test creates and deletes a virtual disk, then calls ESXi unmap
>> for what ESXi maps to that volume:
>>
>> Test 1: 10 GB reclaim, rbd diff size: 3 GB, discards: 4829
>>
>> Test 2: 100 GB reclaim, rbd diff size: 50 GB, discards: 197837
>>
>> Test 3: 175 GB reclaim, rbd diff size: 47 GB, discards: 197824
>>
>> Test 4: 250 GB reclaim, rbd diff size: 125 GB, discards: 197837
>>
>> Test 5: 250 GB reclaim, rbd diff size: 80 GB, discards: 197837
>>
>> At the end, the compounded used size via rbd diff is 608 GB from 775 GB
>> of data. So we release only about 20% via discards in the end.
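
(The per-test "rbd diff size" figures above can be cross-checked by adding up the extent lengths that rbd diff reports. A minimal sketch, assuming the plain three-column output shown earlier in the thread and using spin1/unmap1t purely as an example image name:

# sum the lengths of the allocated ("data") extents and print the total in GB
$ rbd diff spin1/unmap1t | awk '$3 == "data" { used += $2 } END { printf "%.1f GB\n", used/1024/1024/1024 }'
)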
>
> Ilya has analyzed the discard pattern, and indeed the problem is that
> ESXi appears to disregard the discard alignment attribute. Therefore,
> discards are shifted by 1M, and are not hitting the tail of objects.
>
> Discards work much better on the EagerZeroedThick volumes, likely due
> to contiguous data.
>
> I will proceed with the rest of testing, and will post any tips or
> best practice results as they become available.
>
> Thank you for everyone's help and advice!

Testing is complete, and the discards definitely follow the alignment pattern:

- 4 MB objects with VMFS5: only some discards take effect, since the discards (shifted by 1 MB) do not often hit the tail of an object
- 1 MB objects: practically 100% space reclaim

I have not tried shifting the VMFS5 filesystem, as the test will not work with that, and I am not sure how to incorporate such a shift properly into routine VMware operation.

So, as a best practice: if you want efficient ESXi space reclaim with RBD and VMFS5, use a 1 MB object size in Ceph (see the example command at the end of this message).

Best regards,
--
Alex Gorbachev
Storcium
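
For reference on the object size recommendation, a minimal sketch of creating such an image, assuming a recent rbd CLI (the pool/image name below is only a placeholder, and the size is given in MB as in the log above):

# newer rbd accepts --object-size directly (1 MB objects instead of the 4 MB default)
$ rbd create --size 409600 --object-size 1M spin1/unmap-1m

# older releases take --order instead; order 20 means 2^20 bytes = 1 MB objects
$ rbd create --size 409600 --order 20 spin1/unmap-1m

Note that the object size is fixed when an image is created, so existing 4 MB-object images would have to be recreated (and their data copied over) to benefit.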