Hi,

Some of our Proxmox VE users have noticed that a large fstrim inside a QEMU/KVM guest does not free up as much space as expected on the backing RBD image (checked via `rbd du --exact`) if the image is mapped on the host via KRBD and passed to QEMU as a block device. If the image is attached via QEMU's librbd integration instead, fstrim seems to work much better.

I've found an earlier discussion [0] according to which, for fstrim to work properly, the filesystem should be aligned to object size (4M) boundaries. Indeed, in the test setups I've looked at, the filesystem is not aligned to 4M boundaries (a sketch for checking this is included below). Still, I'm wondering if there might be a solution that doesn't require a specific partitioning/filesystem layout.

To keep the test setup simple, I'm not looking at VMs here, but at an unaligned blkdiscard on a KRBD-mapped block device directly on the host. On my test cluster (for versions see [5]), I create a 1G test volume, map it with default settings, write random data to it, and then issue a blkdiscard with a 1M offset (see [1] for the complete commands):

> # blkdiscard --offset 1M /dev/rbd/vmpool/test

Afterwards, `rbd du --exact` reports 256 MiB used:

> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB  256 MiB

Naively, I would have expected a result between 1 and 4M, my reasoning being that the 1023M discard could be split into 3M (to get to 4M alignment) plus 1020M. But I've checked the kernel's discard splitting logic [2], and as far as I understand it, it aims to align discard requests to `discard_granularity`, which is 64k here:

> /sys/class/block/rbd0/queue/discard_granularity:65536

I've found that I can set the `alloc_size` map option [3] to 4M, which sets `discard_granularity` to 4M. With that, the result of the blkdiscard is much closer to my expectations (see [4] for the complete commands):

> # blkdiscard --offset 1M /dev/rbd/vmpool/test
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 MiB

However, with `alloc_size` set to 4M, `minimum_io_size` is apparently also set to 4M (it was 64k before, see [1]):

> /sys/class/block/rbd0/queue/minimum_io_size:4194304

My expectation is that this could negatively impact non-discard IO performance (write amplification?). But I am unsure, as I ran a few small benchmarks and couldn't really see a difference between the two settings (a sketch of such a comparison is included below).

Thus, my questions:

- Should I expect any downside for non-discard IO after setting `alloc_size` to 4M?
- If yes: would it be feasible for KRBD to decouple `discard_granularity` and `minimum_io_size`, i.e., to expose an option that sets only `discard_granularity` to 4M?

Happy about any pointers, and let me know if I can provide any further information.
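In case it is useful: one way to check the guest filesystem alignment mentioned above is to look at the partition start offset via sysfs. This is only a minimal sketch, assuming a partition named vda1 inside the guest (the device name is just a placeholder):

> # start=$(cat /sys/class/block/vda1/start)      # partition start in 512-byte sectors
> # echo $(( start * 512 % (4 * 1024 * 1024) ))   # 0 means the partition starts on a 4M boundary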
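And for the non-discard comparison, something along the lines of a small random-write fio run against the mapped device should surface any write amplification. Again only a sketch with arbitrary parameters, and note that it overwrites the test volume:

> # fio --name=randwrite --filename=/dev/rbd/vmpool/test --direct=1 --ioengine=libaio \
>       --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting

Run once against a default mapping and once with alloc_size=4194304 to compare.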
Thanks and best wishes,
Friedrich

[0] https://www.spinics.net/lists/ceph-users/msg67740.html

[1]
> # rbd create -p vmpool test --size 1G
> # rbd map -p vmpool test
> /dev/rbd0
> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
> /sys/class/block/rbd0/queue/discard_granularity:65536
> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
> /sys/class/block/rbd0/queue/discard_zeroes_data:0
> /sys/class/block/rbd0/queue/minimum_io_size:65536
> /sys/class/block/rbd0/queue/optimal_io_size:4194304
> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
> dd: error writing '/dev/rbd/vmpool/test': No space left on device
> 257+0 records in
> 256+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.73227 s, 227 MB/s
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 GiB

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-merge.c?h=v6.11&id=98f7e32f20d28ec452afb208f9cffc08448a2652#n108

[3] https://docs.ceph.com/en/reef/man/8/rbd/

[4]
> # rbd map -p vmpool test -o alloc_size=4194304
> /dev/rbd0
> # grep '' /sys/class/block/rbd*/device/config_info
> 10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789 name=admin,key=client.admin,alloc_size=4194304 vmpool test -
> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
> /sys/class/block/rbd0/queue/discard_granularity:4194304
> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
> /sys/class/block/rbd0/queue/discard_zeroes_data:0
> /sys/class/block/rbd0/queue/minimum_io_size:4194304
> /sys/class/block/rbd0/queue/optimal_io_size:4194304
> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
> dd: error writing '/dev/rbd/vmpool/test': No space left on device
> 257+0 records in
> 256+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.39016 s, 245 MB/s
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 GiB

[5] Host: Proxmox VE 8.2, but with an Ubuntu mainline kernel 6.11 build (6.11.0-061100-generic from https://kernel.ubuntu.com/mainline/v6.11/)
    Ceph: Proxmox build of 18.2.4, but happy to try a different build if needed.