Hi,

Some of our Proxmox VE users have noticed that a large fstrim inside a QEMU/KVM guest does not free up as much space as expected on the backing RBD image (checked via `rbd du --exact`) if the image is mapped on the host via KRBD and passed to QEMU as a block device. If the image is attached via QEMU's librbd integration instead, fstrim seems to work much better.

I've found an earlier discussion [0] according to which, for fstrim to work properly, the filesystem should be aligned to object size (4M) boundaries. Indeed, in the test setups I've looked at, the filesystem is not aligned to 4M boundaries (a sketch for checking this is included below). Still, I'm wondering if there might be a solution that doesn't require a specific partitioning/filesystem layout.

To keep the test setup simple, I'm not looking at VMs here, but at an unaligned blkdiscard on a KRBD-mapped block device directly on the host. On my test cluster (for versions see [5]), I create a 1G test volume, map it with default settings, write random data to it, and then issue a blkdiscard with a 1M offset (see [1] for the complete commands):

> # blkdiscard --offset 1M /dev/rbd/vmpool/test

Afterwards, `rbd du --exact` reports 256 MiB used:

> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB  256 MiB

Naively, I would have expected a result between 1 and 4M, my reasoning being that the 1023M discard could be split into 3M (to get to 4M alignment) plus 1020M. But I've checked the kernel's discard splitting logic [2], and as far as I understand it, it aims to align discard requests to `discard_granularity`, which is 64k here:

> /sys/class/block/rbd0/queue/discard_granularity:65536

I've found that I can set the `alloc_size` map option [3] to 4M, which sets `discard_granularity` to 4M. With that, the result of the blkdiscard is much closer to my expectations (see [4] for the complete commands):

> # blkdiscard --offset 1M /dev/rbd/vmpool/test
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 MiB

However, with `alloc_size` set to 4M, `minimum_io_size` is apparently also set to 4M (it was 64k before, see [1]):

> /sys/class/block/rbd0/queue/minimum_io_size:4194304

My expectation is that this could negatively impact non-discard IO performance (write amplification?). But I am unsure, as I ran a few small benchmarks and couldn't really see a difference between the two settings (a sketch of such a comparison is included below).

Thus, my questions:

- Should I expect any downside for non-discard IO after setting `alloc_size` to 4M?
- If yes: would it be feasible for KRBD to decouple `discard_granularity` and `minimum_io_size`, i.e., to expose an option that sets only `discard_granularity` to 4M?

Happy about any pointers, and let me know if I can provide any further information.
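In case it is useful: one way to check the guest filesystem alignment mentioned above is to look at the partition start offset via sysfs. This is only a minimal sketch, assuming a partition named vda1 inside the guest (the device name is just a placeholder):

> # start=$(cat /sys/class/block/vda1/start)      # partition start in 512-byte sectors
> # echo $(( start * 512 % (4 * 1024 * 1024) ))   # 0 means the partition starts on a 4M boundary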
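And for the non-discard comparison, something along the lines of a small random-write fio run against the mapped device should surface any write amplification. Again only a sketch with arbitrary parameters, and note that it overwrites the test volume:

> # fio --name=randwrite --filename=/dev/rbd/vmpool/test --direct=1 --ioengine=libaio \
>       --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting

Run once against a default mapping and once with alloc_size=4194304 to compare.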
Thanks and best wishes,
Friedrich

[0] https://www.spinics.net/lists/ceph-users/msg67740.html

[1]
> # rbd create -p vmpool test --size 1G
> # rbd map -p vmpool test
> /dev/rbd0
> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
> /sys/class/block/rbd0/queue/discard_granularity:65536
> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
> /sys/class/block/rbd0/queue/discard_zeroes_data:0
> /sys/class/block/rbd0/queue/minimum_io_size:65536
> /sys/class/block/rbd0/queue/optimal_io_size:4194304
> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
> dd: error writing '/dev/rbd/vmpool/test': No space left on device
> 257+0 records in
> 256+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.73227 s, 227 MB/s
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 GiB

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-merge.c?h=v6.11&id=98f7e32f20d28ec452afb208f9cffc08448a2652#n108

[3] https://docs.ceph.com/en/reef/man/8/rbd/

[4]
> # rbd map -p vmpool test -o alloc_size=4194304
> /dev/rbd0
> # grep '' /sys/class/block/rbd*/device/config_info
> 10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789 name=admin,key=client.admin,alloc_size=4194304 vmpool test -
> # grep '' /sys/class/block/rbd0/queue/{discard_*,minimum_io_size,optimal_*}
> /sys/class/block/rbd0/queue/discard_granularity:4194304
> /sys/class/block/rbd0/queue/discard_max_bytes:4194304
> /sys/class/block/rbd0/queue/discard_max_hw_bytes:4194304
> /sys/class/block/rbd0/queue/discard_zeroes_data:0
> /sys/class/block/rbd0/queue/minimum_io_size:4194304
> /sys/class/block/rbd0/queue/optimal_io_size:4194304
> # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M
> dd: error writing '/dev/rbd/vmpool/test': No space left on device
> 257+0 records in
> 256+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.39016 s, 245 MB/s
> # rbd du --exact -p vmpool test
> NAME  PROVISIONED  USED
> test        1 GiB    1 GiB

[5] Host: Proxmox VE 8.2, but with an Ubuntu mainline kernel 6.11 build (6.11.0-061100-generic from https://kernel.ubuntu.com/mainline/v6.11/)
    Ceph: Proxmox build of 18.2.4, but happy to try a different build if needed.