Hi Ilya,

Thank you for your illuminating response! I thought I had checked
`ceph df` during my experiments before, but apparently not carefully
enough. :)

On 25/10/2024 18:43, Ilya Dryomov wrote:
> "rbd du" can be very imprecise even with --exact flag: one can
> construct an image that would use less than 1% of its provisioned space
> but "rbd du --exact" would report 100% used. This is because "rbd du"
> works only at the object level, meaning that as long as even a small
> part of an object is there, the entire object is reported as used (for
> the most part, with one minor exception).
>
> The catch is that an object or some part of it being there doesn't mean
> that it actually consumes space on the OSDs.

Right, I now recall reading your remark [1] about `rbd du --exact` not
accounting for "holes" in the objects, and thus reporting numbers that
are too big.

With that in mind, I figured I could build such an image, i.e. one with
a large discrepancy between actual space usage and the usage reported
by `rbd du --exact`, by writing data only to the "tail" of each 4M
object. I tried with an image (4G for nicer numbers) in an otherwise
empty pool:

# rbd create -p vmpool test --size 4G
# rbd map -p vmpool test
# for i in $(seq 3 4 4096); do dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=1M oseek=${i} count=1; done

`rbd du --exact` reports:

# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        4 GiB  4 GiB

but according to `ceph df`, only 1G is actually used in the pool:

--- POOLS ---
POOL    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
vmpool   4   32  1.0 GiB    1.03k  3.0 GiB   1.11     89 GiB

So this is an image where `rbd du --exact` reports 100% used, but which
takes up only 25% of its provisioned space.

>> attached via QEMU's librbd integration, fstrim seems to work much
>> better. I've found an earlier discussion [0] according to which, for
>> fstrim to work properly, the filesystem should be aligned to object
>> size (4M) boundaries. Indeed, in the test setups I've looked at, the
>> filesystem is not aligned to 4M boundaries.
>>
>> Still, I'm wondering if there might be a solution that doesn't
>> require a specific partitioning/filesystem layout. To have a simpler
>> test setup, I'm not looking at VMs, but instead at unaligned
>> blkdiscard on a KRBD-backed block device (on the host).
>>
>> On my test cluster (for versions see [5]), I create a 1G test volume,
>> map it with default settings, write random data to it, and then issue
>> blkdiscard with a 1M offset (see [1] for complete commands):
>>
>>> # blkdiscard --offset 1M /dev/rbd/vmpool/test
>>
>> `rbd du --exact` then reports a size of 256M:
>>
>>> # rbd du --exact -p vmpool test
>>> NAME  PROVISIONED  USED
>>> test        1 GiB  256 MiB
>
> Try the same test, but look at the STORED column of "ceph df" output
> for the pool in question. Note the starting value, after writing 1G
> you should see it increase by 1G and after running that blkdiscard
> command it should decrease by 1023M, despite "rbd du --exact"
> reporting 256M as used.

Right, this is exactly what happens. After the blkdiscard:

# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  256 MiB

but:

# ceph df
--- POOLS ---
POOL    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
[...]
vmpool   4   32  1.0 MiB      262  3.1 MiB      0     90 GiB

Only 1 MiB of data is STORED, though the objects are still there.

I see that `rbd sparsify` cleans up the objects, but it doesn't seem to
play well with a VM also accessing the block device (due to exclusive
locks). It might be nice if these objects could be cleaned up somehow
without having to stop the VM, but I agree that with respect to the
data actually stored on the OSDs, the objects probably don't matter.

>> My expectation is that this could negatively impact non-discard IO
>> performance (write amplification?). But I am unsure, as I ran a few
>> small benchmarks and couldn't really see any difference between the
>> two settings.
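As an aside, both observations fit the same mental model: `rbd du
--exact` sums each object's logical extent, holes included, while the
STORED column of `ceph df` counts only the bytes actually allocated on
the OSDs. A toy Python model (my own sketch, not Ceph code) reproduces
the numbers from the tail-write experiment above:

```python
MIB = 1 << 20
GIB = 1 << 30

class FakeObject:
    """Toy model of one RADOS object backing a 4M chunk of an RBD image."""
    def __init__(self):
        self.size = 0  # logical size, what `rbd du --exact` sees
        self.data = 0  # bytes actually allocated, what STORED counts

    def write(self, off, length):
        # Writing past the current end extends the logical size; any gap
        # before `off` becomes a hole that consumes no space on the OSDs.
        self.size = max(self.size, off + length)
        self.data += length  # assume no overlapping writes

# Tail-write experiment: 1M of data at the tail of each of the 1024
# objects of a 4G image (the dd loop with oseek = 3, 7, ..., 4095).
objects = [FakeObject() for _ in range(1024)]
for obj in objects:
    obj.write(3 * MIB, 1 * MIB)

du_exact = sum(o.size for o in objects)  # object-level view
stored   = sum(o.data for o in objects)  # byte-level view

print(du_exact // GIB, "GiB")  # 4 GiB, matches `rbd du --exact`
print(stored // GIB, "GiB")    # 1 GiB, matches the STORED column
```

The same model also explains the blkdiscard case: the surviving objects
keep a small logical extent, so `rbd du --exact` still reports 256M
even though almost nothing remains allocated.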
>> Thus, my questions:
>>
>> - Should I expect any downside for non-discard IO after setting
>>   `alloc_size` to 4M?
>
> There is a major downside even for discard I/O. Bumping alloc_size
> to 4M would make the RBD driver ignore _all_ discard requests that are
> smaller than 4M -- which would amount to nearly all discard requests
> in regular setups.

I tried to reproduce this and noticed that indeed, with alloc_size=4M,
most <4M discard requests are ignored -- with the exception of requests
corresponding exactly to an object tail, e.g.:

# grep '' /sys/class/block/rbd*/device/config_info
10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789 name=admin,key=client.admin,alloc_size=4194304 vmpool test -
# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  1 GiB
# blkdiscard --offset 1M --length 3M /dev/rbd/vmpool/test
# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  1021 MiB

I guess this is because the kernel driver doesn't enter the
corresponding `if` block when alloc_size == object_size and the discard
corresponds to an object tail [2].

>> - If yes: would it be feasible for KRBD to decouple
>>   `discard_granularity` and `minimum_io_size`, i.e., expose an option
>>   to set only `discard_granularity` to 4M?
>
> I would advise against setting the alloc_size option to anything
> higher than the default of 64k.

Makes sense. Thanks for clearing up my confusion!

Best wishes,
Friedrich

[1] https://www.spinics.net/lists/ceph-users/msg67776.html
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?h=v6.11&id=81983758430957d9a5cb3333fe324fd70cf63e7e#n2298

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
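P.S. For anyone skimming the archive: the object-tail special case
discussed above can be approximated with a small Python paraphrase of
the range rounding that, as far as I can tell from rbd.c [2], the
kernel driver applies to each per-object discard. The function name and
exact control flow are my own, not the kernel's:

```python
MIB = 1 << 20

def krbd_discard_kept(obj_off, length, alloc_size, object_size=4 * MIB):
    """Sketch of the rounding logic around rbd_obj_init_discard().

    A per-object discard survives only if something is left after the
    range [obj_off, obj_off + length) is shrunk inward to alloc_size
    boundaries -- except when alloc_size == object_size and the range
    runs to the end of the object, which is handled as a truncate.
    """
    if alloc_size == object_size and obj_off + length == object_size:
        return True  # the "object tail" special case observed above

    start = -(-obj_off // alloc_size) * alloc_size          # round up
    end = ((obj_off + length) // alloc_size) * alloc_size   # round down
    return start < end

# With alloc_size=4M, only the tail-aligned discard gets through:
print(krbd_discard_kept(1 * MIB, 3 * MIB, alloc_size=4 * MIB))  # True
print(krbd_discard_kept(1 * MIB, 2 * MIB, alloc_size=4 * MIB))  # False
# With the default alloc_size=64k, both are honoured:
print(krbd_discard_kept(1 * MIB, 3 * MIB, alloc_size=64 * 1024))  # True
```

This matches the observation above: `blkdiscard --offset 1M --length
3M` ends exactly at the first object's 4M boundary, so it is kept (as a
truncate), while other sub-4M discards are dropped.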