Hi Ilya,

Thank you for your illuminating response! I thought I had checked
`ceph df` during my experiments before, but apparently not carefully
enough. :)

On 25/10/2024 18:43, Ilya Dryomov wrote:
> "rbd du" can be very imprecise even with --exact flag: one can
> construct an image that would use less than 1% of its provisioned space
> but "rbd du --exact" would report 100% used. This is because "rbd du"
> works only at the object level, meaning that as long as even a small
> part of an object is there, the entire object is reported as used (for
> the most part, with one minor exception).
>
> The catch is that an object or some part of it being there doesn't mean
> that it actually consumes space on the OSDs.

Right, I now recall reading your remark [1] about `rbd du --exact` not
accounting for "holes" in the objects, and thus reporting numbers that
are too big.

With that in mind, I figured I could build such an image, i.e. one with
a large discrepancy between actual space usage and the usage reported
by `rbd du --exact`, by writing data only to the "tail" of each 4M
object. I tried with an image (4G for nicer numbers) in an otherwise
empty pool:

# rbd create -p vmpool test --size 4G
# rbd map -p vmpool test
# for i in $(seq 3 4 4096); do dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=1M oseek=${i} count=1; done

`rbd du --exact` reports:

# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        4 GiB  4 GiB

but according to `ceph df`, only 1G is actually used in the pool:

--- POOLS ---
POOL    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
vmpool   4   32  1.0 GiB    1.03k  3.0 GiB   1.11     89 GiB

So this is an image where `rbd du --exact` reports 100% used, but which
takes up only 25% of its provisioned space.

>> attached via QEMU's librbd integration, fstrim seems to work much
>> better. I've found an earlier discussion [0] according to which, for
>> fstrim to work properly, the filesystem should be aligned to object
>> size (4M) boundaries. Indeed, in the test setups I've looked at, the
>> filesystem is not aligned to 4M boundaries.
>>
>> Still, I'm wondering if there might be a solution that doesn't
>> require a specific partitioning/filesystem layout. To have a simpler
>> test setup, I'm not looking at VMs, but instead at unaligned
>> blkdiscard on a KRBD-backed block device (on the host).
>>
>> On my test cluster (for versions see [5]), I create a 1G test volume,
>> map it with default settings, write random data to it, and then issue
>> blkdiscard with a 1M offset (see [1] for complete commands):
>>
>>> # blkdiscard --offset 1M /dev/rbd/vmpool/test
>>
>> `rbd du --exact` then reports a size of 256M:
>>
>>> # rbd du --exact -p vmpool test
>>> NAME  PROVISIONED  USED
>>> test        1 GiB  256 MiB
>
> Try the same test, but look at the STORED column of "ceph df" output
> for the pool in question. Note the starting value, after writing 1G
> you should see it increase by 1G and after running that blkdiscard
> command it should decrease by 1023M, despite "rbd du --exact"
> reporting 256M as used.

Right, this is exactly what happens. After the blkdiscard:

# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  256 MiB

but:

# ceph df
--- POOLS ---
POOL    ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
[...]
vmpool   4   32  1.0 MiB      262  3.1 MiB      0     90 GiB

Only 1 MiB of data is STORED, though the objects are still there.

I see that `rbd sparsify` cleans up the objects, but it doesn't seem to
play well with a VM also accessing the block device (due to exclusive
locks). It might be nice if these objects could be cleaned up somehow
without having to stop the VM, but I agree that with respect to the
data actually stored on the OSDs, the objects probably don't matter.

>> My expectation is that this could negatively impact non-discard IO
>> performance (write amplification?). But I am unsure, as I ran a few
>> small benchmarks and couldn't really see any difference between the
>> two settings.
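As an aside, both observations fit the same mental model: `rbd du
--exact` sums each object's logical extent, holes included, while the
STORED column of `ceph df` counts only the bytes actually allocated on
the OSDs. A toy Python model (my own sketch, not Ceph code) reproduces
the numbers from the tail-write experiment above:

```python
MIB = 1 << 20
GIB = 1 << 30

class FakeObject:
    """Toy model of one RADOS object backing a 4M chunk of an RBD image."""
    def __init__(self):
        self.size = 0  # logical size, what `rbd du --exact` sees
        self.data = 0  # bytes actually allocated, what STORED counts

    def write(self, off, length):
        # Writing past the current end extends the logical size; any gap
        # before `off` becomes a hole that consumes no space on the OSDs.
        self.size = max(self.size, off + length)
        self.data += length  # assume no overlapping writes

# Tail-write experiment: 1M of data at the tail of each of the 1024
# objects of a 4G image (the dd loop with oseek = 3, 7, ..., 4095).
objects = [FakeObject() for _ in range(1024)]
for obj in objects:
    obj.write(3 * MIB, 1 * MIB)

du_exact = sum(o.size for o in objects)  # object-level view
stored   = sum(o.data for o in objects)  # byte-level view

print(du_exact // GIB, "GiB")  # 4 GiB, matches `rbd du --exact`
print(stored // GIB, "GiB")    # 1 GiB, matches the STORED column
```

The same model also explains the blkdiscard case: the surviving objects
keep a small logical extent, so `rbd du --exact` still reports 256M
even though almost nothing remains allocated.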
>> Thus, my questions:
>>
>> - Should I expect any downside for non-discard IO after setting
>>   `alloc_size` to 4M?
>
> There is a major downside even for discard I/O. Bumping alloc_size
> to 4M would make the RBD driver ignore _all_ discard requests that are
> smaller than 4M -- which would amount to nearly all discard requests
> in regular setups.

I tried to reproduce this and noticed that indeed, with alloc_size=4M,
most <4M discard requests are ignored -- with the exception of requests
corresponding exactly to an object tail, e.g.:

# grep '' /sys/class/block/rbd*/device/config_info
10.1.1.201:6789,10.1.1.202:6789,10.1.1.203:6789 name=admin,key=client.admin,alloc_size=4194304 vmpool test -
# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  1 GiB
# blkdiscard --offset 1M --length 3M /dev/rbd/vmpool/test
# rbd du --exact -p vmpool test
NAME  PROVISIONED  USED
test        1 GiB  1021 MiB

I guess this is because the kernel driver doesn't enter the
corresponding `if` block when alloc_size == object_size and the discard
corresponds to an object tail [2].

>> - If yes: would it be feasible for KRBD to decouple
>>   `discard_granularity` and `minimum_io_size`, i.e., expose an option
>>   to set only `discard_granularity` to 4M?
>
> I would advise against setting the alloc_size option to anything
> higher than the default of 64k.

Makes sense. Thanks for clearing up my confusion!

Best wishes,
Friedrich

[1] https://www.spinics.net/lists/ceph-users/msg67776.html
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/rbd.c?h=v6.11&id=81983758430957d9a5cb3333fe324fd70cf63e7e#n2298

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
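P.S. For anyone skimming the archive: the object-tail special case
discussed above can be approximated with a small Python paraphrase of
the range rounding that, as far as I can tell from rbd.c [2], the
kernel driver applies to each per-object discard. The function name and
exact control flow are my own, not the kernel's:

```python
MIB = 1 << 20

def krbd_discard_kept(obj_off, length, alloc_size, object_size=4 * MIB):
    """Sketch of the rounding logic around rbd_obj_init_discard().

    A per-object discard survives only if something is left after the
    range [obj_off, obj_off + length) is shrunk inward to alloc_size
    boundaries -- except when alloc_size == object_size and the range
    runs to the end of the object, which is handled as a truncate.
    """
    if alloc_size == object_size and obj_off + length == object_size:
        return True  # the "object tail" special case observed above

    start = -(-obj_off // alloc_size) * alloc_size          # round up
    end = ((obj_off + length) // alloc_size) * alloc_size   # round down
    return start < end

# With alloc_size=4M, only the tail-aligned discard gets through:
print(krbd_discard_kept(1 * MIB, 3 * MIB, alloc_size=4 * MIB))  # True
print(krbd_discard_kept(1 * MIB, 2 * MIB, alloc_size=4 * MIB))  # False
# With the default alloc_size=64k, both are honoured:
print(krbd_discard_kept(1 * MIB, 3 * MIB, alloc_size=64 * 1024))  # True
```

This matches the observation above: `blkdiscard --offset 1M --length
3M` ends exactly at the first object's 4M boundary, so it is kept (as a
truncate), while other sub-4M discards are dropped.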