Re: KRBD: downside of setting alloc_size=4M for discard alignment?

On Fri, Oct 25, 2024 at 11:03 AM Friedrich Weber <f.weber@xxxxxxxxxxx> wrote:
>
> Hi,
>
> Some of our Proxmox VE users have noticed that a large fstrim inside a
> QEMU/KVM guest does not free up as much space as expected on the backing
> RBD image -- if the image is mapped on the host via KRBD and passed to
> QEMU as a block device (checked via `rbd du --exact`). If the image is

Hi Friedrich,

"rbd du" can be very imprecise even with --exact flag: one can
construct an image that would use less than 1% of its provisioned space
but "rbd du --exact" would report 100% used.  This is because "rbd du"
works only at the object level, meaning that as long as even a small
part of an object is there, the entire object is reported as used (for
the most part, with one minor exception).

The catch is that an object or some part of it being there doesn't mean
that it actually consumes space on the OSDs.
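
To make this concrete, here is a rough sketch (the vmpool/test-sparse
image name is made up): writing just 4K into each 4M object of a 1G
image leaves "rbd du --exact" reporting close to the full 1 GiB as
used, even though only 1M of data was actually written:

  # rbd create --size 1G vmpool/test-sparse
  # rbd map vmpool/test-sparse
  # for i in $(seq 0 255); do \
        dd if=/dev/urandom of=/dev/rbd/vmpool/test-sparse bs=4K count=1 \
           seek=$((i * 1024)) oflag=direct status=none; \
    done
  # rbd du --exact -p vmpool test-sparse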

> attached via QEMU's librbd integration, fstrim seems to work much
> better. I've found an earlier discussion [0] according to which, for
> fstrim to work properly, the filesystem should be aligned at object size
> (4M) boundaries. Indeed, in the test setups I've looked at, the
> filesystem is not aligned to 4M boundaries.
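
As an aside, whether a partition (and the filesystem on it) is
4M-aligned can be checked from the partition start offset.  A rough
sketch, assuming the device shows up as /dev/rbd0 with a first
partition rbd0p1 (the start attribute is in 512-byte sectors):

  # cat /sys/block/rbd0/rbd0p1/start
  # echo $(( $(cat /sys/block/rbd0/rbd0p1/start) * 512 % 4194304 ))

A result of 0 from the second command means the partition starts on
a 4M boundary.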
>
> Still, I'm wondering if there might be a solution that doesn't require a
> specific partitioning/filesystem layout. To have a simpler test setup,
> I'm not looking at VMs and instead into unaligned blkdiscard on a
> KRBD-backed block device (on the host).
>
> On my test cluster (for versions see [5]), I create a 1G test volume,
> map it with default settings, write random data to it, and then issue
> blkdiscard with a 1M offset (see [1] for complete commands):
>
> > # blkdiscard --offset 1M /dev/rbd/vmpool/test
>
> An `rbd du --exact` reports a size of 256M:
>
> > # rbd du --exact -p vmpool test
> > NAME  PROVISIONED  USED
> > test        1 GiB  256 MiB

Try the same test, but look at the STORED column of "ceph df" output
for the pool in question.  Note the starting value: after writing 1G
you should see it increase by 1G, and after running that blkdiscard
command it should decrease by 1023M, despite "rbd du --exact"
reporting 256M as used.

"ceph df" shows how much space is actually consumed on the OSDs, so
this should demonstrate that everything is freed up by blkdiscard, it's
just not reported as freed by "rbd du".  The same sort of test can be
done for fstrim.
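
Concretely, starting from a fresh (or fully trimmed) 1G vmpool/test
image as in [1] (pool statistics in "ceph df" may take a few seconds
to refresh, so allow a moment between steps):

  # ceph df | grep vmpool     # note the pool's STORED value
  # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M count=256 oflag=direct
  # ceph df | grep vmpool     # STORED should be ~1 GiB higher
  # blkdiscard --offset 1M /dev/rbd/vmpool/test
  # ceph df | grep vmpool     # STORED should drop by ~1023 MiB
  # rbd du --exact -p vmpool test   # still reports 256 MiB used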

>
> Naively I would expect a result between 1 and 4M, my reasoning being
> that the 1023M discard could be split into 3M (to get to 4M alignment)
> plus 1020M. But I've checked the kernel's discard splitting logic [2],
> and as far as I understand it, it aims to align the discard requests to
> `discard_granularity`, which is 64k here:
>
> > /sys/class/block/rbd0/queue/discard_granularity:65536
>
> I've found I can set the `alloc_size` option [3] to 4M, which sets
> `discard_granularity` to 4M. The result of the blkdiscard is much closer
> to my expectations (see [4] for complete commands).
>
> > # blkdiscard --offset 1M /dev/rbd/vmpool/test
> > # rbd du --exact -p vmpool test
> > NAME  PROVISIONED  USED
> > test        1 GiB  1 MiB
>
> However, apparently with `alloc_size` set to 4M, `minimum_io_size` is
> also set to 4M (it was 64k before, see [1]):
>
> > /sys/class/block/rbd0/queue/minimum_io_size:4194304
>
> My expectation is that this could negatively impact non-discard IO
> performance (write amplification?). But I am unsure, as I ran a few
> small benchmarks and couldn't really see any difference between the two
> settings. Thus, my questions:
>
> - Should I expect any downside for non-discard IO after setting
> `alloc_size` to 4M?

There is a major downside even for discard I/O.  Bumping alloc_size
to 4M would make the RBD driver ignore _all_ discard requests that
are smaller than 4M -- which would amount to nearly all discard
requests in regular setups.
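
If you want to see this effect directly, a quick check (same
vmpool/test image as before; the image has to be unmapped and
re-mapped for the new alloc_size to take effect):

  # rbd unmap /dev/rbd/vmpool/test
  # rbd map -o alloc_size=4194304 vmpool/test
  # dd if=/dev/urandom of=/dev/rbd/vmpool/test bs=4M count=256 oflag=direct
  # ceph df | grep vmpool     # note the pool's STORED value
  # blkdiscard --offset 8M --length 1M /dev/rbd/vmpool/test
  # ceph df | grep vmpool     # STORED should be unchanged

With the default alloc_size of 64k, the same 1M blkdiscard should
show up as a roughly 1M drop in STORED instead.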

> - If yes: would it be feasible for KRBD to decouple
> `discard_granularity` and `minimum_io_size`, i.e., expose an option to
> set only `discard_granularity` to 4M?

I would advise against setting the alloc_size option to anything
higher than the default of 64k.
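
For reference, with the default mapping both of the related queue
limits stay at 64k; assuming the device comes up as /dev/rbd0 again:

  # rbd map vmpool/test
  # cat /sys/class/block/rbd0/queue/discard_granularity \
        /sys/class/block/rbd0/queue/minimum_io_size

should print 65536 twice, matching the values in [1].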

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx