Re: Small RGW objects and RADOS 64KB minimum size

Josh Durgin <jdurgin@xxxxxxxxxx> · Tue, 16 Feb 2021 10:14:02 -0800

Changing min_alloc_size in bluestore requires redeploying the OSD.
There's no other way to regain the space that's already allocated.

In terms of making this easier, we're looking to automate rolling format
changes across a cluster with cephadm in the future.

Josh

On 2/16/21 9:58 AM, Steven Pine wrote:
Will there be a well-documented strategy / method for changing block sizes
on existing clusters? Is there anything that could be done to optimize or
assist clusters in the cutover?

On Tue, Feb 16, 2021 at 3:41 AM Loïc Dachary <loic@xxxxxxxxxxx> wrote:

Hi Josh :-)

Thanks for the update: this is great news and I look forward to using this
once Pacific is released.

Cheers

On 16/02/2021 00:43, Josh Durgin wrote:
Hello Loic!

We have developed a strategy in Pacific: reducing the min_alloc_size
for HDD to 4KB by default.

Igor Fedotov did a lot of investigation and benchmarking, and came up
with some improvements to bluestore [1][2] to make this change have
little performance impact (it even increases performance in many cases).

Josh

[0] https://github.com/ceph/ceph/pull/34588
[1] https://github.com/ceph/ceph/pull/33434
[2] https://github.com/ceph/ceph/pull/33365

On 2/14/21 9:21 AM, Loïc Dachary wrote:
Bonjour,

Reading Karan's blog post from last year about benchmarking the insertion
of billions of objects into Ceph via S3 / RGW[0], I found this passage:

we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test.
As represented in chart-5, the object creation rate found to be notably
reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB
(default) to 18KB. As such, for objects larger than the
bluestore_min_alloc_size_hdd , the default values seems to be optimal,
smaller objects further require more investigation if you intended to
reduce bluestore_min_alloc_size_hdd parameter.

There is also a 2018 mail thread on this topic that reaches the same
conclusion, although it uses RADOS directly rather than RGW[3]. I read the
RGW data layout page in the documentation[1] and concluded that, by default,
every object inserted with S3 / RGW will indeed use at least 64KB. A pull
request from last year[2] seems to confirm this and also suggests that
modifying bluestore_min_alloc_size_hdd has adverse side effects.
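To put rough numbers on that overhead, here is a minimal sketch of the
rounding involved. It assumes only that on-disk data is allocated in whole
multiples of min_alloc_size; it ignores replication, erasure coding, and
metadata, and the function names are made up for illustration, not taken
from Ceph.

```python
def allocated_bytes(object_size: int, min_alloc_size: int) -> int:
    """Round object_size up to a whole number of allocation units.

    Simplified model: data is assumed to occupy full min_alloc_size
    units. Replication, erasure coding and metadata are not counted.
    """
    if object_size < 0 or min_alloc_size <= 0:
        raise ValueError("object_size must be >= 0 and min_alloc_size > 0")
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


def amplification(object_size: int, min_alloc_size: int) -> float:
    """Ratio of allocated space to logical size (object_size must be > 0)."""
    if object_size <= 0:
        raise ValueError("object_size must be > 0")
    return allocated_bytes(object_size, min_alloc_size) / object_size


if __name__ == "__main__":
    KB = 1024
    for size in (1 * KB, 18 * KB, 100 * KB):
        for alloc in (64 * KB, 4 * KB):
            print(size, alloc, allocated_bytes(size, alloc),
                  round(amplification(size, alloc), 2))
```

Under this model a 1KB object occupies 64KB with a 64KB allocation unit
and 4KB with a 4KB unit, so the penalty is concentrated on objects much
smaller than the allocation unit.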

That being said, I'm curious to know if people have developed strategies to
cope with this overhead. Someone mentioned packing objects together
client-side to make them larger. But maybe there are simpler ways to do the
same?
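For illustration, client-side packing could look roughly like the sketch
below. It is only an in-memory outline under stated assumptions: the pack
and unpack helpers and the index layout are hypothetical, nothing here
calls RGW or S3, and a real client would also have to store the index
somewhere and handle deletes and updates.

```python
from typing import Dict, Tuple

# Hypothetical index layout: object name -> (offset, length) in the blob.
Index = Dict[str, Tuple[int, int]]


def pack(objects: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small objects into one blob and record where each one is."""
    blob = bytearray()
    index: Index = {}
    for name, data in objects.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def unpack(blob: bytes, index: Index, name: str) -> bytes:
    """Return the bytes of one packed object using its index entry."""
    offset, length = index[name]
    return blob[offset:offset + length]


if __name__ == "__main__":
    small = {"a.txt": b"hello", "b.txt": b"", "c.txt": b"world!"}
    blob, index = pack(small)
    print(len(blob), index)
    print(unpack(blob, index, "c.txt"))
```

The blob would then be uploaded as a single larger object, and one member
could be fetched with a ranged read using its offset and length. The
trade-off is that the client now owns the index.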

Cheers

[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html

--
Loïc Dachary, Artisan Logiciel Libre

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
