Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

Hi Joel,

generally speaking, OSD redeployment is needed only to apply the 64K-to-4K min_alloc_size downgrade for the block device. The other improvements (including 4K allocation unit support for BlueFS) are applied to existing OSDs automatically once the relevant Ceph release is installed.

So yes - if you have octopus-deployed OSDs, you need to redeploy them with the new default settings.

The latest pacific minor release (v16.2.15) has all the "space allocation" improvements I'm aware of. Quincy is one step behind, as the changes brought by https://github.com/ceph/ceph/pull/54877 haven't been published yet - they will come in the next minor Quincy release.

One of the bugs (https://tracker.ceph.com/issues/63618) fixed by the above PR could be of particular interest for you - it's a pretty severe issue for legacy-deployed OSDs that pops up after an upgrade to pacific coupled with a custom bluefs_shared_alloc_size setting (<64K). Just so you're aware, and as an example of why I discourage everyone from using custom settings ;)
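
If you want to check whether any such custom override exists in your clusters, a minimal sketch along these lines may help (it assumes the "section", "name" and "value" fields that recent releases return from "ceph config dump --format json"; adjust to your environment):

#!/usr/bin/env python3
# Minimal sketch: flag custom bluefs_shared_alloc_size overrides, since
# values below 64K on legacy-deployed OSDs are the case that
# https://tracker.ceph.com/issues/63618 describes.
# Assumes "ceph config dump --format json" returns a JSON array of
# entries carrying "section", "name" and "value" fields.
import json
import subprocess

DEFAULT = 64 * 1024  # shipped default, 64K

dump = json.loads(subprocess.check_output(
    ["ceph", "config", "dump", "--format", "json"]))
for opt in dump:
    if opt.get("name") != "bluefs_shared_alloc_size":
        continue
    value = int(opt["value"])
    if value != DEFAULT:
        note = " (<64K - potentially affected)" if value < DEFAULT else ""
        print(f"{opt['section']}: bluefs_shared_alloc_size={value}{note}")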


Thanks,

Igor

On 3/12/2024 7:45 PM, Joel Davidow wrote:
Hi Igor,

Thanks, that's very helpful.

So in this case the Ceph developers recommend that all osds originally built under octopus be redeployed with default settings and that default settings continue to be used going forward. Is that correct?

Thanks for your assistance,
Joel


On Tue, Mar 12, 2024 at 4:13 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Hi Joel,

    my primary statement would be: do not adjust "alloc size"
    settings on your own - use the default values!

    This stuff has gone through a pretty long and convoluted
    evolution, so tuning recommendations and their aftermath greatly
    depend on the exact Ceph version. Meanwhile, using improper
    settings can result in severe performance impact and even data
    loss.

    The current state of the art is that we support a minimal
    allocation size of 4K for everything: both HDDs and SSDs, user
    and BlueFS data. The effective bluefs_shared_alloc_size (i.e. the
    allocation unit BlueFS generally uses when allocating space for
    DB [meta]data) is 64K, but BlueFS can fall back to 4K allocations
    on its own if main disk space fragmentation is high. The higher
    base value (=64K) generally means less overhead in both
    performance and metadata mem/disk footprint. This approach
    shouldn't be applied to OSDs running legacy Ceph versions,
    though - they may lack proper support for some aspects of this
    stuff.
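
    To check what a running OSD is configured with, a minimal sketch
    along these lines should do (it assumes the small JSON document
    that "ceph daemon osd.<id> config get" returns over the admin
    socket; adjust the OSD id to your environment). Note that these
    are the configured values, not necessarily the allocation unit an
    existing OSD was actually created with:

    #!/usr/bin/env python3
    # Minimal sketch: print the allocation-related options of a running
    # OSD through its admin socket. These are the configured values; the
    # allocation unit an existing OSD was actually created with can
    # differ (see the rest of the thread).
    import json
    import subprocess
    import sys

    osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
    for opt in ("bluestore_min_alloc_size_hdd",
                "bluestore_min_alloc_size_ssd",
                "bluefs_shared_alloc_size"):
        out = json.loads(subprocess.check_output(
            ["ceph", "daemon", f"osd.{osd_id}", "config", "get", opt]))
        print(f"{opt} = {out[opt]}")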

    Using the legacy 64K min allocation size for the block device
    (aka bfm_bytes_per_block) can sometimes result in significant
    space waste - in that case one should upgrade to a version that
    supports the 4K alloc unit and redeploy the legacy OSDs. Again,
    with no custom tunings for either new or old OSDs.
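
    As a back-of-the-envelope illustration of where that waste comes
    from, here is a small sketch with a purely hypothetical 5 KiB
    chunk (e.g. a small object or EC shard); every write gets rounded
    up to a whole allocation unit:

    # Back-of-the-envelope sketch of allocation overhead for small writes.
    # The 5 KiB chunk size is purely hypothetical; every write is rounded
    # up to a whole number of allocation units.
    def allocated(size_bytes, alloc_unit):
        # Round up to a multiple of the allocation unit.
        return -(-size_bytes // alloc_unit) * alloc_unit

    chunk = 5 * 1024  # hypothetical 5 KiB chunk
    for unit in (64 * 1024, 4 * 1024):
        used = allocated(chunk, unit)
        print(f"alloc unit {unit // 1024:>2} KiB: {used // 1024} KiB "
              f"allocated ({used / chunk:.1f}x amplification)")
    # alloc unit 64 KiB: 64 KiB allocated (12.8x amplification)
    # alloc unit  4 KiB: 8 KiB allocated (1.6x amplification)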

    So, in short, your choice should be: upgrade, redeploy with
    default settings if needed, and keep using defaults.


    Hope this helps.

    Thanks,

    Igor

    On 29/02/2024 01:55, Joel Davidow wrote:
    Summary
    ----------
    The relationship between the values configured for bluestore_min_alloc_size and bluefs_shared_alloc_size is reported to impact space amplification, partial overwrites in erasure-coded pools, and storage capacity as an osd becomes more fragmented and/or more full.


    Previous discussions including this topic
    ----------------------------------------
    comment #7 in bug 63618 in Dec 2023 - https://tracker.ceph.com/issues/63618#note-7

    pad writeup related to bug 62282 likely from late 2023 - https://pad.ceph.com/p/RCA_62282

    email sent 13 Sept 2023 in mail list discussion of cannot create new osd - https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/5M4QAXJDCNJ74XVIBIFSHHNSETCCKNMC/

    comment #9 in bug 58530 likely from early 2023 - https://tracker.ceph.com/issues/58530#note-9

    email sent 30 Sept 2021 in mail list discussion of flapping osds - https://www.mail-archive.com/ceph-users@xxxxxxx/msg13072.html

    email sent 25 Feb 2020 in mail list discussion of changing allocation size - https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/B3DGKH6THFGHALLX6ATJ4GGD4SVFNEKU/


    Current situation
    -----------------
    We have three Ceph clusters that were originally built via cephadm on octopus and later upgraded to pacific. All osds are HDD (will be moving to wal+db on SSD) and were resharded after the upgrade to enable rocksdb sharding.

    The value for bluefs_shared_alloc_size has remained unchanged at the default of 65536.

    The value for bluestore_min_alloc_size_hdd was 65536 in octopus but is reported as 4096 by ceph daemon osd.<id> config show in pacific. However, the osd label after upgrading to pacific retains the value of 65536 for bfm_bytes_per_block. BitmapFreelistManager.h in the Ceph source code (src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is bdev_block_size. This indicates that the physical layout of the osd has not changed from 65536 despite the ceph daemon command reporting it as 4096. This interpretation is supported by the Minimum Allocation Size section of the Bluestore configuration reference for quincy (https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#minimum-allocation-size).
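
    For reference, a minimal sketch of that comparison (assuming
    "ceph-bluestore-tool show-label" returns JSON keyed by device path
    and that the main block device entry carries bfm_bytes_per_block;
    adjust the OSD id and path to the environment):

    #!/usr/bin/env python3
    # Minimal sketch: compare the configured min_alloc_size reported by a
    # running OSD with the on-disk bfm_bytes_per_block from its label.
    # Assumes "ceph-bluestore-tool show-label" output is JSON keyed by
    # device path; verify against your release before relying on it.
    import json
    import subprocess
    import sys

    osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"

    cfg = json.loads(subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "config", "get",
         "bluestore_min_alloc_size_hdd"]))
    print("config bluestore_min_alloc_size_hdd:",
          cfg["bluestore_min_alloc_size_hdd"])

    labels = json.loads(subprocess.check_output(
        ["ceph-bluestore-tool", "show-label",
         "--path", f"/var/lib/ceph/osd/ceph-{osd_id}"]))
    for dev, fields in labels.items():
        if "bfm_bytes_per_block" in fields:
            print(f"{dev} bfm_bytes_per_block:", fields["bfm_bytes_per_block"])
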
    Questions
    ----------
    What are the pros and cons of the following three cases with two variations per case - when using co-located wal+db on HDD and when using separate wal+db on SSD:
    1) bluefs_shared_alloc_size, bluestore_min_alloc_size, and bfm_bytes_per_block all equal
    2) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size with bfm_bytes_per_block equal to bluestore_min_alloc_size
    3) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size with bfm_bytes_per_block equal to bluefs_shared_alloc_size

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



