Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

Hi Igor,

Thanks, that's very helpful.

So in this case the Ceph developers recommend that all osds originally
built under octopus be redeployed with default settings and that default
settings continue to be used going forward. Is that correct?

Thanks for your assistance,
Joel


On Tue, Mar 12, 2024 at 4:13 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

> Hi Joel,
>
> my primary statement would be - do not adjust "alloc size" settings on
> your own and use default values!
>
> We've had a pretty long and convoluted evolution of this stuff, so tuning
> recommendations and their aftermath depend greatly on the exact Ceph
> version, and using improper settings could result in severe performance
> impact and even data loss.
>
> The current state of the art is that we support a minimal allocation size of
> 4K for everything: both HDDs and SSDs, user and BlueFS data. The effective
> bluefs_shared_alloc_size (i.e. the allocation unit BlueFS generally uses when
> it allocates space for DB [meta]data) is 64K, but BlueFS can fall back to 4K
> allocations on its own if main disk space fragmentation is high. The higher
> base value (=64K) generally means less overhead for both performance and
> metadata mem/disk footprint. This approach shouldn't be applied to OSDs
> running legacy Ceph versions, though - they could lack proper support for
> some aspects of this stuff.
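>
> For reference, one way to check what a running OSD is currently using (a
> sketch; the osd id is a placeholder and this assumes access to the admin
> socket on the OSD host):
>
>     # configured values as seen by the running osd
>     ceph daemon osd.0 config show | grep alloc_size
>
> Note that this reports the configured values; the allocation unit an OSD was
> actually created with is persisted on the OSD itself (see the
> bfm_bytes_per_block discussion below).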
> Using the legacy 64K minimum allocation size for the block device (aka
> bfm_bytes_per_block) can sometimes result in significant space waste - in
> that case one should upgrade to a version which supports a 4K allocation
> unit and redeploy the legacy OSDs. Again, with no custom tunings for either
> new or old OSDs.
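>
> As a rough worked example of that waste (illustrative numbers only): with a
> 64K allocation unit, a 4 KiB object still occupies a full 64 KiB on disk,
> i.e. 16x space amplification for that object, while with a 4K allocation
> unit it occupies just 4 KiB. The effect compounds in erasure-coded pools,
> where every small chunk is rounded up to the allocation unit on each
> participating OSD.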
>
> So in short, your choice should be: upgrade, redeploy with default settings
> if needed, and keep using defaults.
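>
> For a cephadm-managed cluster, a minimal sketch of that flow (the osd id and
> the option names are placeholders; drain one failure domain at a time and
> wait for the cluster to return to HEALTH_OK in between):
>
>     # drop any custom alloc-size overrides so redeployed osds use defaults
>     ceph config rm osd bluestore_min_alloc_size_hdd
>     ceph config rm osd bluefs_shared_alloc_size
>
>     # drain, destroy and zap one osd; cephadm then redeploys it from the
>     # osd service spec with default settings
>     ceph orch osd rm 12 --replace --zap
>     ceph orch osd rm status    # watch drain/removal progress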
>
>
> Hope this helps.
>
> Thanks,
>
> Igor
> On 29/02/2024 01:55, Joel Davidow wrote:
>
> Summary
> ----------
> The relationship between the values configured for bluestore_min_alloc_size and bluefs_shared_alloc_size is reported to impact space amplification, partial overwrites in erasure-coded pools, and storage capacity as an osd becomes more fragmented and/or more full.
>
>
> Previous discussions including this topic
> ----------------------------------------
> comment #7 in bug 63618 in Dec 2023 - https://tracker.ceph.com/issues/63618#note-7
>
> pad writeup related to bug 62282 likely from late 2023 - https://pad.ceph.com/p/RCA_62282
>
> email sent 13 Sept 2023 in mail list discussion of cannot create new osd - https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/5M4QAXJDCNJ74XVIBIFSHHNSETCCKNMC/
>
> comment #9 in bug 58530 likely from early 2023 - https://tracker.ceph.com/issues/58530#note-9
>
> email sent 30 Sept 2021 in mail list discussion of flapping osds - https://www.mail-archive.com/ceph-users@xxxxxxx/msg13072.html
>
> email sent 25 Feb 2020 in mail list discussion of changing allocation size - https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/B3DGKH6THFGHALLX6ATJ4GGD4SVFNEKU/
>
>
> Current situation
> -----------------
> We have three Ceph clusters that were originally built via cephadm on octopus and later upgraded to pacific. All osds are HDD (will be moving to wal+db on SSD) and were resharded after the upgrade to enable rocksdb sharding.
>
> The value for bluefs_shared_alloc_size has remained unchanged at 65536.
>
> The value for bluestore_min_alloc_size_hdd was 65536 in octopus but is reported as 4096 by ceph daemon osd.<id> config show in pacific. However, the osd label after upgrading to pacific retains the value of 65536 for bfm_bytes_per_block. BitmapFreelistManager.h in the Ceph source code (src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is bdev_block_size. This suggests that the physical layout of the osd has not changed from 65536 despite the ceph daemon command reporting it as 4096. This interpretation is supported by the Minimum Allocation Size section of the Bluestore configuration reference for quincy (https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#minimum-allocation-size).
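>
> The bfm_bytes_per_block value mentioned above can be read from the BlueStore
> label (a sketch; the osd id and path are placeholders, and under cephadm the
> tool is run from inside the osd's container):
>
>     # enter the osd's container, then read the bluestore label via the osd data path
>     cephadm shell --name osd.0
>     ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | grep bfm_bytes_per_block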
>
> Questions
> ----------
> What are the pros and cons of the following three cases with two variations per case - when using co-located wal+db on HDD and when using separate wal+db on SSD:
> 1) bluefs_shared_alloc_size, bluestore_min_alloc_size, and bfm_bytes_per_block all equal
> 2) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size, with bfm_bytes_per_block equal to bluestore_min_alloc_size
> 3) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size, with bfm_bytes_per_block equal to bluefs_shared_alloc_size
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



