Re: down OSDs, Bluestore out of space, unable to restart


 



Hi Frederic,

here is an overview of the case where BlueFS is unable to allocate more space on the main/shared device although free space is still available. Below I'm describing the behaviour that existed before https://tracker.ceph.com/issues/53466 was fixed.

First of all, BlueFS's minimal allocation unit for the shared device was bluefs_shared_alloc_size (64K by default). This means that when it needed an additional 64K it could not use e.g. 2x32K or 16x4K chunks instead.
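
The value in effect can be checked with the regular config commands (just a quick sketch; the option name is the one mentioned above):

  ceph config get osd bluefs_shared_alloc_size              # cluster-wide default
  ceph daemon osd.<id> config get bluefs_shared_alloc_size  # a specific running OSD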

Secondly, RocksDB sometimes performs recovery, and other maintenance tasks that require space allocation, on startup. This evidently triggers allocation of N*64K chunks from the shared device.

Thirdly, a while ago we switched to 4K chunk allocations for user data (not to be confused with BlueFS allocation). This can potentially produce a free space fragmentation pattern in which the set of long (>=64K) free chunks is limited or even empty, while technically there is still enough free space available. E.g. the free extent list could look like this (off~len, both in hex):

0x0~1000, 0x2000~1000, 0x4000~2000, 0x10000~4000, 0x2000~1000, etc...

In that case the original BlueFS allocator implementation was unable to locate more free space, which in turn effectively broke both RocksDB and OSD boot-up.
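
For anyone wanting to see whether an OSD's main device has degraded into such a pattern, a rough sketch using ceph-bluestore-tool (which works offline, i.e. with the OSD stopped; the path below is just the one from John's logs) might be:

  ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-242 --allocator block   # fragmentation score (higher means more fragmented)
  ceph-bluestore-tool free-dump  --path /var/lib/ceph/osd/ceph-242 --allocator block   # dump the free extent list (offset~length)

On a containerized deployment these would typically be run from within the OSD's container or 'cephadm shell'.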

One should realize that the free space fragmentation described above depends on a bunch of factors, none of which is absolutely dominant:

1. how users write/remove objects

2. how allocator seeks for free space

3. how much free space is available

So we have no real control over 1. and 3., and only limited opportunities for tuning 2.

Small device sizes and high space utilization severely increase the probability of hitting the issue, but theoretically even a large disk with mediocre utilization could reach a "bad" state over time if used (by both clients and the allocator) improperly/inefficiently. Hence tuning thresholds can reduce the probability of the issue occurring (at the cost of some additional wasted spare space), but it isn't a silver bullet.
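
For reference, and assuming the thresholds in question are the cluster-wide full ratios (an assumption on my part), they can be inspected and adjusted like this:

  ceph osd dump | grep ratio          # show current nearfull/backfillfull/full ratios
  ceph osd set-nearfull-ratio 0.85
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95

Lower values keep more headroom per OSD (less chance of reaching the state above) at the price of usable capacity; the defaults are 0.85/0.90/0.95.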

https://tracker.ceph.com/issues/53466 fixes (or rather works around) the issue by allowing BlueFS to use 4K extents. In addition, we're working on improving the resulting free space fragmentation on aged OSDs through better allocation strategies, e.g. see:

- https://github.com/ceph/ceph/pull/52489

- https://github.com/ceph/ceph/pull/57789

- https://github.com/ceph/ceph/pull/60870

Hope this is helpful.

Thanks,

Igor


On 27.11.2024 16:31, Frédéric Nass wrote:


----- On 27 Nov 24, at 10:19, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Hi Istvan,

    first of all let me make a remark that we don't know why BlueStore
    is out of space on John's cluster.

    It's just an unconfirmed hypothesis from Frederic that it's caused
    by high fragmentation and BlueFS's inability to use chunks
    smaller than 64K. In fact the fragmentation issue has been fixed
    since 17.2.6, so I doubt that's the problem.

Hi Igor,

I wasn't actually pointing to this as the root cause (since John's already using 17.2.6) but rather trying to explain the context. But while we're at it...

Could you elaborate on the circumstances that could prevent BlueFS from being able to allocate chunks in the collocated OSD scenario? Does this ability depend on whether the near/full thresholds have been reached? If so, then increasing these thresholds by 1-2% might help avoid the crash, no?

Also, if BlueFS is aware of these thresholds, shouldn't an OSD be able to start and run without crashing even when it's full, and simply (maybe easier said than done...) refuse any I/O? Sorry for the noob questions. :-)

This topic is particularly important when using NVMe drives as 'collocated' OSDs, especially since they often host critical metadata pools (cephfs, rgw index).

Cheers,
Frédéric.

    Thanks,

    Igor

    On 27.11.2024 4:01, Szabo, Istvan (Agoda) wrote:

        Hi,

        This issue should not happen anymore as of 17.2.8, am I correct?
        In that version all the fragmentation issues should be gone,
        even with collocated wal+db+block.

        ------------------------------------------------------------------------
        *From:* Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
        *Sent:* Wednesday, November 27, 2024 6:12:46 AM
        *To:* John Jasen <jjasen@xxxxxxxxx>
        *Cc:* Igor Fedotov <igor.fedotov@xxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
        *Subject:* Re: down OSDs, Bluestore out of space, unable to restart


        Hi John,

        That's about right. Two potential solutions exist:
        1. Adding a new drive to the server and sharing it for RocksDB
        metadata, or
        2. Repurposing one of the failed OSDs for the same purpose (if
        adding more drives isn't feasible).

        Igor's post #6 [1] explains the challenges with co-located
        OSDs (DB+WAL+data on the same device) when they run out of
        space, where significant fragmentation occurs and BlueFS and
        BlueStore block sizes are misaligned. The solution (included
        in 17.2.6) was to allow BlueFS to allocate 4k extents when it
        couldn't find 64k contiguous extents. However, it seems that
        even with this fix, these OSDs still can't boot up.

        Therefore, the recommendation is to extend the RocksDB volume
        to another device as a temporary workaround.

        Before proceeding, I recommend checking the failed OSDs'
        bluefs_shared_alloc_size value. If it's 64k, you might want to
        try lowering this to 32k or even 4k, as some users reported
        [2] that reducing this value helped failed OSDs boot up and
        remain stable for a period of time. Might be worth checking
        and trying.
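
        A sketch of that check-and-lower attempt (OSD 242 is just the id from John's logs; the exact restart command depends on the deployment):

          ceph config get osd.242 bluefs_shared_alloc_size
          ceph config set osd.242 bluefs_shared_alloc_size 32768    # or 4096
          ceph orch daemon restart osd.242    # or restart the OSD's systemd unit on non-cephadm setups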

        Regards,
        Frédéric.

        [1] https://tracker.ceph.com/issues/53466#note-6
        [2] https://github.com/rook/rook/issues/9885#issuecomment-1761076861


        ________________________________
        From: John Jasen <jjasen@xxxxxxxxx>
        Sent: Tuesday, 26 November 2024 18:50
        To: Igor Fedotov
        Cc: ceph-users
        Subject: Re: down OSDs, Bluestore out of space, unable to restart

        Let me see if I have the approach right'ish:

        scrounge some more disk for the servers with full/down OSDs.
        partition the new disks into LVs for each downed OSD.
        Attach as a lvm new-db to the downed OSDs.
        Restart the OSDs.
        Profit.

        Is that about right?
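
        A minimal sketch of that sequence, with hypothetical device/VG/LV names and sizes (on cephadm deployments the ceph-volume step typically runs inside 'cephadm shell'):

          pvcreate /dev/sdX                          # the scrounged spare disk
          vgcreate ceph-db-extra /dev/sdX
          lvcreate -L 50G -n db-242 ceph-db-extra    # one LV per downed OSD; size is illustrative
          ceph-volume lvm new-db --osd-id 242 --osd-fsid <osd fsid> --target ceph-db-extra/db-242
          systemctl start ceph-osd@242               # or: ceph orch daemon start osd.242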


        On Tue, Nov 26, 2024 at 11:28 AM Igor Fedotov
        <igor.fedotov@xxxxxxxx> wrote:

        > Well, so there is a single shared volume (disk) per OSD, right?
        >
        > If so, one can add a dedicated DB volume to such an OSD - once
        > done, the OSD will have two underlying devices: the main one (the
        > original shared disk) and the new dedicated DB one. And hence this
        > will effectively provide additional space for BlueFS/RocksDB and
        > permit the OSD to start up.
        >
        > I'm not aware of all the details of how to do that with cephadm
        > (or whatever RH uses), but on a bare-metal setup this could be
        > achieved by issuing the 'ceph-volume lvm new-db' command, which
        > attaches a new LV (provided by the user) to a specific OSD.
        >
        >
        > Thanks,
        >
        > Igor
        >
        >
        > On 26.11.2024 19:16, John Jasen wrote:
        >
        > They're all bluefs_single_shared_device, if I understand
        your question.
        > There's no room left on the devices to expand.
        >
        > We started at Quincy with this cluster, and didn't vary too
        > much from the Red Hat Ceph Storage 6 documentation for setting
        > it up.
        >
        >
        > On Tue, Nov 26, 2024 at 4:48 AM Igor Fedotov
        > <igor.fedotov@xxxxxxxx>
        > wrote:
        >
        >> Hi John,
        >>
        >> you haven't described your OSD volume configuration, but you
        >> might want to try adding a standalone DB volume if the OSD uses
        >> LVM and has a single main device only.
        >>
        >> The 'ceph-volume lvm new-db' command is the preferred way of
        >> doing that, see
        >>
        >> https://docs.ceph.com/en/quincy/ceph-volume/lvm/newdb/
        >>
        >>
        >> Thanks,
        >>
        >> Igor
        >>
        >> On 25.11.2024 21:37, John Jasen wrote:
        >> > Ceph version 17.2.6
        >> >
        >> > After a power loss event affecting my ceph cluster, I've
        been putting
        >> > humpty dumpty back together since.
        >> >
        >> > One problem I face is that with objects degraded,
        rebalancing doesn't
        >> run
        >> > -- and this resulted in several of my fast OSDs filling up.
        >> >
        >> > I have 8 OSDs currently down, 100% full (exceeding all the
        >> > full ratio settings, whether default or ones I toggled to try
        >> > and keep it together), and when I try to restart them, they
        >> > fail out. Is there any way to bring these back from the dead?
        >> >
        >> > Here's some interesting output from journalctl -xeu on
        the failed OSD:
        >> >
        >> > ceph-osd[2383080]:
        bluestore::NCB::__restore_allocator::No Valid
        >> allocation
        >> > info on disk (empty file)
        >> > ceph-osd[2383080]: bluestore(/var/lib/ceph/osd/ceph-242)
        >> > _init_alloc::NCB::restore_allocator() failed! Run Full
        Recovery from
        >> ONodes
        >> > (might take a while) ...
        >> >
        >> > ceph-osd[2389725]: bluefs _allocate allocation failed,
        needed 0x3000
        >> >
        >> >
        ceph-6ab85342-53d6-11ee-88a7-e43d1a153e91-osd-242[2389718]: -2>
        >> > 2024-11-25T18:31:42.070+0000 7f0adfdef540 -1 bluefs
        _flush_range_F
        >> > allocated: 0x0 offset: 0x0 length: 0x230f
        >> > ceph-osd[2389725]: bluefs _flush_range_F allocated: 0x0
        offset: 0x0
        >> length:
        >> > 0x230f
        >> >
        >> > Followed quickly by an abort:
        >> >
        >> >
        >>
        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
        >> > In funct>
        >> >
        >> >
        >>
        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
        >> > 3380: ce>
        >> >
        >> > ceph
        >> version
        >> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
        (stable)
        >> > 1:
        >> > (ceph::__ceph_abort(char const*, int, char const*,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&)+0xd7) [0x559bf4361d2f]
        >> > 2:
        >> > (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned
        long, unsigned
        >> > long)+0x7a9) [0x559bf4b225f9]
        >> > 3:
        >> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0xa2)
        >> [0x559bf4b22812]
        >> > 4:
        >> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e) [0x559bf4b40c3e]
        >> > 5:
        >> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
        >> > 6:
        >> >
        (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
        const&,
        >> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
        >> > 7:
        >> > (rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa)
        [0x559bf51a880a]
        >> > 8:
        >> > (rocksdb::WritableFileWriter::Sync(bool)+0x100)
        [0x559bf51aa0a0]
        >> > 9:
        >> > (rocksdb::SyncManifest(rocksdb::Env*,
        rocksdb::ImmutableDBOptions
        >> const*,
        >> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
        >> > 10:
        >> >
        >>
        (rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
        >> > std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
        >> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
        rocks>
        >> > 11:
        >> >
        >>
        (rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*,
        >> > 8ul> const&,
        rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
        >> > const&, rocksdb::autovector<rocksdb::autovector<rocksdb::>
        >> > 12:
        >> > (rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
        >> > rocksdb::MutableCFOptions const&, rocksdb::VersionEdit*,
        >> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
        >> > rocksdb::ColumnFamilyOptions const>
        >> > 13:
        >> > (rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30)
        [0x559bf50bd250]
        >> > 14:
        >> >
        (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::ColumnFamilyDescriptor> > const&,
        bool, bool,
        >> bool,
        >> > unsigned long*)+0x13f1) [0x559bf50d3f21]
        >> > 15:
        >> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&,
        >> std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::Colu>
        >> > 16:
        >> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&,
        >> std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::ColumnFa>
        >> > 17:
        >> > (RocksDBStore::do_open(std::ostream&, bool, bool,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&)+0x77a) [0x559bf503766a]
        >> > 18:
        >> > (BlueStore::_open_db(bool, bool, bool)+0xbb4)
        [0x559bf4a4bff4]
        >> > 19:
        >> > (BlueStore::_open_db_and_around(bool, bool)+0x500)
        [0x559bf4a766e0]
        >> > 20:
        >> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
        >> > 21:
        >> > (OSD::init()+0x556) [0x559bf44a0eb6]
        >> > 22: main()
        >> > 23:
        >> > __libc_start_main()
        >> > 24:
        >> _start()
        >> >
        >> > *** Caught signal (Aborted) **
        >> > in thread
        >> > 7f0adfdef540 thread_name:ceph-osd
        >> >
        >> > ceph
        >> version
        >> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
        (stable)
        >> > 1:
        >> > /lib64/libpthread.so.0(+0x12cf0) [0x7f0addff1cf0]
        >> > 2:
        >> gsignal()
        >> > 3: abort()
        >> > 4:
        >> > (ceph::__ceph_abort(char const*, int, char const*,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&)+0x197) [0x559bf4361def]
        >> > 5:
        >> > (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned
        long, unsigned
        >> > long)+0x7a9) [0x559bf4b225f9]
        >> > 6:
        >> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0xa2)
        >> [0x559bf4b22812]
        >> > 7:
        >> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e) [0x559bf4b40c3e]
        >> > 8:
        >> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
        >> > 9:
        >> >
        (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
        const&,
        >> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
        >> > 10:
        >> > (rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa)
        [0x559bf51a880a]
        >> > 11:
        >> > (rocksdb::WritableFileWriter::Sync(bool)+0x100)
        [0x559bf51aa0a0]
        >> > 12:
        >> > (rocksdb::SyncManifest(rocksdb::Env*,
        rocksdb::ImmutableDBOptions
        >> const*,
        >> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
        >> > 13:
        >> >
        >>
        (rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
        >> > std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
        >> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
        rocks>
        >> > 14:
        >> >
        >>
        (rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*,
        >> > 8ul> const&,
        rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
        >> > const&, rocksdb::autovector<rocksdb::autovector<rocksdb::>
        >> > 15:
        >> > (rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
        >> > rocksdb::MutableCFOptions const&, rocksdb::VersionEdit*,
        >> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
        >> > rocksdb::ColumnFamilyOptions const>
        >> > 16:
        >> > (rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30)
        [0x559bf50bd250]
        >> > 17:
        >> >
        (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::ColumnFamilyDescriptor> > const&,
        bool, bool,
        >> bool,
        >> > unsigned long*)+0x13f1) [0x559bf50d3f21]
        >> > 18:
        >> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&,
        >> std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::Colu>
        >> > 19:
        >> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&,
        >> std::vector<rocksdb::ColumnFamilyDescriptor,
        >> > std::allocator<rocksdb::ColumnFa>
        >> > 20:
        >> > (RocksDBStore::do_open(std::ostream&, bool, bool,
        >> > std::__cxx11::basic_string<char, std::char_traits<char>,
        >> > std::allocator<char> > const&)+0x77a) [0x559bf503766a]
        >> > 21:
        >> > (BlueStore::_open_db(bool, bool, bool)+0xbb4)
        [0x559bf4a4bff4]
        >> > 22:
        >> > (BlueStore::_open_db_and_around(bool, bool)+0x500)
        [0x559bf4a766e0]
        >> > 23:
        >> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
        >> > 24:
        >> > (OSD::init()+0x556) [0x559bf44a0eb6]
        >> > 25: main()
        >> > 26:
        >> > __libc_start_main()
        >> > 27:
        >> _start()
        >> > NOTE: a
        >> copy
        >> > of the executable, or `objdump -rdS <executable>` is
        needed to interpret
        >> > this.
        >>
        >



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



