Re: down OSDs, Bluestore out of space, unable to restart

Istvan,


Unfortunately, there is no such formula.

It depends entirely on the allocation/release pattern that has happened on the disk, which in turn depends on how clients performed object writes and removals.


My general observation is that the issue tends to happen on small drives and/or at very high space utilization.

Your numbers don't look like that to me...


Anyway, the major symptom of the issue is BlueFS (only!) allocation failures occurring while some free space (e.g. a few GBs) is still available.


And once again - we have no clue what the actual issue with John's cluster is, as the logs I saw were pretty stripped... Adding a standalone DB volume can be helpful against BlueFS (!) "no-space" failures in any case, though. This should permit bringing the OSDs up and removing some data (or exporting PGs from them) to free up space. If the main device is out of space, writes to such an OSD still wouldn't work, though.
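
For reference, exporting a PG from a stopped OSD can be done offline with ceph-objectstore-tool, roughly like this (the data path and pgid are just illustrative):

# list the PGs held by the stopped OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-242 --op list-pgs
# export one of them to spare storage elsewhere
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-242 --pgid 2.1a --op export --file /mnt/spare/2.1a.export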


Thanks,

Igor



On 27.11.2024 8:32, Szabo, Istvan (Agoda) wrote:
Hi,

I'd like to understand how free space is calculated when an OSD crashes and reports no free space on the device (maybe due to fragmentation or an allocation issue). I checked all the graphs back to September, when we had multiple OSD failures on Octopus 15.2.17 with co-located WAL+DB+block.

 * OSD sizes: 3.5 TiB
 * OSD fullness was around 57%.
 * The crashed OSDs' DB sizes were around 160 GiB.

Based on these values, and considering the bluefs_shared_alloc_size of 64k, is there any formula I can use to predict how full the OSDs are? Perhaps adding on top of this some calculation with the object count or some other meta information?

Thank you

------------------------------------------------------------------------
*From:* Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
*Sent:* Wednesday, November 27, 2024 10:33 AM
*To:* Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>; John Jasen <jjasen@xxxxxxxxx>; Igor Fedotov <igor.fedotov@xxxxxxxx>
*Cc:* ceph-users <ceph-users@xxxxxxx>
*Subject:* Re: down OSDs, Bluestore out of space, unable to restart
Got it, the perf dump can give this information:
ceph daemon osd.x perf dump|jq .bluefs
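
In that output, db_used_bytes vs. db_total_bytes seems to be the number to watch (plus slow_used_bytes when a separate DB device has spilled over). A trimmed sketch of the .bluefs section, with made-up values:

"db_total_bytes": 171798691840,
"db_used_bytes": 160088326144,
"slow_total_bytes": 0,
"slow_used_bytes": 0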



________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: Wednesday, November 27, 2024 9:20 AM
To: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>; John Jasen <jjasen@xxxxxxxxx>; Igor Fedotov <igor.fedotov@xxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject: Re: down OSDs, Bluestore out of space, unable to restart

Hi,

Is there a way to check, without shutting down an OSD, how much space is left for the DB on a co-located OSD, or how full it is?

________________________________
From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: Wednesday, November 27, 2024 6:11 AM
To: John Jasen <jjasen@xxxxxxxxx>
Cc: Igor Fedotov <igor.fedotov@xxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: down OSDs, Bluestore out of space, unable to restart


Hi John,

That's about right. Two potential solutions exist:
1. Adding a new drive to the server and sharing it for RocksDB metadata, or
2. Repurposing one of the failed OSDs for the same purpose (if adding more drives isn't feasible).

Igor's post #6 [1] explains the challenges with co-located OSDs (DB+WAL+data on the same device) when they run out of space, where significant fragmentation occurs and BlueFS and BlueStore block sizes are misaligned. The solution (included in 17.2.6) was to allow BlueFS to allocate 4k extents when it couldn't find 64k contiguous extents. However, it seems that even with this fix, these OSDs still can't boot up.

Therefore, the recommendation is to extend the RocksDB volume to another device as a temporary workaround.

Before proceeding, I recommend checking the failed OSDs' bluefs_shared_alloc_size value. If it's 64k, you might want to try lowering it to 32k or even 4k, as some users have reported [2] that reducing this value helped failed OSDs boot up and remain stable for a while.
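
For example, checking and lowering it could look roughly like this (the OSD id is illustrative; the new value only takes effect when the OSD starts, and the restart command depends on your deployment):

# what the failed OSD is currently configured with (from the mon config db)
ceph config get osd.242 bluefs_shared_alloc_size
# try a smaller allocation unit, then attempt to start the OSD again
ceph config set osd.242 bluefs_shared_alloc_size 32768
systemctl restart ceph-osd@242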

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/53466#note-6
[2] https://github.com/rook/rook/issues/9885#issuecomment-1761076861


________________________________
From: John Jasen <jjasen@xxxxxxxxx>
Sent: Tuesday, November 26, 2024 18:50
To: Igor Fedotov
Cc: ceph-users
Subject: Re: down OSDs, Bluestore out of space, unable to restart

Let me see if I have the approach right-ish:

Scrounge some more disk for the servers with full/down OSDs.
Partition the new disks into LVs, one for each downed OSD.
Attach each as an 'lvm new-db' to the downed OSDs.
Restart the OSDs.
Profit.

Is that about right?
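
Concretely, something like this is what I have in mind (names, sizes and the OSD id are illustrative; the OSD stays down until the last step):

# carve a dedicated DB LV per downed OSD out of the spare disk
vgcreate ceph-recovery-db /dev/sdx
lvcreate -L 64G -n db-osd242 ceph-recovery-db
# attach it to the downed (stopped) OSD as a new DB volume
ceph-volume lvm new-db --osd-id 242 --osd-fsid <osd-fsid> --target ceph-recovery-db/db-osd242
# then try to start it again
systemctl start ceph-osd@242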


On Tue, Nov 26, 2024 at 11:28 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

> Well, so there is a single shared volume (disk) per OSD, right?
>
> If so, one can add a dedicated DB volume to such an OSD - once done, the OSD
> will have two underlying devices: the main one (the original shared disk) and
> a new dedicated DB one. And hence this will effectively provide additional
> space for BlueFS/RocksDB and permit the OSD to start up.
>
> I'm not aware of all the details of how to do that with cephadm (or whatever
> RH uses), but on a bare metal setup this could be achieved by issuing the
> 'ceph-volume lvm new-db' command, which will attach a new LV (provided by the
> user) to a specific OSD.
>
>
> Thanks,
>
> Igor
>
>
> On 26.11.2024 19:16, John Jasen wrote:
>
> They're all bluefs_single_shared_device, if I understand your question.
> There's no room left on the devices to expand.
>
> We started at Quincy with this cluster, and didn't vary too much from the
> Red Hat Ceph Storage 6 documentation for setting it up.
>
>
> On Tue, Nov 26, 2024 at 4:48 AM Igor Fedotov <igor.fedotov@xxxxxxxx>
> wrote:
>
>> Hi John,
>>
>> you haven't described your OSD volume configuration, but you might want
>> to try adding a standalone DB volume if the OSD uses LVM and has a single
>> main device only.
>>
>> The 'ceph-volume lvm new-db' command is the preferred way of doing that, see
>>
>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/newdb/
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 25.11.2024 21:37, John Jasen wrote:
>> > Ceph version 17.2.6
>> >
>> > After a power loss event affecting my ceph cluster, I've been putting
>> > humpty dumpty back together since.
>> >
>> > One problem I face is that with objects degraded, rebalancing doesn't
>> > run -- and this resulted in several of my fast OSDs filling up.
>> >
>> > I have 8 OSDs currently down, 100% full (exceeding all the full-ratio
>> > settings, whether default or ones I toggled to try to keep things
>> > together), and when I try to restart them, they fail out. Is there any
>> > way to bring these back from the dead?
>> >
>> > Here's some interesting output from journalctl -xeu on the failed OSD:
>> >
>> > ceph-osd[2383080]: bluestore::NCB::__restore_allocator::No Valid
>> allocation
>> > info on disk (empty file)
>> > ceph-osd[2383080]: bluestore(/var/lib/ceph/osd/ceph-242)
>> > _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from
>> ONodes
>> > (might take a while) ...
>> >
>> > ceph-osd[2389725]: bluefs _allocate allocation failed, needed 0x3000
>> >
>> > ceph-6ab85342-53d6-11ee-88a7-e43d1a153e91-osd-242[2389718]: -2>
>> > 2024-11-25T18:31:42.070+0000 7f0adfdef540 -1 bluefs _flush_range_F
>> > allocated: 0x0 offset: 0x0 length: 0x230f
>> > ceph-osd[2389725]: bluefs _flush_range_F allocated: 0x0 offset: 0x0
>> length:
>> > 0x230f
>> >
>> > Followed quickly by an abort:
>> >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
>> > In funct>
>> >
>> >
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
>> > 3380: ce>
>> >
>> > ceph
>> version
>> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>> > 1:
>> > (ceph::__ceph_abort(char const*, int, char const*,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&)+0xd7) [0x559bf4361d2f]
>> > 2:
>> > (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned
>> > long)+0x7a9) [0x559bf4b225f9]
>> > 3:
>> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0xa2)
>> [0x559bf4b22812]
>> > 4:
>> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e) [0x559bf4b40c3e]
>> > 5:
>> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
>> > 6:
>> > (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&,
>> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
>> > 7:
>> > (rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa) [0x559bf51a880a]
>> > 8:
>> > (rocksdb::WritableFileWriter::Sync(bool)+0x100) [0x559bf51aa0a0]
>> > 9:
>> > (rocksdb::SyncManifest(rocksdb::Env*, rocksdb::ImmutableDBOptions
>> const*,
>> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
>> > 10:
>> >
>> (rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
>> > std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
>> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool, rocks>
>> > 11:
>> >
>> (rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*, >> > 8ul> const&, rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
>> > const&, rocksdb::autovector<rocksdb::autovector<rocksdb::>
>> > 12:
>> > (rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
>> > rocksdb::MutableCFOptions const&, rocksdb::VersionEdit*,
>> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
>> > rocksdb::ColumnFamilyOptions const>
>> > 13:
>> > (rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30) [0x559bf50bd250]
>> > 14:
>> > (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool,
>> bool,
>> > unsigned long*)+0x13f1) [0x559bf50d3f21]
>> > 15:
>> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::Colu>
>> > 16:
>> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFa>
>> > 17:
>> > (RocksDBStore::do_open(std::ostream&, bool, bool,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&)+0x77a) [0x559bf503766a]
>> > 18:
>> > (BlueStore::_open_db(bool, bool, bool)+0xbb4) [0x559bf4a4bff4]
>> > 19:
>> > (BlueStore::_open_db_and_around(bool, bool)+0x500) [0x559bf4a766e0]
>> > 20:
>> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
>> > 21:
>> > (OSD::init()+0x556) [0x559bf44a0eb6]
>> > 22: main()
>> > 23:
>> > __libc_start_main()
>> > 24:
>> _start()
>> >
>> > *** Caught signal (Aborted) **
>> > in thread
>> > 7f0adfdef540 thread_name:ceph-osd
>> >
>> > ceph
>> version
>> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>> > 1:
>> > /lib64/libpthread.so.0(+0x12cf0) [0x7f0addff1cf0]
>> > 2:
>> gsignal()
>> > 3: abort()
>> > 4:
>> > (ceph::__ceph_abort(char const*, int, char const*,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&)+0x197) [0x559bf4361def]
>> > 5:
>> > (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned
>> > long)+0x7a9) [0x559bf4b225f9]
>> > 6:
>> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0xa2)
>> [0x559bf4b22812]
>> > 7:
>> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e) [0x559bf4b40c3e]
>> > 8:
>> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
>> > 9:
>> > (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&,
>> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
>> > 10:
>> > (rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa) [0x559bf51a880a]
>> > 11:
>> > (rocksdb::WritableFileWriter::Sync(bool)+0x100) [0x559bf51aa0a0]
>> > 12:
>> > (rocksdb::SyncManifest(rocksdb::Env*, rocksdb::ImmutableDBOptions
>> const*,
>> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
>> > 13:
>> >
>> (rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
>> > std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
>> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool, rocks>
>> > 14:
>> >
>> (rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*, >> > 8ul> const&, rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
>> > const&, rocksdb::autovector<rocksdb::autovector<rocksdb::>
>> > 15:
>> > (rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
>> > rocksdb::MutableCFOptions const&, rocksdb::VersionEdit*,
>> > rocksdb::InstrumentedMutex*, rocksdb::FSDirectory*, bool,
>> > rocksdb::ColumnFamilyOptions const>
>> > 16:
>> > (rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30) [0x559bf50bd250]
>> > 17:
>> > (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool,
>> bool,
>> > unsigned long*)+0x13f1) [0x559bf50d3f21]
>> > 18:
>> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::Colu>
>> > 19:
>> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFa>
>> > 20:
>> > (RocksDBStore::do_open(std::ostream&, bool, bool,
>> > std::__cxx11::basic_string<char, std::char_traits<char>,
>> > std::allocator<char> > const&)+0x77a) [0x559bf503766a]
>> > 21:
>> > (BlueStore::_open_db(bool, bool, bool)+0xbb4) [0x559bf4a4bff4]
>> > 22:
>> > (BlueStore::_open_db_and_around(bool, bool)+0x500) [0x559bf4a766e0]
>> > 23:
>> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
>> > 24:
>> > (OSD::init()+0x556) [0x559bf44a0eb6]
>> > 25: main()
>> > 26:
>> > __libc_start_main()
>> > 27:
>> _start()
>> > NOTE: a
>> copy
>> > of the executable, or `objdump -rdS <executable>` is needed to interpret
>> > this.
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx