Hi Igor,
Thank you for taking the time to explain the fragmentation issue. I
had figured out most of it by reading the tracker and the PR,
but it's always clearer when you explain it.
My question was more about why BlueFS would still fail to allocate 4K
chunks after being allowed to do so by
https://tracker.ceph.com/issues/53466 (John's case with v17.2.6, actually).
Is BlueFS aware of the remaining space, and does it maybe use some
sort of reserved blocks/chunks like other filesystems do to handle
full/near-full situations? If so, then it should never crash, right?
Other filesystems don't crash, drive firmware doesn't crash, etc.
Thanks,
Frédéric.
----- On 28 Nov 24, at 12:52, Igor Fedotov <igor.fedotov@xxxxxxxx>
wrote:
Hi Frederic,
here is an overview of the case where BlueFS is unable to allocate
more space on the main/shared device albeit free space is available.
Below I'm talking about the behavior that existed before the fix for
https://tracker.ceph.com/issues/53466.
First of all - BlueFS's minimal allocation unit for the shared device
was bluefs_shared_alloc_size (=64K by default), which means that
it was unable to use e.g. 2x32K or 16x4K chunks when it needed
an additional 64K bytes.
Secondly - sometimes RocksDB performs recovery - and some other
maintenance tasks that require space allocation - on startup,
which evidently triggers allocation of N*64K chunks from the shared
device.
Thirdly - a while ago we switched to 4K chunk allocations for user
data (not to be confused with BlueFS allocation), which could
potentially result in a specific free space fragmentation pattern
where the set of long (>=64K) free chunks is limited (or even empty)
while technically enough total free space is still available.
E.g. free extent list could look like (off~len, both in hex):
0x0~1000, 0x2000~1000, 0x4000~2000, 0x10000~4000, 0x20000~1000, etc...
In that case the original BlueFS allocator implementation was unable
to locate more free space - note that no extent in the list above is
0x10000 (64K) long - which in turn effectively broke both RocksDB and
OSD boot-up.
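As a side note, one can inspect the actual free space fragmentation.
A sketch, using John's osd.242 as an example (the admin socket
variant needs a running OSD, while ceph-bluestore-tool reads a
stopped one):

    # running OSD: fragmentation rating (0 = none .. 1 = severe) and
    # the raw free extent list of the main 'block' device
    ceph daemon osd.242 bluestore allocator score block
    ceph daemon osd.242 bluestore allocator dump block

    # stopped OSD: same information read offline
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-242 free-score
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-242 free-dump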
One should realize that the above free space fragmentation depends
on a bunch of factors, none of which is absolutely dominating:
1. how users write/remove objects
2. how the allocator seeks for free space
3. how much free space is available
So we have no control over 1. and 3., and limited
opportunities in tuning 2.
Small device sizes and high space utilization severely increase
the probability of the issue happening, but theoretically even a
large disk with mediocre utilization could reach a "bad" state over
time if used (by both clients and the allocator)
"improperly/inefficiently". Hence tuning thresholds can reduce the
probability of the issue occurring (at the cost of additional spare
space waste) but it isn't a silver bullet.
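For reference, those cluster-wide thresholds can be bumped as follows
(a sketch - the values are arbitrary examples; this only buys some
headroom, it doesn't defragment anything):

    ceph osd set-nearfull-ratio 0.87
    ceph osd set-backfillfull-ratio 0.92
    ceph osd set-full-ratio 0.97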
https://tracker.ceph.com/issues/53466 fixes (or rather works
around) the issue by allowing BlueFS to use 4K extents. Plus we're
working on producing better free space fragmentation on
aged OSDs by improving allocation strategies, e.g. see:
- https://github.com/ceph/ceph/pull/52489
- https://github.com/ceph/ceph/pull/57789
- https://github.com/ceph/ceph/pull/60870
Hope this is helpful.
Thanks,
Igor
On 27.11.2024 16:31, Frédéric Nass wrote:
----- On 27 Nov 24, at 10:19, Igor Fedotov
<igor.fedotov@xxxxxxxx> wrote:
Hi Istvan,
first of all let me make a remark that we don't know why
BlueStore is out of space at John's cluster.
It's just an unconfirmed hypothesis from Frederic that
it's caused by high fragmentation and BlueFS'es inability
to use chunks smaller than 64K. In fact fragmentation
issue is fixed since 17.2.6 so I doubt that's the problem.
Hi Igor,
I wasn't actually pointing this as the root cause (since
John's already using 17.2.6) but more to explain the context,
but while we're at it...
Could you elaborate on the circumstances that could prevent BlueFS
from being able to allocate chunks in a collocated OSD
scenario? Does this ability depend on near/full thresholds
being reached or not? If so, then increasing these thresholds by
1-2% may help avoid the crash, no?
Also, if BlueFS is aware of these thresholds, shouldn't an
OSD be able to start and live without crashing even when it's
full, and simply (maybe easier said than done...) refuse any
I/Os? Sorry for the noob questions. :-)
This topic is particularly important when using NVMe drives as
'collocated' OSDs, especially since they often host critical
metadata pools (CephFS, RGW index).
Cheers,
Frédéric.
Thanks,
Igor
On 27.11.2024 4:01, Szabo, Istvan (Agoda) wrote:
Hi,
This issue should not happen anymore as of 17.2.8, am I
correct? In that version the fragmentation issue
should be gone, even with collocated wal+db+block.
------------------------------------------------------------------------
*From:* Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
*Sent:* Wednesday, November 27, 2024 6:12:46 AM
*To:* John Jasen <jjasen@xxxxxxxxx>
*Cc:* Igor Fedotov <igor.fedotov@xxxxxxxx>; ceph-users
<ceph-users@xxxxxxx>
*Subject:* Re: down OSDs, Bluestore out
of space, unable to restart
Hi John,
That's about right. Two potential solutions exist:
1. Adding a new drive to the server and sharing it for
RocksDB metadata, or
2. Repurposing one of the failed OSDs for the same
purpose (if adding more drives isn't feasible).
Igor's post #6 [1] explains the challenges with
co-located OSDs (DB+WAL+data on the same device) when
they run out of space: significant fragmentation
occurs and the BlueFS and BlueStore block sizes are
misaligned. The fix (included in 17.2.6) was to
allow BlueFS to allocate 4k extents when it couldn't
find 64k contiguous ones. However, it seems that
even with this fix, these OSDs still can't boot up.
Therefore, the recommendation is to extend the RocksDB
volume to another device as a temporary workaround.
Before proceeding, I recommend checking the failed
OSDs' bluefs_shared_alloc_size value. If it's 64k, you
might want to try lowering it to 32k or even 4k, as
some users reported [2] that reducing this value
helped failed OSDs boot up and remain stable for a
period of time.
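For example (a sketch - osd.242 is just taken from your logs, and
the OSD has to be restarted to pick up the change):

    # check the current value
    ceph config get osd.242 bluefs_shared_alloc_size
    # lower it for this OSD only, then restart the OSD
    ceph config set osd.242 bluefs_shared_alloc_size 32768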
Regards,
Frédéric.
[1] https://tracker.ceph.com/issues/53466#note-6
[2] https://github.com/rook/rook/issues/9885#issuecomment-1761076861
________________________________
From: John Jasen <jjasen@xxxxxxxxx>
Sent: Tuesday, 26 November 2024 18:50
To: Igor Fedotov
Cc: ceph-users
Subject: Re: down OSDs, Bluestore out of space, unable to restart
Let me see if I have the approach right'ish:
Scrounge some more disk for the servers with full/down OSDs.
Partition the new disks into LVs, one per downed OSD.
Attach each LV as a new DB volume ('ceph-volume lvm new-db') to its OSD.
Restart the OSDs.
Profit.
Is that about right?
On Tue, Nov 26, 2024 at 11:28 AM Igor Fedotov <igor.fedotov@xxxxxxxx>
wrote:
> Well, so there is a single shared volume (disk) per OSD, right?
>
> If so, one can add a dedicated DB volume to such an OSD - once done,
> the OSD will have two underlying devices: the main one (which is the
> original shared disk) and the new dedicated DB one. And hence this
> will effectively provide additional space for BlueFS/RocksDB and
> permit the OSD to start up.
>
> I'm not aware of all the details of how to do that with cephadm (or
> whatever RH uses) but on a bare metal setup this could be achieved by
> issuing the 'ceph-volume lvm new-db' command, which will attach a new
> LV (provided by the user) to a specific OSD.
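>
> A rough sketch of that procedure (assuming a volume group vg_newdb
> with free space already exists; names and size are made-up examples,
> see https://docs.ceph.com/en/quincy/ceph-volume/lvm/newdb/ for
> details):
>
>     lvcreate -n osd-242-db -L 50G vg_newdb
>     ceph-volume lvm new-db --osd-id 242 --osd-fsid <osd fsid> --target vg_newdb/osd-242-db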
>
>
> Thanks,
>
> Igor
>
>
> On 26.11.2024 19:16, John Jasen wrote:
>
> They're all bluefs_single_shared_device, if I understand your
> question. There's no room left on the devices to expand.
>
> We started at Quincy with this cluster, and didn't vary too much
> from the Red Hat Ceph Storage 6 documentation when setting it up.
>
>
> On Tue, Nov 26, 2024 at 4:48 AM Igor Fedotov
<igor.fedotov@xxxxxxxx> <mailto:igor.fedotov@xxxxxxxx>
> wrote:
>
>> Hi John,
>>
>> you haven't described your OSD volume configuration, but you might
>> want to try adding a standalone DB volume if the OSD uses LVM and
>> has a single main device only.
>>
>> The 'ceph-volume lvm new-db' command is the preferred way of doing
>> that, see
>>
>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/newdb/
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 25.11.2024 21:37, John Jasen wrote:
>> > Ceph version 17.2.6
>> >
>> > After a power loss event affecting my Ceph cluster, I've been
>> > putting Humpty Dumpty back together since.
>> >
>> > One problem I face is that with objects degraded, rebalancing
>> > doesn't run -- and this resulted in several of my fast OSDs
>> > filling up.
>> >
>> > I have 8 OSDs currently down, 100% full (exceeding all the
>> > full-ratio settings, both the defaults and the ones I toggled to
>> > try to keep it together), and when I try to restart them, they
>> > fail out. Is there any way to bring these back from the dead?
>> >
>> > Here's some interesting output from journalctl -xeu on the
>> > failed OSD:
>> >
>> > ceph-osd[2383080]:
bluestore::NCB::__restore_allocator::No Valid
>> allocation
>> > info on disk (empty file)
>> > ceph-osd[2383080]:
bluestore(/var/lib/ceph/osd/ceph-242)
>> > _init_alloc::NCB::restore_allocator() failed! Run
Full Recovery from
>> ONodes
>> > (might take a while) ...
>> >
>> > ceph-osd[2389725]: bluefs _allocate allocation
failed, needed 0x3000
>> >
>> >
ceph-6ab85342-53d6-11ee-88a7-e43d1a153e91-osd-242[2389718]:
-2>
>> > 2024-11-25T18:31:42.070+0000 7f0adfdef540 -1
bluefs _flush_range_F
>> > allocated: 0x0 offset: 0x0 length: 0x230f
>> > ceph-osd[2389725]: bluefs _flush_range_F
allocated: 0x0 offset: 0x0
>> length:
>> > 0x230f
>> >
>> > Followed quickly by an abort:
>> >
>> >
>>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
>> > In funct>
>> >
>> >
>>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueFS.cc:
>> > 3380: ce>
>> >
>> > ceph
>> version
>> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)
>> > 1:
>> > (ceph::__ceph_abort(char const*, int, char const*,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&)+0xd7) [0x559bf4361d2f]
>> > 2:
>> > (BlueFS::_flush_range_F(BlueFS::FileWriter*,
unsigned long, unsigned
>> > long)+0x7a9) [0x559bf4b225f9]
>> > 3:
>> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool,
bool*)+0xa2)
>> [0x559bf4b22812]
>> > 4:
>> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e)
[0x559bf4b40c3e]
>> > 5:
>> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
>> > 6:
>> >
(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
const&,
>> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
>> > 7:
>> >
(rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa)
[0x559bf51a880a]
>> > 8:
>> > (rocksdb::WritableFileWriter::Sync(bool)+0x100)
[0x559bf51aa0a0]
>> > 9:
>> > (rocksdb::SyncManifest(rocksdb::Env*,
rocksdb::ImmutableDBOptions
>> const*,
>> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
>> > 10:
>> >
>>
(rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
>> >
std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
>> > rocksdb::InstrumentedMutex*,
rocksdb::FSDirectory*, bool, rocks>
>> > 11:
>> >
>>
(rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*,
>> > 8ul> const&,
rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
>> > const&,
rocksdb::autovector<rocksdb::autovector<rocksdb::>
>> > 12:
>> >
(rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
>> > rocksdb::MutableCFOptions const&,
rocksdb::VersionEdit*,
>> > rocksdb::InstrumentedMutex*,
rocksdb::FSDirectory*, bool,
>> > rocksdb::ColumnFamilyOptions const>
>> > 13:
>> >
(rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30)
[0x559bf50bd250]
>> > 14:
>> >
(rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFamilyDescriptor> >
const&, bool, bool,
>> bool,
>> > unsigned long*)+0x13f1) [0x559bf50d3f21]
>> > 15:
>> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::Colu>
>> > 16:
>> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFa>
>> > 17:
>> > (RocksDBStore::do_open(std::ostream&, bool, bool,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&)+0x77a)
[0x559bf503766a]
>> > 18:
>> > (BlueStore::_open_db(bool, bool, bool)+0xbb4)
[0x559bf4a4bff4]
>> > 19:
>> > (BlueStore::_open_db_and_around(bool,
bool)+0x500) [0x559bf4a766e0]
>> > 20:
>> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
>> > 21:
>> > (OSD::init()+0x556) [0x559bf44a0eb6]
>> > 22: main()
>> > 23:
>> > __libc_start_main()
>> > 24:
>> _start()
>> >
>> > *** Caught signal (Aborted) **
>> > in thread
>> > 7f0adfdef540 thread_name:ceph-osd
>> >
>> > ceph
>> version
>> > 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)
>> > 1:
>> > /lib64/libpthread.so.0(+0x12cf0) [0x7f0addff1cf0]
>> > 2:
>> gsignal()
>> > 3: abort()
>> > 4:
>> > (ceph::__ceph_abort(char const*, int, char const*,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&)+0x197)
[0x559bf4361def]
>> > 5:
>> > (BlueFS::_flush_range_F(BlueFS::FileWriter*,
unsigned long, unsigned
>> > long)+0x7a9) [0x559bf4b225f9]
>> > 6:
>> > (BlueFS::_flush_F(BlueFS::FileWriter*, bool,
bool*)+0xa2)
>> [0x559bf4b22812]
>> > 7:
>> > (BlueFS::fsync(BlueFS::FileWriter*)+0x8e)
[0x559bf4b40c3e]
>> > 8:
>> > (BlueRocksWritableFile::Sync()+0x19) [0x559bf4b51ed9]
>> > 9:
>> >
(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
const&,
>> > rocksdb::IODebugContext*)+0x22) [0x559bf507fbd2]
>> > 10:
>> >
(rocksdb::WritableFileWriter::SyncInternal(bool)+0x5aa)
[0x559bf51a880a]
>> > 11:
>> > (rocksdb::WritableFileWriter::Sync(bool)+0x100)
[0x559bf51aa0a0]
>> > 12:
>> > (rocksdb::SyncManifest(rocksdb::Env*,
rocksdb::ImmutableDBOptions
>> const*,
>> > rocksdb::WritableFileWriter*)+0x10b) [0x559bf51a3bfb]
>> > 13:
>> >
>>
(rocksdb::VersionSet::ProcessManifestWrites(std::deque<rocksdb::VersionSet::ManifestWriter,
>> >
std::allocator<rocksdb::VersionSet::ManifestWriter> >&,
>> > rocksdb::InstrumentedMutex*,
rocksdb::FSDirectory*, bool, rocks>
>> > 14:
>> >
>>
(rocksdb::VersionSet::LogAndApply(rocksdb::autovector<rocksdb::ColumnFamilyData*,
>> > 8ul> const&,
rocksdb::autovector<rocksdb::MutableCFOptions const*, 8ul>
>> > const&,
rocksdb::autovector<rocksdb::autovector<rocksdb::>
>> > 15:
>> >
(rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*,
>> > rocksdb::MutableCFOptions const&,
rocksdb::VersionEdit*,
>> > rocksdb::InstrumentedMutex*,
rocksdb::FSDirectory*, bool,
>> > rocksdb::ColumnFamilyOptions const>
>> > 16:
>> >
(rocksdb::DBImpl::DeleteUnreferencedSstFiles()+0xa30)
[0x559bf50bd250]
>> > 17:
>> >
(rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFamilyDescriptor> >
const&, bool, bool,
>> bool,
>> > unsigned long*)+0x13f1) [0x559bf50d3f21]
>> > 18:
>> > (rocksdb::DBImpl::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::Colu>
>> > 19:
>> > (rocksdb::DB::Open(rocksdb::DBOptions const&,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&,
>> std::vector<rocksdb::ColumnFamilyDescriptor,
>> > std::allocator<rocksdb::ColumnFa>
>> > 20:
>> > (RocksDBStore::do_open(std::ostream&, bool, bool,
>> > std::__cxx11::basic_string<char,
std::char_traits<char>,
>> > std::allocator<char> > const&)+0x77a)
[0x559bf503766a]
>> > 21:
>> > (BlueStore::_open_db(bool, bool, bool)+0xbb4)
[0x559bf4a4bff4]
>> > 22:
>> > (BlueStore::_open_db_and_around(bool,
bool)+0x500) [0x559bf4a766e0]
>> > 23:
>> > (BlueStore::_mount()+0x396) [0x559bf4a795d6]
>> > 24:
>> > (OSD::init()+0x556) [0x559bf44a0eb6]
>> > 25: main()
>> > 26:
>> > __libc_start_main()
>> > 27:
>> _start()
>> > NOTE: a copy of the executable, or `objdump -rdS <executable>`
>> > is needed to interpret this.
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx