Dear Ceph Community,
I hope this message finds you well. I am reaching out for assistance
with a stability issue we have encountered after upgrading our Ceph
cluster from Pacific (16.2.3) to Quincy (17.2.8).
Following the upgrade, several of our Object Storage Daemons (OSDs)
have been behaving erratically: they frequently "flap", unexpectedly
going down and then coming back up. The issue predominantly affects
the recently upgraded OSDs in the cluster.
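For reference, the down/up transitions show up in the usual cluster
status commands, along these lines (output omitted here; the full OSD
logs are linked at the end of this mail):

ceph -s              # overall cluster health, including down OSD count
ceph health detail   # per-OSD down/flapping warnings
ceph osd tree down   # OSDs currently marked down
ceph crash ls        # crashes recorded by the crash module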
Upon reviewing the logs from the affected OSDs, we encountered the
following messages:
2025-02-03T08:34:09.769+0000 7f0f11390780 -1 bluestore::NCB::__restore_allocator::Failed open_for_read with error-code -2
2025-02-03T08:38:22.920+0000 7feb9dd44780 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
In an attempt to resolve the issue, we ran the ceph-bluestore-tool
fsck and repair commands against the affected OSDs. Both completed
successfully, but they did not rectify the problem.
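For clarity, the invocations were of this general form, with the OSD
stopped first (osd.211 shown as an example; the mount path below
assumes a non-containerized deployment and may differ under cephadm):

systemctl stop ceph-osd@211
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-211
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-211
systemctl start ceph-osd@211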
Additionally, the crash module recorded the following report for one
of the affected OSDs (osd.211):

ceph crash info 2025-02-03T09:19:08.749233Z_9e2800fb-77f6-46cb-8087-203ea15a2039
{
    "assert_condition": "log.t.seq == log.seq_live",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc",
    "assert_func": "uint64_t BlueFS::_log_advance_seq()",
    "assert_line": 3029,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc: In function 'uint64_t BlueFS::_log_advance_seq()' thread 7ff983564640 time 2025-02-03T09:19:08.738781+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc: 3029: FAILED ceph_assert(log.t.seq == log.seq_live)\n",
    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
        "/lib64/libc.so.6(+0x3e730) [0x7ff9930f5730]",
        "/lib64/libc.so.6(+0x8bbdc) [0x7ff993142bdc]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x179) [0x55882dfb7fdd]",
        "/usr/bin/ceph-osd(+0x36b13e) [0x55882dfb813e]",
        "/usr/bin/ceph-osd(+0x9cff3b) [0x55882e61cf3b]",
        "(BlueFS::_flush_and_sync_log_jump_D(unsigned long)+0x4e) [0x55882e6291ee]",
        "(BlueFS::_compact_log_async_LD_LNF_D()+0x59b) [0x55882e62e8fb]",
        "/usr/bin/ceph-osd(+0x9f2b15) [0x55882e63fb15]",
        "(BlueFS::fsync(BlueFS::FileWriter*)+0x1b9) [0x55882e631989]",
        "/usr/bin/ceph-osd(+0x9f4889) [0x55882e641889]",
        "/usr/bin/ceph-osd(+0xd74cd5) [0x55882e9c1cd5]",
        "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x483) [0x55882eade393]",
        "(rocksdb::WritableFileWriter::Sync(bool)+0x120) [0x55882eae0b60]",
        "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x337) [0x55882ea00ab7]",
        "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x1935) [0x55882ea07675]",
        "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x35) [0x55882ea077c5]",
        "(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x83) [0x55882e992593]",
        "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x99) [0x55882e992ee9]",
        "(BlueStore::_kv_sync_thread()+0xf64) [0x55882e578e24]",
        "/usr/bin/ceph-osd(+0x8afb81) [0x55882e4fcb81]",
        "/lib64/libc.so.6(+0x89e92) [0x7ff993140e92]",
        "/lib64/libc.so.6(+0x10ef20) [0x7ff9931c5f20]"
    ],
    "ceph_version": "17.2.8",
    "crash_id": "2025-02-03T09:19:08.749233Z_9e2800fb-77f6-46cb-8087-203ea15a2039",
    "entity_name": "osd.211",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "9",
    "os_version_id": "9",
    "process_name": "ceph-osd",
    "stack_sig": "ba90de24e2beba9c6a75249a4cce7c533987ca5127cfba5b835a3456174d6080",
    "timestamp": "2025-02-03T09:19:08.749233Z",
    "utsname_hostname": "afra-osd18",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-119-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024"
}
The crash report above shows an assertion failure in the BlueFS
component: ceph_assert(log.t.seq == log.seq_live) in
BlueFS::_log_advance_seq(), hit by the bstore_kv_sync thread while
BlueFS was compacting its log asynchronously
(BlueFS::_compact_log_async_LD_LNF_D in the backtrace). Despite our
efforts to analyze and resolve the issue, we have reached an impasse.
For completeness, we have verified the health of the underlying disks
with smartctl, and all of them report as healthy.
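The SMART checks were of the usual form (device names below are
illustrative):

smartctl -H /dev/sdX   # overall health self-assessment
smartctl -a /dev/sdX   # full SMART attribute and error-log dump

and none of the drives reported any failure indication.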
We kindly request guidance from the community on how to address this
issue or any recommended steps for deeper diagnostics. We appreciate
your support and expertise during this troubleshooting process.
OSD logs: https://paste.mozilla.org/6STm6eum
Thank you for your attention and assistance.
Best regards,
Nima