If it's the same issue, I'd check the fragmentation score on the entire cluster ASAP. You may have other OSDs close to the limit, and it's harder to fix when all your OSDs cross the line at once. If you drain this one, it may push the others into the red zone if you're too close, making the problem much worse.

Our cluster has been stable since splitting all the DBs onto their own volumes. Really looking forward to the 4K fix. :) But the workaround seems solid.

Thanks,
Kevin

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Tuesday, November 1, 2022 4:34 PM
To: Tony Liu; ceph-users@xxxxxxx; dev@xxxxxxx
Subject: Re: Is it a bug that OSD crashed when it's full?

Hi Tony,

first of all let me share my understanding of the issue you're facing. This reminds me of an upstream ticket, and I presume my root cause analysis from there (https://tracker.ceph.com/issues/57672#note-9) applies in your case as well.

So, generally speaking, your OSD isn't 100% full - from the log output one can see that 0x57acbc000 of 0x6fc8400000 bytes are free. But there are not enough contiguous 64K chunks for BlueFS to keep operating. As a result, the OSD managed to slip past all the *full* guards and reached the state where it crashed - those safety measures just weren't designed to take this additional free-space-fragmentation factor into account... Similarly, the lack of available 64K chunks prevents the OSD from starting up - it needs to write out some more data to BlueFS during startup recovery.

I'm currently working on enabling BlueFS to operate with the default main device allocation unit (=4K), which will hopefully fix the above issue. Meanwhile, you might want to work around the current OSD's state by setting bluefs_shared_alloc_size to 32K - this might have some operational and performance effects, but the OSD should very likely be able to start up afterwards. Please do not use 4K for now - it's known to cause more problems in some circumstances. And I'd highly recommend redeploying the OSD ASAP once you've drained all the data off it - I presume that's the reason you want to bring it up instead of letting the cluster recover by the regular means applied on OSD loss.

An alternative approach would be to add a standalone DB volume and migrate BlueFS there - ceph-volume should be able to do that even in the OSD's current state. Expanding the main volume (if it's backed by LVM and spare space is available) is apparently a valid option too.
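
For illustration, applying the 32K workaround to just the affected OSD could look like this (osd.3 is a placeholder id; the option takes a byte value, and the exact restart step depends on how you deploy):

    # Persist the override in the mon config db so the OSD reads it at
    # startup. Do NOT go down to 4096 for now.
    ceph config set osd.3 bluefs_shared_alloc_size 32768

    # Then try starting the OSD again (or via your orchestrator).
    systemctl start ceph-osd@3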
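
A sketch of the standalone-DB-volume route, assuming the ceph-volume lvm new-db/migrate subcommands present in recent releases and a spare LVM volume group (all names, sizes, and the fsid below are placeholders):

    # Create an LV for the new DB device and attach it to the stopped OSD.
    lvcreate -L 60G -n osd3-db vg_fast
    ceph-volume lvm new-db --osd-id 3 --osd-fsid <osd-fsid> --target vg_fast/osd3-db

    # Optionally also move the existing BlueFS data off the main device.
    ceph-volume lvm migrate --osd-id 3 --osd-fsid <osd-fsid> --from data --target vg_fast/osd3-db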
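
And expanding an LVM-backed main volume might look like the following (LV path is a placeholder; ceph-bluestore-tool runs against the stopped OSD's data directory):

    # Grow the LV, then let BlueFS/BlueStore detect and use the new space.
    lvextend -L +64G /dev/vg_hdd/osd-block-3
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-3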
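
To judge how close other OSDs are to the same state, the allocator fragmentation score can be queried per OSD. A minimal sketch, assuming the Pacific-era admin socket command (verify the exact command surface on your release):

    # 0.0 means unfragmented free space; values approaching 1.0 mean free
    # space is badly fragmented. Run on the OSD's host:
    ceph daemon osd.3 bluestore allocator score block

    # Rough sweep over the cluster, assuming your release also routes this
    # command through "ceph tell":
    for id in $(ceph osd ls); do
        echo -n "osd.$id: "
        ceph tell osd.$id bluestore allocator score block
    done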
Thanks,
Igor

On 11/1/2022 8:09 PM, Tony Liu wrote:
> The actual question is: is a crash expected when an OSD is full?
> My focus is more on how to prevent this from happening.
> My expectation is that an OSD rejects write requests when it's full, but doesn't crash.
> Otherwise, there is no point in having the ratio thresholds.
> Please let me know whether this is the design or a bug.
>
> Thanks!
> Tony
> ________________________________________
> From: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Sent: October 31, 2022 05:46 PM
> To: ceph-users@xxxxxxx; dev@xxxxxxx
> Subject: Is it a bug that OSD crashed when it's full?
>
> Hi,
>
> Based on the docs, Ceph prevents you from writing to a full OSD so that you don't lose data.
> In my case, with v16.2.10, the OSD crashed when it was full. Is this expected, or is it a bug?
> I'd expect a write failure instead of an OSD crash. It keeps crashing whenever I try to bring it up.
> Is there any way to bring it back?
>
> -7> 2022-10-31T22:52:57.426+0000 7fe37fd94200 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", "log_files": [23300]}
> -6> 2022-10-31T22:52:57.426+0000 7fe37fd94200 4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
> -5> 2022-10-31T22:52:57.529+0000 7fe37fd94200 3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
> -4> 2022-10-31T22:52:57.592+0000 7fe37fd94200 1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc8400000, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, allocated 0x0
> -3> 2022-10-31T22:52:57.592+0000 7fe37fd94200 -1 bluefs _allocate allocation failed, needed 0x8064a
> -2> 2022-10-31T22:52:57.592+0000 7fe37fd94200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x8064a
> -1> 2022-10-31T22:52:57.604+0000 7fe37fd94200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")
>
> ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x55858d7e2e7c]
> 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1]
> 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55858defa0b2]
> 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb]
> 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f]
> 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55858e4c02aa]
> 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55858e4c1700]
> 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86]
> 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc]
> 11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc]
> 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d]
> 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8]
> 14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
> 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55858e3f0ea5]
> 16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55858e3f35de]
> 17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55858e3f4938]
> 18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55858e3ee65d]
> 19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55858e3ef9f5]
> 20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55858e367601]
> 21: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55858ddde857]
> 22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55858de4c8f7]
> 23: (BlueStore::_mount()+0x204) [0x55858de4f7b4]
> 24: (OSD::init()+0x380) [0x55858d91d1d0]
> 25: main()
> 26: __libc_start_main()
> 27: _start()
>
> 0> 2022-10-31T22:52:57.617+0000 7fe37fd94200 -1 *** Caught signal (Aborted) **
> in thread 7fe37fd94200 thread_name:ceph-osd
>
> ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fe37dd33cf0]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55858d7e2f4d]
> 5: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1]
> 6: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
> 7: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55858defa0b2]
> 8: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb]
> 9: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f]
> 10: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55858e4c02aa]
> 11: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55858e4c1700]
> 12: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86]
> 13: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc]
> 14: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc]
> 15: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d]
> 16: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8]
> 17: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
> 18: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55858e3f0ea5]
> 19: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55858e3f35de]
> 20: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55858e3f4938]
> 21: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55858e3ee65d]
> 22: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55858e3ef9f5]
> 23: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55858e367601]
> 24: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55858ddde857]
> 25: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55858de4c8f7]
> 26: (BlueStore::_mount()+0x204) [0x55858de4f7b4]
> 27: (OSD::init()+0x380) [0x55858d91d1d0]
> 28: main()
> 29: __libc_start_main()
> 30: _start()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Thanks!
> Tony
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx