Re: Is it a bug that OSD crashed when it's full?


 



If it's the same issue, I'd check the fragmentation score across the entire cluster ASAP. You may have other OSDs close to the limit, and it's harder to fix when all your OSDs cross the line at once. If you drain this one, it may push the others into the red zone if you're too close, making the problem much worse.
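
A minimal sketch of such a check, assuming the score is read per OSD with `ceph daemon osd.<id> bluestore allocator score block` on the host running that OSD, and that the command prints JSON with a `fragmentation_rating` field. The threshold value and the helper names are hypothetical; the thread does not name a specific limit.

```python
import json
import subprocess

# Hypothetical alert level; pick one that suits your own cluster.
WARN_THRESHOLD = 0.8


def parse_score(raw_json):
    """Extract the rating, assuming output like {"fragmentation_rating": 0.36}."""
    return float(json.loads(raw_json)["fragmentation_rating"])


def query_score(osd_id):
    """Run the admin-socket query; must be executed on the OSD's own host."""
    cmd = ["ceph", "daemon", f"osd.{osd_id}",
           "bluestore", "allocator", "score", "block"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_score(result.stdout)


def osds_at_risk(scores, threshold=WARN_THRESHOLD):
    """Return OSD ids whose score is at or above the threshold, worst first."""
    risky = [(score, osd) for osd, score in scores.items() if score >= threshold]
    return [osd for score, osd in sorted(risky, reverse=True)]
```

Collecting `query_score()` results into a dict keyed by OSD id and passing it to `osds_at_risk()` gives a list of the OSDs to look at first, before draining anything onto them.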

Our cluster has been stable since we split all the DBs onto their own volumes.

Really looking forward to the 4k fix.  :) But the workaround seems solid.

Thanks,
Kevin


________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: Tuesday, November 1, 2022 4:34 PM
To: Tony Liu; ceph-users@xxxxxxx; dev@xxxxxxx
Subject:  Re: Is it a bug that OSD crashed when it's full?


Hi Tony,

First of all, let me share my understanding of the issue you're facing.
It reminds me of an upstream ticket, and I presume my root cause analysis
from there (https://tracker.ceph.com/issues/57672#note-9) applies
in your case as well.

So generally speaking, your OSD isn't 100% full: from the log output one
can see that 0x57acbc000 of 0x6fc8400000 bytes are free. But there are
not enough contiguous 64K chunks for BlueFS to keep operating.
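
To make the distinction concrete, here is a small sketch that uses the two byte counts quoted above plus a toy free-extent list. The toy layout and the helper names are assumptions for illustration only, and the chunk counting is a simplified model, not the actual BlueStore allocator.

```python
# Values copied from the log output discussed above.
FREE_BYTES = 0x57ACBC000
CAPACITY_BYTES = 0x6FC8400000
BLOCK_SIZE = 0x1000      # 4K block size reported in the log
BLUEFS_UNIT = 0x10000    # 64K chunk size BlueFS needs here


def free_ratio(free, capacity):
    """Fraction of the device that is still free."""
    return free / capacity


def usable_units(extents, unit):
    """Count unit-sized, unit-aligned chunks inside free (offset, length) extents."""
    total = 0
    for offset, length in extents:
        start = -(-offset // unit) * unit        # round offset up to the unit
        end = (offset + length) // unit * unit   # round extent end down
        if end > start:
            total += (end - start) // unit
    return total


# Hypothetical worst case: every other 4K block is free across 1 MiB.
toy_fragmented = [(off, BLOCK_SIZE) for off in range(0, 1 << 20, 2 * BLOCK_SIZE)]
toy_contiguous = [(0, 1 << 20)]

print(f"free ratio: {free_ratio(FREE_BYTES, CAPACITY_BYTES):.1%}")
print("fragmented 64K chunks:", usable_units(toy_fragmented, BLUEFS_UNIT))
print("contiguous 64K chunks:", usable_units(toy_contiguous, BLUEFS_UNIT))
```

In the fragmented toy layout half of the space is free, yet no 64K chunk can be carved out of it, which is the situation described above in miniature.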

As a result, the OSD slipped past all the *full* safeguards and reached the
state where it crashed; these safety measures just weren't designed to
take free space fragmentation into account.

Similarly, the lack of available 64K chunks prevents the OSD from starting up:
it needs to write out some more data to BlueFS during startup recovery.

I'm currently working on enabling BlueFS to function with the default main
device allocation unit (4K), which will hopefully fix the above issue.


Meanwhile, you might want to work around the OSD's current state by
setting bluefs_shared_alloc_size to 32K. This might have some
operational and performance effects, but the OSD should very likely be able
to start up afterwards. Please do not use 4K for now; it's known to
cause more problems in some circumstances. I'd also highly recommend
redeploying the OSD ASAP once you have drained all the data off it; I presume
that's the reason why you want to bring it up instead of letting the
cluster recover through the regular means applied on OSD loss.
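
As a worked example of what the allocation unit changes, the sketch below rounds the request size from the log (needed 0x8064a) up to a multiple of the unit. With 64K this reproduces the 0x90000 from the "unable to allocate" line. The 32K result is plain arithmetic under the assumption that requests are simply rounded up to the configured unit; it is not taken from the log.

```python
NEEDED = 0x8064A     # "allocation failed, needed 0x8064a" from the log
UNIT_64K = 0x10000   # allocation unit in effect when the OSD crashed
UNIT_32K = 0x8000    # suggested workaround value


def round_up(size, unit):
    """Round size up to the next multiple of unit."""
    return -(-size // unit) * unit


def chunks_needed(size, unit):
    """Number of unit-sized chunks required to hold size bytes."""
    return round_up(size, unit) // unit


for unit in (UNIT_64K, UNIT_32K):
    print(hex(unit), hex(round_up(NEEDED, unit)), chunks_needed(NEEDED, unit))
# prints:
# 0x10000 0x90000 9
# 0x8000 0x88000 17
```

The 32K setting needs more chunks, but each one only has to be 32K long, so the allocation can be served from free extents that are too small to hold a 64K chunk.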

An alternative approach would be to add a standalone DB volume and migrate
BlueFS there; ceph-volume should be able to do that even in the OSD's
current state. Expanding the main volume (if it is backed by LVM and spare
space is available) is apparently a valid option too.


Thanks,

Igor


On 11/1/2022 8:09 PM, Tony Liu wrote:
> The actual question is: is a crash expected when an OSD is full?
> My focus is more on how to prevent this from happening.
> My expectation is that the OSD rejects write requests when it's full, rather than crashing.
> Otherwise, there is no point in having a ratio threshold.
> Please let me know if this is by design or a bug.
>
> Thanks!
> Tony
> ________________________________________
> From: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Sent: October 31, 2022 05:46 PM
> To: ceph-users@xxxxxxx; dev@xxxxxxx
> Subject:  Is it a bug that OSD crashed when it's full?
>
> Hi,
>
> Based on the docs, Ceph prevents you from writing to a full OSD so that you don’t lose data.
> In my case, with v16.2.10, the OSD crashed when it was full. Is this expected, or is it a bug?
> I'd expect a write failure instead of an OSD crash. It keeps crashing when I try to bring it up.
> Is there any way to bring it back?
>
>      -7> 2022-10-31T22:52:57.426+0000 7fe37fd94200  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", "log_files": [23300]}
>      -6> 2022-10-31T22:52:57.426+0000 7fe37fd94200  4 rocksdb: [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
>      -5> 2022-10-31T22:52:57.529+0000 7fe37fd94200  3 rocksdb: [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
>      -4> 2022-10-31T22:52:57.592+0000 7fe37fd94200  1 bluefs _allocate unable to allocate 0x90000 on bdev 1, allocator name block, allocator type hybrid, capacity    , block size 0x1000, free 0x57acbc000, fragmentation 0.359784, allocated 0x0
>      -3> 2022-10-31T22:52:57.592+0000 7fe37fd94200 -1 bluefs _allocate allocation failed, needed 0x8064a
>      -2> 2022-10-31T22:52:57.592+0000 7fe37fd94200 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x8064a
>      -1> 2022-10-31T22:52:57.604+0000 7fe37fd94200 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")
>
>   ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>   1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x55858d7e2e7c]
>   2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1]
>   3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
>   4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55858defa0b2]
>   5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb]
>   6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f]
>   7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55858e4c02aa]
>   8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55858e4c1700]
>   9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86]
>   10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc]
>   11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc]
>   12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d]
>   13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8]
>   14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
>   15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55858e3f0ea5]
>   16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55858e3f35de]
>   17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55858e3f4938]
>   18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55858e3ee65d]
>   19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55858e3ef9f5]
>   20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55858e367601]
>   21: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55858ddde857]
>   22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55858de4c8f7]
>   23: (BlueStore::_mount()+0x204) [0x55858de4f7b4]
>   24: (OSD::init()+0x380) [0x55858d91d1d0]
>   25: main()
>   26: __libc_start_main()
>   27: _start()
>
>       0> 2022-10-31T22:52:57.617+0000 7fe37fd94200 -1 *** Caught signal (Aborted) **
>   in thread 7fe37fd94200 thread_name:ceph-osd
>
>   ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>   1: /lib64/libpthread.so.0(+0x12cf0) [0x7fe37dd33cf0]
>   2: gsignal()
>   3: abort()
>   4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55858d7e2f4d]
>   5: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55858dee8cc1]
>   6: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
>   7: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55858defa0b2]
>   8: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55858df129eb]
>   9: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f]
>   10: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55858e4c02aa]
>   11: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55858e4c1700]
>   12: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86]
>   13: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc]
>   14: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc]
>   15: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d]
>   16: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8]
>   17: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
>   18: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55858e3f0ea5]
>   19: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55858e3f35de]
>   20: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55858e3f4938]
>   21: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55858e3ee65d]
>   22: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55858e3ef9f5]
>   23: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55858e367601]
>   24: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55858ddde857]
>   25: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55858de4c8f7]
>   26: (BlueStore::_mount()+0x204) [0x55858de4f7b4]
>   27: (OSD::init()+0x380) [0x55858d91d1d0]
>   28: main()
>   29: __libc_start_main()
>   30: _start()
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Thanks!
> Tony
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io/

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx




