Hi,

We have two clusters that are similar in number of hosts and disks and about the same age, both running Pacific (16.2.9). Both have a mix of hosts with 1TB and 2TB disks (disk capacities are not mixed within a host's OSDs). One of the clusters has had 21 OSD process crashes in the last 7 days; the other has had just 3.

Full stack as reported in the ceph-osd log:

2022-08-03T06:39:30.987+0300 7f118dd4d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f118dd4d700 thread_name:tp_osd_tp

 ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]
 2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]
 3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]
 4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275) [0x558cc2ed1525]
 5: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0xc7) [0x558cc2b97ca7]
 6: (PrimaryLogPG::get_snapset_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const*, bool)+0x3bd) [0x558cc2ade74d]
 7: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const*)+0x328) [0x558cc2adee48]
 8: (PrimaryLogPG::find_object_context(hobject_t const&, std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x20f) [0x558cc2aeac3f]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2661) [0x558cc2b353f1]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xcc7) [0x558cc2b42347]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x558cc29c6d9b]
 12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x558cc2c29b9a]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd1e) [0x558cc29e4dbe]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x558cc306a75c]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cc306dc20]
 16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f11b3ac76db]
 17: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

There was one bug fixed in 16.2.8 related to BlueStore::Onode::put():

"os/bluestore: avoid premature onode release (pr#44723 <https://github.com/ceph/ceph/pull/44723>, Igor Fedotov)"

and in the tracker: https://tracker.ceph.com/issues/53608

Is this segfault related to that bug, or is it something new? On the cluster with the crashes we upgraded in May '22 from 15.2.13 to 16.2.8, and two days later to 16.2.9.

Of the segfaults, 6 are on 2TB disks and 8 are on 1TB disks; the 2TB disks are newer (under two years old). Could it be related to hardware?

Thank you!
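PS: In case anyone wants to cross-check our counts, this is roughly how we tallied the recent crashes and mapped them to disk sizes (a sketch; it assumes the mgr crash module is enabled, and the metadata keys quoted in the grep are from our own `ceph osd metadata` output):

    # Crashes from the last 7 days; crash IDs begin with a timestamp, so a
    # lexical comparison on the first column is enough (NR>1 skips the header).
    ceph crash ls | awk -v cutoff="$(date -u -d '7 days ago' +%Y-%m-%d)" 'NR>1 && $1 >= cutoff'

    # Full backtrace and host for a single crash:
    ceph crash info <crash-id>

    # Backing device for an OSD, to attribute a crash to a 1TB or 2TB disk:
    ceph osd metadata <osd-id> | grep -E '"devices"|"bluestore_bdev_dev_node"'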
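We have not yet mapped the raw addresses back to source. Per the NOTE in the log, something like the following should work once debug symbols matching the running build are installed (a sketch; the dbg package name is distro-dependent and /usr/bin/ceph-osd is an assumed path):

    # Debug symbols for the exact running build (package name varies by distro):
    apt-get install ceph-osd-dbg

    # Disassembly with demangled names (-C) and interleaved source (-S); then
    # search it for BlueStore::Onode::put()+0x1bc from frame 3. Output is huge.
    objdump -rdSC /usr/bin/ceph-osd > ceph-osd.asm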
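On the hardware question, we plan to first rule out disk and memory errors with the usual checks (the device name below is just an example):

    # SMART health of the disk behind a crashing OSD:
    smartctl -a /dev/sda | grep -Ei 'reallocated|pending|uncorrect|overall-health'

    # Kernel-side I/O, ECC or machine-check errors around the crash times:
    journalctl -k --since '7 days ago' | grep -Ei 'i/o error|mce|ecc'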
--
Paul Jurco