Hi,

We have two clusters that are similar in number of hosts and disks and about the same age, both running Pacific (16.2.9). Both have a mix of hosts with 1TB and 2TB disks (disk capacities are not mixed within a host's OSDs). One of the clusters has had 21 OSD process crashes in the last 7 days; the other has had just 3.

Full stack as reported in the ceph-osd log:

2022-08-03T06:39:30.987+0300 7f118dd4d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f118dd4d700 thread_name:tp_osd_tp

 ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]
 2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]
 3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]
 4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275) [0x558cc2ed1525]
 5: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0xc7) [0x558cc2b97ca7]
 6: (PrimaryLogPG::get_snapset_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const*, bool)+0x3bd) [0x558cc2ade74d]
 7: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const*)+0x328) [0x558cc2adee48]
 8: (PrimaryLogPG::find_object_context(hobject_t const&, std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x20f) [0x558cc2aeac3f]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2661) [0x558cc2b353f1]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xcc7) [0x558cc2b42347]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x558cc29c6d9b]
 12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x558cc2c29b9a]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd1e) [0x558cc29e4dbe]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x558cc306a75c]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cc306dc20]
 16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f11b3ac76db]
 17: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

There was one bug fixed in 16.2.8 related to BlueStore::Onode::put():

"os/bluestore: avoid premature onode release (pr#44723 <https://github.com/ceph/ceph/pull/44723>, Igor Fedotov)"

and in the tracker: https://tracker.ceph.com/issues/53608

Is this segfault related to that bug, or is it something new? On the cluster with the crashes we upgraded in May '22 from 15.2.13 to 16.2.8, and two days later to 16.2.9.

Of the segfaults, 6 are on 2TB disks and 8 are on 1TB disks; the 2TB disks are newer (under two years old). Could it be related to hardware?

Thank you!
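PS: In case anyone wants to cross-check our counts, this is roughly how we tallied the recent crashes and mapped them to disk sizes (a sketch; it assumes the mgr crash module is enabled, and the metadata keys quoted in the grep are from our own `ceph osd metadata` output):

    # Crashes from the last 7 days; crash IDs begin with a timestamp, so a
    # lexical comparison on the first column is enough (NR>1 skips the header).
    ceph crash ls | awk -v cutoff="$(date -u -d '7 days ago' +%Y-%m-%d)" 'NR>1 && $1 >= cutoff'

    # Full backtrace and host for a single crash:
    ceph crash info <crash-id>

    # Backing device for an OSD, to attribute a crash to a 1TB or 2TB disk:
    ceph osd metadata <osd-id> | grep -E '"devices"|"bluestore_bdev_dev_node"'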
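We have not yet mapped the raw addresses back to source. Per the NOTE in the log, something like the following should work once debug symbols matching the running build are installed (a sketch; the dbg package name is distro-dependent and /usr/bin/ceph-osd is an assumed path):

    # Debug symbols for the exact running build (package name varies by distro):
    apt-get install ceph-osd-dbg

    # Disassembly with demangled names (-C) and interleaved source (-S); then
    # search it for BlueStore::Onode::put()+0x1bc from frame 3. Output is huge.
    objdump -rdSC /usr/bin/ceph-osd > ceph-osd.asm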
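On the hardware question, we plan to first rule out disk and memory errors with the usual checks (the device name below is just an example):

    # SMART health of the disk behind a crashing OSD:
    smartctl -a /dev/sda | grep -Ei 'reallocated|pending|uncorrect|overall-health'

    # Kernel-side I/O, ECC or machine-check errors around the crash times:
    journalctl -k --since '7 days ago' | grep -Ei 'i/o error|mce|ecc'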
--
Paul Jurco