Hi Paul,
the ticket you mentioned (https://tracker.ceph.com/issues/53608) does
appear to be relevant in your case. It looks like the issue hasn't been
completely fixed yet: we've received a number of telemetry reports showing
it still happens in 16.2.9.
Unfortunately there is no solution so far, but this is definitely not a
hardware issue.
Thanks,
Igor
On 8/10/2022 3:22 PM, Paul JURCO wrote:
Hi,
We have two clusters, similar in number of hosts and disks and about the
same age, both running Pacific 16.2.9.
Both have a mix of hosts with 1TB and 2TB disks (disk capacities are not
mixed within a host for OSDs).
One of the clusters has had 21 OSD process crashes in the last 7 days; the
other has had just 3.
Full stack as reported in ceph-osd log:
2022-08-03T06:39:30.987+0300 7f118dd4d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f118dd4d700 thread_name:tp_osd_tp
ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
(stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]
2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]
3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]
4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275)
[0x558cc2ed1525]
5: (PGBackend::objects_get_attr(hobject_t const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0xc7)
[0x558cc2b97ca7]
6: (PrimaryLogPG::get_snapset_context(hobject_t const&, bool,
std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > > const*, bool)+0x3bd) [0x558cc2ade74d]
7: (PrimaryLogPG::get_object_context(hobject_t const&, bool,
std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > > const*)+0x328) [0x558cc2adee48]
8: (PrimaryLogPG::find_object_context(hobject_t const&,
std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x20f)
[0x558cc2aeac3f]
9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2661)
[0x558cc2b353f1]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0xcc7) [0x558cc2b42347]
11: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b)
[0x558cc29c6d9b]
12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x558cc2c29b9a]
13: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xd1e) [0x558cc29e4dbe]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac)
[0x558cc306a75c]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cc306dc20]
16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f11b3ac76db]
17: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
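For anyone comparing their own crashes against this one, a quick way to see whether a backtrace passes through BlueStore::Onode::put() is to scan the frames for that symbol. Below is a minimal Python sketch, assuming the backtrace is available as plain text in the same "N: (symbol+0xoffset) [address]" layout as above, with long frames wrapped over several lines. The helper names split_frames and frames_containing are hypothetical and not part of Ceph, and a matching frame only suggests a similar crash path; it does not prove the crash is the same bug.

```python
import re

# A frame starts with "<number>: " at the beginning of a line (after optional spaces).
FRAME_START = re.compile(r"^\s*(\d+):\s+(.*)$")


def split_frames(backtrace_text):
    """Return {frame_number: frame_text}, re-joining frames wrapped over several lines."""
    frames = {}
    current = None
    for line in backtrace_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # The trailing "NOTE: ..." line marks the end of the backtrace.
        if stripped.startswith("NOTE:"):
            break
        match = FRAME_START.match(line)
        if match:
            current = int(match.group(1))
            frames[current] = match.group(2).strip()
        elif current is not None:
            # Continuation of a wrapped frame.
            frames[current] += " " + stripped
    return frames


def frames_containing(backtrace_text, symbol):
    """Return the sorted frame numbers whose text contains the given symbol."""
    frames = split_frames(backtrace_text)
    return [number for number in sorted(frames) if symbol in frames[number]]


# Shortened excerpt of the backtrace above, used only as sample input.
sample = """\
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]
 2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]
 3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]
 4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
 ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275)
 [0x558cc2ed1525]
 NOTE: a copy of the executable is needed to interpret this.
"""

print(frames_containing(sample, "BlueStore::Onode::put("))  # [3]
```

A plain substring match is used on purpose, so the check still works if the archive has wrapped the symbol in emphasis markers.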
There was one bug fixed in 16.2.8 related to BlueStore::Onode::put():
"os/bluestore: avoid premature onode release (pr#44723
<https://github.com/ceph/ceph/pull/44723>, Igor Fedotov)"
and in the tracker: https://tracker.ceph.com/issues/53608
Is this segfault related to the bug? Is this new?
We upgraded in May '22 from 15.2.13 to 16.2.8, and two days later to
16.2.9, on the cluster with the crashes.
6 segfaults were on 2TB disks, 8 on 1TB disks. The 2TB disks are newer
(less than 2 years old).
Could this be related to hardware?
Thank you!
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx