Re: 16.2.9 High rate of Segmentation fault on ceph-osd processes

Hi!
Everything was restarted as required by the upgrade plan, in the proper order,
and the software was upgraded on all nodes. We are on Ubuntu 18 (all nodes).
The "ceph versions" output shows everything is on "16.2.9".
Thank you!

-- 
Paul Jurco


On Wed, Aug 10, 2022 at 5:43 PM Eneko Lacunza <elacunza@xxxxxxxxx> wrote:

> Hi Paul,
>
> Did you restart the OSDs after upgrading to 16.2.9? (You can just check
> with "ceph versions".)
>
> Do all crashes show a similar backtrace with BlueStore::Onode::put() ?
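> A quick way to check is the crash module, assuming it is enabled (a rough
> sketch; <crash_id> is a placeholder taken from the "ceph crash ls" output):
>
>   # list recent crashes, then inspect the backtrace of one of them
>   ceph crash ls
>   ceph crash info <crash_id> | grep -A 10 backtrace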
>
> Cheers
>
> On 10/8/22 at 14:22, Paul JURCO wrote:
>
> Hi,
> We have two clusters that are similar in number of hosts and disks, about
> the same age, both on Pacific 16.2.9.
> Both have a mix of hosts with 1TB and 2TB disks (disk capacities are not
> mixed within a host's OSDs).
> One of the clusters had 21 ceph-osd process crashes in the last 7 days, the
> other just 3.
> Full stack trace as reported in the ceph-osd log:
>
> 2022-08-03T06:39:30.987+0300 7f118dd4d700 -1 *** Caught signal (Segmentation
> fault) **
>
>  in thread 7f118dd4d700 thread_name:tp_osd_tp
>
>
>  ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> (stable)
>
>  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]
>
>  2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]
>
>  3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]
>
>  4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
> ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275)
> [0x558cc2ed1525]
>
>  5: (PGBackend::objects_get_attr(hobject_t const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0xc7)
> [0x558cc2b97ca7]
>
>  6: (PrimaryLogPG::get_snapset_context(hobject_t const&, bool,
> std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >, ceph::buffer::v15_2_0::list,
> std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > >,
> std::allocator<std::pair<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const,
> ceph::buffer::v15_2_0::list> > > const*, bool)+0x3bd) [0x558cc2ade74d]
>
>  7: (PrimaryLogPG::get_object_context(hobject_t const&, bool,
> std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >, ceph::buffer::v15_2_0::list,
> std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > >,
> std::allocator<std::pair<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const,
> ceph::buffer::v15_2_0::list> > > const*)+0x328) [0x558cc2adee48]
>
>  8: (PrimaryLogPG::find_object_context(hobject_t const&,
> std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x20f)
> [0x558cc2aeac3f]
>
>  9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2661)
> [0x558cc2b353f1]
>
>  10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0xcc7) [0x558cc2b42347]
>
>  11: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b)
> [0x558cc29c6d9b]
>
>  12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x558cc2c29b9a]
>
>  13: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xd1e) [0x558cc29e4dbe]
>
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac)
> [0x558cc306a75c]
>
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cc306dc20]
>
>  16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f11b3ac76db]
>
>  17: clone()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
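>
>  For anyone wanting to dig further, something along these lines should work
>  (a sketch only; the ceph-osd path and the debug-symbols package name depend
>  on the distro/repo):
>
>   # install the ceph debug symbols first, then disassemble with demangled
>   # symbols and interleaved source, and look for the crashing function
>   objdump -rdSC /usr/bin/ceph-osd > ceph-osd.dis
>   grep -n 'BlueStore::Onode::put' ceph-osd.dis | head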
>
> There was one bug fixed in 16.2.8 related to BlueStore::Onode::put():
> "os/bluestore: avoid premature onode release (pr#44723,
> https://github.com/ceph/ceph/pull/44723, Igor Fedotov)"
> and in the tracker: https://tracker.ceph.com/issues/53608
> Is this segfault related to that bug, or is this something new?
>
> On the cluster with the crashes we upgraded in May '22 from 15.2.13 to
> 16.2.8, and two days later to 16.2.9.
> 6 segfaults are on 2TB disks, 8 are on 1TB disks. The 2TB disks are newer
> (below 2 years old).
> Could it be related to hardware?
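> To start ruling hardware in or out, a first pass could be the kernel log and
> SMART data on the affected hosts (a rough sketch, not exhaustive; /dev/sdX is
> a placeholder):
>
>   # look for machine-check / I/O errors and for failing sectors
>   dmesg -T | egrep -i 'mce|hardware error|i/o error' | tail
>   smartctl -a /dev/sdX | egrep -i 'reallocated|pending|uncorrect'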
> Thank you!
>
>
> Eneko Lacunza
> Technical Director
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



