Re: 16.2.9 High rate of Segmentation fault on ceph-osd processes

Hi Paul,

Did you restart the OSDs after upgrading to 16.2.9? (You can check with "ceph versions".)
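
For example, "ceph versions" should report a single version per daemon type once everything has been restarted; the output looks roughly like this (daemon counts here are just illustrative):

  $ ceph versions
  {
      "mon": {
          "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 3
      },
      "osd": {
          "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 42
      },
      "overall": {
          "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)": 45
      }
  }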

Do all the crashes show a similar backtrace with BlueStore::Onode::put()?
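
You can pull the recorded backtraces from the crash module to compare them, e.g. ("<crash-id>" below is a placeholder for an ID listed by the first command):

  $ ceph crash ls
  $ ceph crash info <crash-id>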

Cheers

On 10/8/22 at 14:22, Paul JURCO wrote:
Hi,
We have two clusters that are similar in number of hosts and disks, and
about the same age, both running Pacific 16.2.9.
Both have a mix of hosts with 1TB and 2TB disks (disk capacities are not
mixed among a host's OSDs).
One of the clusters has had 21 OSD process crashes in the last 7 days; the
other has had just 3.
Full stack trace as reported in the ceph-osd log:

2022-08-03T06:39:30.987+0300 7f118dd4d700 -1 *** Caught signal (Segmentation
fault) **

  in thread 7f118dd4d700 thread_name:tp_osd_tp


  ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
(stable)

  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f11b3ad2980]

  2: (ceph::buffer::v15_2_0::ptr::release()+0x2d) [0x558cc32238ad]

  3: (BlueStore::Onode::put()+0x1bc) [0x558cc2ea321c]

  4: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ghobject_t const&, char const*, ceph::buffer::v15_2_0::ptr&)+0x275)
[0x558cc2ed1525]

  5: (PGBackend::objects_get_attr(hobject_t const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0xc7)
[0x558cc2b97ca7]

  6: (PrimaryLogPG::get_snapset_context(hobject_t const&, bool,
std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > > const*, bool)+0x3bd) [0x558cc2ade74d]

  7: (PrimaryLogPG::get_object_context(hobject_t const&, bool,
std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, ceph::buffer::v15_2_0::list,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const,
ceph::buffer::v15_2_0::list> > > const*)+0x328) [0x558cc2adee48]

  8: (PrimaryLogPG::find_object_context(hobject_t const&,
std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x20f)
[0x558cc2aeac3f]

  9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2661)
[0x558cc2b353f1]

  10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0xcc7) [0x558cc2b42347]

  11: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b)
[0x558cc29c6d9b]

  12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x558cc2c29b9a]

  13: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xd1e) [0x558cc29e4dbe]

  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac)
[0x558cc306a75c]

  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558cc306dc20]

  16: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f11b3ac76db]

  17: clone()
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
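
As a sketch of how one might dig into a frame such as #3 above, assuming the
matching ceph-osd binary with debug symbols is available (paths and package
names vary by distro), gdb can disassemble the function by name so you can
inspect the reported offset, or you can dump the whole binary as the NOTE
suggests:

  $ gdb -batch -ex "disassemble /s BlueStore::Onode::put" /usr/bin/ceph-osd
  $ objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm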

There was one bug fixed in 16.2.8 related to BlueStore::Onode::put():
"os/bluestore: avoid premature onode release (pr#44723
<https://github.com/ceph/ceph/pull/44723>, Igor Fedotov)"
and in the tracker: https://tracker.ceph.com/issues/53608
Is this segfault related to that bug, or is it a new one?

We upgraded the cluster with the crashes from 15.2.13 to 16.2.8 in May '22,
and two days later to 16.2.9.
6 of the segfaults are on 2TB disks, 8 are on 1TB disks. The 2TB disks are
newer (under 2 years old).
Could this be related to hardware?
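
(One way to check would be to correlate the crashing OSDs with their devices
and SMART data; a rough sketch, with device names and IDs as placeholders,
and assuming the mgr devicehealth module is enabled for the second command:)

  $ ceph device ls                          # map devices to OSD daemons
  $ ceph device get-health-metrics <devid>  # SMART data collected by the mgr
  $ smartctl -a /dev/sdb                    # run directly on the OSD host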
Thank you!

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



