Hello, I had a problem with OSDs crashing after upgrading to BlueStore/Luminous, because I was using jemalloc and there seems to be a bug with BlueStore OSDs combined with jemalloc. Switching to tcmalloc solved my issues. I don't know if you are hitting the same issue, but in my environment the OSDs crashed mainly while repairing or when the cluster was under high load.
On 10/17/2017 2:51 PM, Ana Aviles wrote:
Hello all,
We had an inconsistent PG on our cluster. While performing a PG repair operation the OSD crashed. The OSD was not able to start again, and there was no hardware failure on the disk itself. This is the log output:
2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
13: (()+0x76ba) [0x7f2366be96ba]
14: (clone()+0x6d) [0x7f2365c603dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Thanks!
Ana
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com