Hello all!

I'm getting repeated OSD crashes on 3 of our OSDs, with this stack trace:

ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
1: (()+0x911e70) [0x564d0067fe70]
2: (()+0xf5d0) [0x7f1272dad5d0]
3: (gsignal()+0x37) [0x7f1271dce2c7]
4: (abort()+0x148) [0x7f1271dcf9b8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f12762252b2]
6: (()+0x25a337) [0x7f1276225337]
7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x564d002ab480]
8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x564d002afafc]
9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x564d002c5f09]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x564d002cac09]
11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x564d00124c87]
12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564d0039d8c2]
13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x564d00144ae2]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f127622aec3]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f127622bab0]
16: (()+0x7dd5) [0x7f1272da5dd5]
17: (clone()+0x6d) [0x7f1271e95f6d]

I think this started after some network trouble, when one PG was marked recovery_unfound. To resolve that, the following command was executed:

ceph pg 2.f8 mark_unfound_lost revert

The last lines of debug output before the crash are also related to this PG:

-10001> 2020-04-23 14:16:18.790 7f534915a700 10 osd.12 pg_epoch: 10476 pg[2.f8( v 8673'30651498 (6953'30648450,8673'30651498] local-lis/les=10475/10476 n=14 ec=66/66 lis/c 10475/10427 les/c/f 10476/10428/201 10475/10475/10473) [12,13,17] r=0 lpr=10475 pi=[10427,10475)/2 crt=8673'30651498 lcod 0'0 mlcod 0'0 active mbc={}] get_object_context: obc NOT found in cache: 2:1f000000:.ceph-internal::hit_set_2.f8_archive_2020-04-22 02%3a57%3a10.496532Z_2020-04-22 03%3a57%3a11.211949Z:head
-10001> 2020-04-23 14:16:18.790 7f534915a700 10 osd.12 pg_epoch: 10476 pg[2.f8( v 8673'30651498 (6953'30648450,8673'30651498] local-lis/les=10475/10476 n=14 ec=66/66 lis/c 10475/10427 les/c/f 10476/10428/201 10475/10475/10473) [12,13,17] r=0 lpr=10475 pi=[10427,10475)/2 crt=8673'30651498 lcod 0'0 mlcod 0'0 active mbc={}] get_object_context: no obc for soid 2:1f000000:.ceph-internal::hit_set_2.f8_archive_2020-04-22 02%3a57%3a10.496532Z_2020-04-22 03%3a57%3a11.211949Z:head and !can_create

I have evicted most of the PGs from the cache pool, and currently we have only two rados objects left in this PG:

# rados --pgid 2.f8 ls
rbd_data.10d4416b8b4567.0000000000001dfb
rbd_header.1db946b8b4567

I tried to remove one of them:

rados -p vms-cache rm rbd_header.1db946b8b4567

but the command has not completed in the last ~8 hours, because the OSDs keep crashing.

Do we have any way to solve this problem without data loss?

--
MATPOCKuH
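P.S. The object named in the log lines above is the hit_set archive object for PG 2.f8. Would it be safe to stop one of the affected OSDs and inspect that PG offline with ceph-objectstore-tool? A rough sketch of what I have in mind (not yet attempted; assuming a default /var/lib/ceph/osd data path and osd.12 only as an example):

# systemctl stop ceph-osd@12
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 2.f8 --op list

Or would checking for (and possibly removing) the problematic object offline like this risk further data loss?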