Hello all!

I'm getting repeated OSD crashes on 3 of our OSDs, with this stack trace:

ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
1: (()+0x911e70) [0x564d0067fe70]
2: (()+0xf5d0) [0x7f1272dad5d0]
3: (gsignal()+0x37) [0x7f1271dce2c7]
4: (abort()+0x148) [0x7f1271dcf9b8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f12762252b2]
6: (()+0x25a337) [0x7f1276225337]
7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x564d002ab480]
8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x564d002afafc]
9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x564d002c5f09]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x564d002cac09]
11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x564d00124c87]
12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564d0039d8c2]
13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x564d00144ae2]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f127622aec3]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f127622bab0]
16: (()+0x7dd5) [0x7f1272da5dd5]
17: (clone()+0x6d) [0x7f1271e95f6d]

I think this started after some network trouble, when one PG was marked recovery_unfound. To resolve that, the following command was executed:

ceph pg 2.f8 mark_unfound_lost revert

The last lines of debug output before the crash are also related to this PG:

-10001> 2020-04-23 14:16:18.790 7f534915a700 10 osd.12 pg_epoch: 10476 pg[2.f8( v 8673'30651498 (6953'30648450,8673'30651498] local-lis/les=10475/10476 n=14 ec=66/66 lis/c 10475/10427 les/c/f 10476/10428/201 10475/10475/10473) [12,13,17] r=0 lpr=10475 pi=[10427,10475)/2 crt=8673'30651498 lcod 0'0 mlcod 0'0 active mbc={}] get_object_context: obc NOT found in cache: 2:1f000000:.ceph-internal::hit_set_2.f8_archive_2020-04-22 02%3a57%3a10.496532Z_2020-04-22 03%3a57%3a11.211949Z:head
-10001> 2020-04-23 14:16:18.790 7f534915a700 10 osd.12 pg_epoch: 10476 pg[2.f8( v 8673'30651498 (6953'30648450,8673'30651498] local-lis/les=10475/10476 n=14 ec=66/66 lis/c 10475/10427 les/c/f 10476/10428/201 10475/10475/10473) [12,13,17] r=0 lpr=10475 pi=[10427,10475)/2 crt=8673'30651498 lcod 0'0 mlcod 0'0 active mbc={}] get_object_context: no obc for soid 2:1f000000:.ceph-internal::hit_set_2.f8_archive_2020-04-22 02%3a57%3a10.496532Z_2020-04-22 03%3a57%3a11.211949Z:head and !can_create

I have evicted most of the PGs from the cache pool, and currently we have only two rados objects left in this PG:

# rados --pgid 2.f8 ls
rbd_data.10d4416b8b4567.0000000000001dfb
rbd_header.1db946b8b4567

I tried to remove one of them:

rados -p vms-cache rm rbd_header.1db946b8b4567

but the command has not completed in the last ~8 hours, because the OSDs keep crashing.

Do we have any way to solve this problem without data loss?

--
MATPOCKuH
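P.S. The object named in the log lines above is the hit_set archive object for PG 2.f8. Would it be safe to stop one of the affected OSDs and inspect that PG offline with ceph-objectstore-tool? A rough sketch of what I have in mind (not yet attempted; assuming a default /var/lib/ceph/osd data path and osd.12 only as an example):

# systemctl stop ceph-osd@12
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 2.f8 --op list

Or would checking for (and possibly removing) the problematic object offline like this risk further data loss?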