Hi all,

I have a production cluster on which I recently purged all snapshots. Now, on a set of OSDs, I'm getting an assert like the one below whenever they backfill:

    -4> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit Started/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.000064
    -3> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter Started/Primary/Active/Backfilling
    -2> 2019-08-13 00:25:14.653 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.000000000000b4b8:head
    -1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 00:25:14.759270
    /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED ceph_assert(clone_overlap.count(clone))

     ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55989e4a6450]
     2: (()+0x517628) [0x55989e4a6628]
     3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
     4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
     5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
     6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
     7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
     8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
     9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
     10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
     11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
     12: (()+0x7fa3) [0x7ff47f718fa3]
     13: (clone()+0x3f) [0x7ff47f2c84cf]

     0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
     in thread 7ff4637b1700 thread_name:tp_osd_tp

     ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
     1: (()+0x12730) [0x7ff47f723730]
     2: (gsignal()+0x10b) [0x7ff47f2067bb]
     3: (abort()+0x121) [0x7ff47f1f1535]
     4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55989e4a64a1]
     5: (()+0x517628) [0x55989e4a6628]
     6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
     7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
     8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
     9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
     10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
     11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
     12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
     13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
     14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
     15: (()+0x7fa3) [0x7ff47f718fa3]
     16: (clone()+0x3f) [0x7ff47f2c84cf]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The failing assert is FAILED ceph_assert(clone_overlap.count(clone)) in SnapSet::get_clone_bytes(), and backfill_pos points at rbd_data.dae7bc6b8b4567.000000000000b4b8:head, so it looks like backfill is tripping over leftover clone metadata on that object even though no snapshots exist any more. If possible I'd like to 'nuke' this from the OSD, since there are no snaps active, but I'd love some advice on the best way to go about it.
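What I had in mind is something along these lines with ceph-objectstore-tool while osd.99 is down. This is untested and partly from memory: I'm assuming the default data path, that the leftover clone ids show up in the object's dump output, and that remove-clone-metadata is the right op for this (and that my syntax matches 14.2.1), so please correct me if any of that is wrong:

    # keep the cluster from rebalancing while the OSD is down, then stop it
    ceph osd set noout
    systemctl stop ceph-osd@99

    # take a safety copy of the whole PG before touching anything
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --pgid 0.12ed --op export --file /root/pg-0.12ed.export

    # find the JSON identifier of the object backfill stopped on
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --pgid 0.12ed --op list | grep rbd_data.dae7bc6b8b4567.000000000000b4b8

    # inspect its SnapSet to see which clone ids are still referenced
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --pgid 0.12ed '<object json from the list above>' dump

    # drop the stale clone metadata for each leftover clone id
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --pgid 0.12ed '<object json from the list above>' remove-clone-metadata <cloneid>

    # bring the OSD back and let backfill retry
    systemctl start ceph-osd@99
    ceph osd unset noout

Does that look sane, or is there a better/safer way to clear this out?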
Best regards,
Kevin Myers