ceph osd crash help needed

Hi all,

I have a production cluster on which I recently purged all snaps.

Now, on a set of OSDs, I'm hitting an assert like the one below whenever they backfill:

    -4> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit Started/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.000064
    -3> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter Started/Primary/Active/Backfilling
    -2> 2019-08-13 00:25:14.653 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.000000000000b4b8:head
    -1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 00:25:14.759270
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55989e4a6450]
 2: (()+0x517628) [0x55989e4a6628]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 12: (()+0x7fa3) [0x7ff47f718fa3]
 13: (clone()+0x3f) [0x7ff47f2c84cf]

     0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
 in thread 7ff4637b1700 thread_name:tp_osd_tp

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
 1: (()+0x12730) [0x7ff47f723730]
 2: (gsignal()+0x10b) [0x7ff47f2067bb]
 3: (abort()+0x121) [0x7ff47f1f1535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55989e4a64a1]
 5: (()+0x517628) [0x55989e4a6628]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 15: (()+0x7fa3) [0x7ff47f718fa3]
 16: (clone()+0x3f) [0x7ff47f2c84cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



The key failure is: FAILED ceph_assert(clone_overlap.count(clone))
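
My reading of this is that the object's on-disk SnapSet still lists a clone for which the clone_overlap map no longer has an entry after the purge, so the stats pass during backfill trips the assert. Here is a minimal, self-contained sketch of that failure mode (my own simplified model, not the Ceph source; in the real SnapSet, clone_overlap maps to interval_set<uint64_t> rather than plain byte counts):

    // Simplified model of SnapSet::get_clone_bytes() and why it aborts.
    #include <cassert>
    #include <cstdint>
    #include <map>

    using snapid_t = uint64_t;

    struct SnapSetSketch {
        std::map<snapid_t, uint64_t> clone_size;     // total bytes per clone
        std::map<snapid_t, uint64_t> clone_overlap;  // bytes shared with the next clone

        uint64_t get_clone_bytes(snapid_t clone) const {
            assert(clone_size.count(clone));
            uint64_t size = clone_size.at(clone);
            assert(clone_overlap.count(clone));      // <-- the assert that fires on osd.99
            return size - clone_overlap.at(clone);   // bytes uniquely charged to the clone
        }
    };

    int main() {
        SnapSetSketch ss;
        ss.clone_size[4] = 4 << 20;  // clone 4 is listed in clone_size...
        // ...but has no clone_overlap entry, mirroring the inconsistent
        // SnapSet left behind by the purge: this call aborts.
        return static_cast<int>(ss.get_clone_bytes(4));
    }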

If possible I'd like to 'nuke' this from the OSD, since there are no snaps active any more, but I would love some advice on the best way to go about it.
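
My current thinking, with the affected OSD stopped, is something along these lines (an untested sketch; the data path, PG id, and object name are taken from the log above, and the JSON object spec and clone id would come from the list/dump output):

    # stop the crashing OSD so the store can be opened offline
    systemctl stop ceph-osd@99

    # find the exact object spec for the object backfill trips over
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --pgid 0.12ed --op list rbd_data.dae7bc6b8b4567.000000000000b4b8

    # dump it to confirm which clone is stale in the SnapSet
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        '<json-spec-from-op-list>' dump

    # drop the leftover clone metadata for that clone id
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        '<json-spec-from-op-list>' remove-clone-metadata <cloneid>

But I'm not at all sure remove-clone-metadata is the right tool for this particular inconsistency, or whether it would need doing on every OSD holding the PG, hence the question.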

best regards
Kevin Myers

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


