OSD daemon randomly stops

One of our OSDs stopped unexpectedly, and I have not found the cause. The Ceph cluster currently has a lot of recovery operations running. The OSD log, ending in a failed assert, is below:

   -14> 2016-09-02 11:32:38.672460 7fcf65514700  5 -- op tracker -- seq: 1147, time: 2016-09-02 11:32:38.672460, event: queued_for_pg, op: osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
   -13> 2016-09-02 11:32:38.672533 7fcf70d40700  5 -- op tracker -- seq: 1147, time: 2016-09-02 11:32:38.672533, event: reached_pg, op: osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
   -12> 2016-09-02 11:32:38.672548 7fcf70d40700  5 -- op tracker -- seq: 1147, time: 2016-09-02 11:32:38.672548, event: started, op: osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
   -11> 2016-09-02 11:32:38.672548 7fcf7cd58700  1 -- [].28:6800/27735 <== mon.0 [].249:6789/0 60 ==== pg_stats_ack(0 pgs tid 45) v1 ==== 4+0+0 (0 0 0) 0x55a4443b1400 con 0x55a4434a4e80
   -10> 2016-09-02 11:32:38.672559 7fcf70d40700  1 -- [].28:6801/27735 --> [].31:6801/2070838 -- osd_sub_op(unknown.0.0:0 7.d1 MIN [scrub-unreserve] v 0'0 snapset=0=[]:[]) v12 -- ?+0 0x55a443aec100 con 0x55a443be0600
    -9> 2016-09-02 11:32:38.672571 7fcf70d40700  5 -- op tracker -- seq: 1147, time: 2016-09-02 11:32:38.672571, event: done, op: osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
    -8> 2016-09-02 11:32:38.681929 7fcf7b555700  1 -- [].28:6801/27735 <== osd.2 [].26:6801/9468 148 ==== MBackfillReserve GRANT  pgid: 15.11, query_epoch: 4235 v3 ==== 30+0+0 (3067148394 0 0) 0x55a4441f65a0 con 0x55a4434ab200
    -7> 2016-09-02 11:32:38.682009 7fcf7b555700  5 -- op tracker -- seq: 1148, time: 2016-09-02 11:32:38.682008, event: done, op: MBackfillReserve GRANT  pgid: 15.11, query_epoch: 4235
    -6> 2016-09-02 11:32:38.682068 7fcf73545700  5 osd.4 pg_epoch: 4235 pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739 ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233 pi=4002-4232/47 (log bound mismatch, actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+wait_backfill] exit Started/Primary/Active/WaitRemoteBackfillReserved 221.748180 6 0.000056
    -5> 2016-09-02 11:32:38.682109 7fcf73545700  5 osd.4 pg_epoch: 4235 pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739 ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233 pi=4002-4232/47 (log bound mismatch, actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+wait_backfill] enter Started/Primary/Active/Backfilling
    -4> 2016-09-02 11:32:38.682584 7fcf7b555700  1 -- [].28:6801/27735 <== osd.6 [].30:6801/44406 171 ==== osd pg remove(epoch 4235; pg6.19; ) v2 ==== 30+0+0 (522063165 0 0) 0x55a44392f680 con 0x55a443bae100
    -3> 2016-09-02 11:32:38.682600 7fcf7b555700  5 -- op tracker -- seq: 1149, time: 2016-09-02 11:32:38.682600, event: started, op: osd pg remove(epoch 4235; pg6.19; )
    -2> 2016-09-02 11:32:38.682616 7fcf7b555700  5 osd.4 4235 queue_pg_for_deletion: 6.19
    -1> 2016-09-02 11:32:38.685425 7fcf7b555700  5 -- op tracker -- seq: 1149, time: 2016-09-02 11:32:38.685421, event: done, op: osd pg remove(epoch 4235; pg6.19; )
     0> 2016-09-02 11:32:38.690487 7fcf6c537700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7fcf6c537700 time 2016-09-02 11:32:38.688536
osd/ReplicatedPG.cc: 11345: FAILED assert(r >= 0)

 2016-09-02 11:32:38.711869 7fcf6c537700 -1 *** Caught signal (Aborted) **
 in thread 7fcf6c537700 thread_name:tp_osd_recov

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x55a402375b02]
 2: (()+0x10330) [0x7fcfa2b51330]
 3: (gsignal()+0x37) [0x7fcfa0bb3c37]
 4: (abort()+0x148) [0x7fcfa0bb7028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55a40246cf85]
 6: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)+0xad2) [0x55a401f4f482]
 7: (ReplicatedPG::update_range(PG::BackfillInterval*, ThreadPool::TPHandle&)+0x614) [0x55a401f4fac4]
 8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&, bool*)+0x337) [0x55a401f6fc87]
 9: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0x8a0) [0x55a401fa1160]
 10: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x355) [0x55a401e31555]
 11: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0xd) [0x55a401e7a0dd]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a40245e18e]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x55a40245f070]
 14: (()+0x8184) [0x7fcfa2b49184]
 15: (clone()+0x6d) [0x7fcfa0c7737d]
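
In case it helps anyone comparing this against their own logs, here is a rough Python sketch that pulls the failed assert and the named backtrace frames out of an OSD log. The function name summarize_crash and both regular expressions are my own assumptions, written only against the format of the 10.2.2 excerpt above; they may not match logs from other Ceph versions.

```python
import re

# Assumed format, taken from the excerpt above:
#   "osd/ReplicatedPG.cc: 11345: FAILED assert(r >= 0)"
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)
# Assumed format of a numbered backtrace frame:
#   " 6: (Symbol(args)+0xad2) [0x55a401f4f482]"
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+): \((?P<sym>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]\s*$"
)


def summarize_crash(log_text):
    """Return (assert_info, frames) parsed from the text of an OSD log.

    assert_info is (source_file, line_number, expression), or None if no
    failed assert was found. frames is a list of (index, symbol) tuples;
    frames with no resolved symbol, printed as "()", are skipped.
    """
    assert_info = None
    frames = []
    for line in log_text.splitlines():
        m = ASSERT_RE.match(line)
        if m:
            assert_info = (m.group("file"), int(m.group("line")), m.group("expr"))
            continue
        m = FRAME_RE.match(line)
        if m and m.group("sym") != "()":
            frames.append((int(m.group("idx")), m.group("sym")))
    return assert_info, frames
```

Calling summarize_crash(open(path).read()) on a saved log (path being wherever the OSD log lives on your system) returns the assert location plus the call chain, which makes it easier to see at a glance that the abort came from the backfill path (scan_range called from recover_backfill).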

Any help with this would be appreciated.

Thanks,

Reed
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com