OSD respawning -- FAILED assert(clone_size.count(clone))

Fellow Ceph Users,

I have 3 OSD nodes and 3 MONs on separate servers. Our storage was nearly full on some OSDs, so we added additional drives, almost doubling our space. Since then we have been getting OSDs that keep respawning. We also added RAM to the OSD nodes, going from 12G to 24G. It started with one OSD respawning, so we reweighted it to 0 and then removed it from the CRUSH map (the steps we ran are sketched below). Then another OSD started doing the same thing, and we repeated the process. We are now on the 5th drive. Some of these drives are new, so we think there may be something else going on besides just bad/corrupt drives. Can someone please help?
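For context, the per-OSD removal procedure we followed each time was essentially the standard sequence below (N stands in for the OSD id, which differed for each drive; this is a sketch of what we ran, from memory):

  ceph osd crush reweight osd.N 0   # drain the OSD's data off first
  ceph osd out N
  # ...wait for backfill to finish, stop the ceph-osd daemon on the node, then:
  ceph osd crush remove osd.N
  ceph auth del osd.N
  ceph osd rm N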

Here are some snippets from the log files. If there is anything else you want to see, let me know.

Thanks,
Chris


ceph-osd-03:syslog

Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744712] init: ceph-osd (ceph/23) main process (458) killed by ABRT signal
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744736] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871768] init: ceph-osd (ceph/23) main process (938) killed by ABRT signal
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871791] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527056] init: ceph-osd (ceph/23) main process (1463) killed by ABRT signal
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527079] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159178] init: ceph-osd (ceph/23) main process (1963) killed by ABRT signal
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159228] init: ceph-osd (ceph/23) main process ended, respawning
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843268] init: ceph-osd (ceph/23) main process (2478) killed by ABRT signal
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843282] init: ceph-osd (ceph/23) main process ended, respawning



ceph-osd-03:ceph-osd.23.log

-11> 2015-08-27 11:37:53.054359 7fef96fa4700 1 -- 10.21.0.23:6802/10219 <== osd.10 10.21.0.22:6800/3623 21 ==== osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 ==== 1226+0+15250 (4145454591 0 2457813372) 0x10229600 con 0xf885440
-10> 2015-08-27 11:37:53.054407 7fef96fa4700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054223, event: header_read, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-9> 2015-08-27 11:37:53.054420 7fef96fa4700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054226, event: throttled, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-8> 2015-08-27 11:37:53.054427 7fef96fa4700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054348, event: all_read, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-7> 2015-08-27 11:37:53.054434 7fef96fa4700 5 -- op tracker -- seq: 390, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-6> 2015-08-27 11:37:53.054571 7fef9ecdb700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054570, event: reached_pg, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-5> 2015-08-27 11:37:53.054606 7fef9ecdb700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054606, event: started, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-4> 2015-08-27 11:37:53.054737 7fef9ecdb700 5 -- op tracker -- seq: 390, time: 2015-08-27 11:37:53.054737, event: done, op: osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-3> 2015-08-27 11:37:53.054832 7fef9dcd9700 2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep] scrub_compare_maps osd.23 has 25 items
-2> 2015-08-27 11:37:53.054877 7fef9dcd9700 2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep] scrub_compare_maps replica 10 has 25 items
-1> 2015-08-27 11:37:53.055207 7fef9dcd9700 2 osd.23 pg_epoch: 141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941] local-les=141062 n=11076 ec=101 les/c 141062/141062 141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod 141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
0> 2015-08-27 11:37:53.060242 7fef9dcd9700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fef9dcd9700 time 2015-08-27 11:37:53.055278
osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone))

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
 2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79da36]
 3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa2a) [0x88491a]
 4: (PG::scrub_compare_maps()+0xd89) [0x7f3c39]
 5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f6d9e]
 6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f870e]
 7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cca59]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb3d1e]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4dc0]
 10: (()+0x8182) [0x7fefc0fb9182]
 11: (clone()+0x6d) [0x7fefbf52447d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
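If the disassembly mentioned in that NOTE would help, I assume we would capture it roughly like this (the path below is where the ceph-osd binary lives on our Ubuntu nodes; adjust if yours differs):

  objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump

I can run that on ceph-osd-03 and share the output if someone wants to look at it.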


root@ceph-admin:~# ceph -s
    cluster d960d672-e035-413d-ba39-8341f4131760
     health HEALTH_WARN
            3 pgs backfill
            2 pgs backfilling
            5 pgs stuck unclean
            2 requests are blocked > 32 sec
            recovery 363/10993480 objects degraded (0.003%)
            recovery 181602/10993480 objects misplaced (1.652%)
            pool libvirt-pool pg_num 512 > pgp_num 412
     monmap e1: 3 mons at {ceph-mon1=10.20.0.11:6789/0,ceph-mon2=10.20.0.12:6789/0,ceph-mon3=10.20.0.13:6789/0}
            election epoch 4780, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
     osdmap e141113: 46 osds: 42 up, 42 in; 5 remapped pgs
      pgmap v17289866: 1600 pgs, 4 pools, 20158 GB data, 5316 kobjects
            42081 GB used, 32394 GB / 74476 GB avail
            363/10993480 objects degraded (0.003%)
            181602/10993480 objects misplaced (1.652%)
                1595 active+clean
                   3 active+remapped+wait_backfill
                   2 active+remapped+backfilling
recovery io 83489 kB/s, 20 objects/s
  client io 13079 kB/s rd, 24569 kB/s wr, 113 op/s
root@ceph-admin:~#
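(As an aside, we also see the "pool libvirt-pool pg_num 512 > pgp_num 412" warning above. If it matters here, I assume the fix is simply:

  ceph osd pool set libvirt-pool pgp_num 512

but we have held off on that for now, since raising pgp_num will trigger additional rebalancing while the OSDs are already unstable.)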



