Fellow Ceph Users,
I have 3 OSD nodes and 3 MONs on separate servers. Our storage was near
full on some OSDs, so we added additional drives, almost doubling our
space. Since then we have been getting OSDs that keep respawning. We
also added RAM to the OSD nodes, going from 12G to 24G. It started with
one OSD respawning, so we reweighted it to 0 and then removed it from
the CRUSH map (the rough commands we used are below). Then another OSD
started doing the same thing, and we repeated the process. We are now
on the 5th drive. Some of these drives are new, so we suspect something
else is going on besides bad/corrupt drives. Can someone please help?
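For reference, the drain/remove steps we ran for each failing OSD were
roughly the following (osd.N is a placeholder; the actual IDs varied
each time, and the stop command is the upstart syntax matching the
init lines in the syslog below):

# drain the failing OSD by reweighting it to 0 in the CRUSH map
ceph osd crush reweight osd.N 0
# once backfill settles, mark it out and stop the daemon on the node
ceph osd out N
stop ceph-osd id=N
# remove it from the CRUSH map, auth database, and OSD map
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N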
Here are some snippets from the log files. If there is anything else
you want to see, let me know.
Thanks,
Chris
ceph-osd-03:syslog
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744712] init: ceph-osd
(ceph/23) main process (458) killed by ABRT signal
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744736] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871768] init: ceph-osd
(ceph/23) main process (938) killed by ABRT signal
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871791] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527056] init: ceph-osd
(ceph/23) main process (1463) killed by ABRT signal
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527079] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159178] init: ceph-osd
(ceph/23) main process (1963) killed by ABRT signal
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159228] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843268] init: ceph-osd
(ceph/23) main process (2478) killed by ABRT signal
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843282] init: ceph-osd
(ceph/23) main process ended, respawning
ceph-osd-03:ceph-osd.23.log
-11> 2015-08-27 11:37:53.054359 7fef96fa4700 1 --
10.21.0.23:6802/10219 <== osd.10 10.21.0.22:6800/3623 21 ====
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[]) v11 ==== 1226+0+15250 (4145454591 0 2457813372) 0x10229600
con 0xf885440
-10> 2015-08-27 11:37:53.054407 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054223, event: header_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-9> 2015-08-27 11:37:53.054420 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054226, event: throttled, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-8> 2015-08-27 11:37:53.054427 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054348, event: all_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-7> 2015-08-27 11:37:53.054434 7fef96fa4700 5 -- op tracker --
seq: 390, time: 0.000000, event: dispatched, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-6> 2015-08-27 11:37:53.054571 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054570, event: reached_pg, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-5> 2015-08-27 11:37:53.054606 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054606, event: started, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-4> 2015-08-27 11:37:53.054737 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054737, event: done, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-3> 2015-08-27 11:37:53.054832 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
scrub_compare_maps osd.23 has 25 items
-2> 2015-08-27 11:37:53.054877 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
scrub_compare_maps replica 10 has 25 items
-1> 2015-08-27 11:37:53.055207 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
0> 2015-08-27 11:37:53.060242 7fef9dcd9700 -1 osd/osd_types.cc: In
function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7fef9dcd9700 time 2015-08-27 11:37:53.055278
osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone))
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xbc2b8b]
2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79da36]
3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t,
std::pair<unsigned int, unsigned int>, std::less<hobject_t>,
std::allocator<std::pair<hobject_t const, std::pair<unsigned int,
unsigned int> > > > const&)+0xa2a) [0x88491a]
4: (PG::scrub_compare_maps()+0xd89) [0x7f3c39]
5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f6d9e]
6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f870e]
7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cca59]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb3d1e]
9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4dc0]
10: (()+0x8182) [0x7fefc0fb9182]
11: (clone()+0x6d) [0x7fefbf52447d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
root@ceph-admin:~# ceph -s
cluster d960d672-e035-413d-ba39-8341f4131760
health HEALTH_WARN
3 pgs backfill
2 pgs backfilling
5 pgs stuck unclean
2 requests are blocked > 32 sec
recovery 363/10993480 objects degraded (0.003%)
recovery 181602/10993480 objects misplaced (1.652%)
pool libvirt-pool pg_num 512 > pgp_num 412
monmap e1: 3 mons at
{ceph-mon1=10.20.0.11:6789/0,ceph-mon2=10.20.0.12:6789/0,ceph-mon3=10.20.0.13:6789/0}
election epoch 4780, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
osdmap e141113: 46 osds: 42 up, 42 in; 5 remapped pgs
pgmap v17289866: 1600 pgs, 4 pools, 20158 GB data, 5316 kobjects
42081 GB used, 32394 GB / 74476 GB avail
363/10993480 objects degraded (0.003%)
181602/10993480 objects misplaced (1.652%)
1595 active+clean
3 active+remapped+wait_backfill
2 active+remapped+backfilling
recovery io 83489 kB/s, 20 objects/s
client io 13079 kB/s rd, 24569 kB/s wr, 113 op/s
root@ceph-admin:~#
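One other thing we noticed in the health output above is "pool
libvirt-pool pg_num 512 > pgp_num 412". We had been planning to bring
pgp_num up to match pg_num once backfill settles, with something like
the command below; if that mismatch could be related to the asserts,
please let us know.

# raise pgp_num to match pg_num on the pool flagged in HEALTH_WARN
ceph osd pool set libvirt-pool pgp_num 512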