Fellow Ceph Users,
I have 3 OSD nodes and 3 MONs on separate servers. Our storage was near
full on some OSDs, so we added additional drives, almost doubling our
space. Since then we have been getting OSDs that keep respawning. We
also added RAM to the OSD nodes, going from 12G to 24G. It started with
one OSD respawning, so we reweighted it to 0 and then removed it from
the CRUSH map (the rough commands we used are below). Then another OSD
started doing the same thing, and we repeated the process. We are now
on the 5th drive. Some of these drives are new, so we suspect something
else is going on besides bad/corrupt drives. Can someone please help?
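For reference, the drain/remove steps we ran for each failing OSD were
roughly the following (osd.N is a placeholder; the actual IDs varied
each time, and the stop command is the upstart syntax matching the
init lines in the syslog below):

# drain the failing OSD by reweighting it to 0 in the CRUSH map
ceph osd crush reweight osd.N 0
# once backfill settles, mark it out and stop the daemon on the node
ceph osd out N
stop ceph-osd id=N
# remove it from the CRUSH map, auth database, and OSD map
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N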
Here are some snippets from the log files. If there is anything else
you want to see, let me know.
Thanks,
Chris
ceph-osd-03:syslog
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744712] init: ceph-osd
(ceph/23) main process (458) killed by ABRT signal
Aug 27 11:33:37 ceph-osd-03 kernel: [380304.744736] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871768] init: ceph-osd
(ceph/23) main process (938) killed by ABRT signal
Aug 27 11:33:49 ceph-osd-03 kernel: [380315.871791] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527056] init: ceph-osd
(ceph/23) main process (1463) killed by ABRT signal
Aug 27 11:34:00 ceph-osd-03 kernel: [380327.527079] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159178] init: ceph-osd
(ceph/23) main process (1963) killed by ABRT signal
Aug 27 11:34:13 ceph-osd-03 kernel: [380340.159228] init: ceph-osd
(ceph/23) main process ended, respawning
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843268] init: ceph-osd
(ceph/23) main process (2478) killed by ABRT signal
Aug 27 11:34:24 ceph-osd-03 kernel: [380350.843282] init: ceph-osd
(ceph/23) main process ended, respawning
ceph-osd-03:ceph-osd.23.log
-11> 2015-08-27 11:37:53.054359 7fef96fa4700 1 --
10.21.0.23:6802/10219 <== osd.10 10.21.0.22:6800/3623 21 ====
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[]) v11 ==== 1226+0+15250 (4145454591 0 2457813372) 0x10229600
con 0xf885440
-10> 2015-08-27 11:37:53.054407 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054223, event: header_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-9> 2015-08-27 11:37:53.054420 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054226, event: throttled, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-8> 2015-08-27 11:37:53.054427 7fef96fa4700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054348, event: all_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-7> 2015-08-27 11:37:53.054434 7fef96fa4700 5 -- op tracker --
seq: 390, time: 0.000000, event: dispatched, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-6> 2015-08-27 11:37:53.054571 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054570, event: reached_pg, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-5> 2015-08-27 11:37:53.054606 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054606, event: started, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-4> 2015-08-27 11:37:53.054737 7fef9ecdb700 5 -- op tracker --
seq: 390, time: 2015-08-27 11:37:53.054737, event: done, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-3> 2015-08-27 11:37:53.054832 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
scrub_compare_maps osd.23 has 25 items
-2> 2015-08-27 11:37:53.054877 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
scrub_compare_maps replica 10 has 25 items
-1> 2015-08-27 11:37:53.055207 7fef9dcd9700 2 osd.23 pg_epoch:
141062 pg[3.f9( v 141062'10774941 (137054'10771888,141062'10774941]
local-les=141062 n=11076 ec=101 les/c 141062/141062
141061/141061/141061) [23,10] r=0 lpr=141061 crt=141053'10774936 lcod
141062'10774940 mlcod 141062'10774940 active+clean+scrubbing+deep]
0> 2015-08-27 11:37:53.060242 7fef9dcd9700 -1 osd/osd_types.cc: In
function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7fef9dcd9700 time 2015-08-27 11:37:53.055278
osd/osd_types.cc: 4074: FAILED assert(clone_size.count(clone))
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xbc2b8b]
2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x79da36]
3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t,
std::pair<unsigned int, unsigned int>, std::less<hobject_t>,
std::allocator<std::pair<hobject_t const, std::pair<unsigned int,
unsigned int> > > > const&)+0xa2a) [0x88491a]
4: (PG::scrub_compare_maps()+0xd89) [0x7f3c39]
5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x7f6d9e]
6: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7f870e]
7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cca59]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb3d1e]
9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4dc0]
10: (()+0x8182) [0x7fefc0fb9182]
11: (clone()+0x6d) [0x7fefbf52447d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
root@ceph-admin:~# ceph -s
cluster d960d672-e035-413d-ba39-8341f4131760
health HEALTH_WARN
3 pgs backfill
2 pgs backfilling
5 pgs stuck unclean
2 requests are blocked > 32 sec
recovery 363/10993480 objects degraded (0.003%)
recovery 181602/10993480 objects misplaced (1.652%)
pool libvirt-pool pg_num 512 > pgp_num 412
monmap e1: 3 mons at
{ceph-mon1=10.20.0.11:6789/0,ceph-mon2=10.20.0.12:6789/0,ceph-mon3=10.20.0.13:6789/0}
election epoch 4780, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
osdmap e141113: 46 osds: 42 up, 42 in; 5 remapped pgs
pgmap v17289866: 1600 pgs, 4 pools, 20158 GB data, 5316 kobjects
42081 GB used, 32394 GB / 74476 GB avail
363/10993480 objects degraded (0.003%)
181602/10993480 objects misplaced (1.652%)
1595 active+clean
3 active+remapped+wait_backfill
2 active+remapped+backfilling
recovery io 83489 kB/s, 20 objects/s
client io 13079 kB/s rd, 24569 kB/s wr, 113 op/s
root@ceph-admin:~#
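One other thing we noticed in the health output above is "pool
libvirt-pool pg_num 512 > pgp_num 412". We had been planning to bring
pgp_num up to match pg_num once backfill settles, with something like
the command below; if that mismatch could be related to the asserts,
please let us know.

# raise pgp_num to match pg_num on the pool flagged in HEALTH_WARN
ceph osd pool set libvirt-pool pgp_num 512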