osd going down every 15m blocking recovery from degraded state

christopher.thorjussen@xxxxxxxxxxxxxxxxxxxxxxx (Christopher Thorjussen) · Sun, 14 Sep 2014 20:34:47 +0200

I'm waiting for my cluster to recover from a crashed disk and a second osd
that has been taken out (crushmap, rm, stopped).

Now I'm stuck looking at this output ('ceph -w') while my osd.58 goes
down every 15 minutes.

2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
active, 23888 active+clean, 2 active+remapped+backfilling, 301
active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
148288/25473878 degraded (0.582%)
2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
active, 23888 active+clean, 2 active+remapped+backfilling, 301
active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
148288/25473878 degraded (0.582%)
2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
active, 23888 active+clean, 2 active+remapped+backfilling, 301
active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
148288/25473878 degraded (0.582%)
2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
active, 23888 active+clean, 2 active+remapped+backfilling, 301
active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
148288/25473878 degraded (0.582%)
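
To confirm that the degraded count is not moving at all between pgmap
updates, something like the following sketch could be used. It assumes each
pgmap entry has been unwrapped onto a single line in the format shown above.
The names parse_pgmap and recovery_stalled are made up for illustration and
are not part of Ceph.

```python
import re

# Assumption: matches the pgmap line format seen in the 'ceph -w' output above.
PGMAP_RE = re.compile(
    r"pgmap v(?P<version>\d+):.*?(?P<degraded>\d+)/(?P<total>\d+) degraded"
)


def parse_pgmap(line):
    """Return (version, degraded, total) from one pgmap line, or None."""
    m = PGMAP_RE.search(line)
    if m is None:
        return None
    return (
        int(m.group("version")),
        int(m.group("degraded")),
        int(m.group("total")),
    )


def recovery_stalled(lines):
    """True if every parsed line reports the same degraded object count."""
    counts = {p[1] for p in map(parse_pgmap, lines) if p is not None}
    return len(counts) == 1


if __name__ == "__main__":
    # Two entries copied from the output above, each joined onto one line.
    sample = [
        "2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: "
        "1 active, 23888 active+clean, 2 active+remapped+backfilling, "
        "301 active+degraded; 36677 GB data, 93360 GB used, "
        "250 TB / 341 TB avail; 148288/25473878 degraded (0.582%)",
        "2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: "
        "1 active, 23888 active+clean, 2 active+remapped+backfilling, "
        "301 active+degraded; 36677 GB data, 93360 GB used, "
        "250 TB / 341 TB avail; 148288/25473878 degraded (0.582%)",
    ]
    for entry in sample:
        version, degraded, total = parse_pgmap(entry)
        print(version, degraded, total, round(100.0 * degraded / total, 3))
    print("stalled:", recovery_stalled(sample))
```

Fed the lines above, this reports the same 148288 degraded objects for every
pgmap version, which matches what I see by eye: recovery is not progressing.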

Here is a log covering the time from when I restarted osd.58 until it went
down again 15 minutes later: http://pastebin.com/rt64vx9M
In short, it just waits for 15 minutes without doing anything and then goes
down, putting lots of lines like this in that osd's log:

2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1 l=0
c=0x35bcf1e0).fault with nothing to send, going to standby
2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1 l=0
c=0x36cc4f20).fault with nothing to send, going to standby

Then I have to restart it. And it repeats.

What should/can I do? Take it out?

I've got 4 servers with 24 disks each. Details about the servers:
http://pastebin.com/XQeSh8gJ
I'm running dumpling (0.67.10).

Cheers,
Christopher Thorjussen