osd going down every 15m blocking recovery from degraded state

clewis@xxxxxxxxxxxxxxxxxx (Craig Lewis) · Tue, 16 Sep 2014 14:23:39 -0700

Is it using any CPU or Disk I/O during the 15 minutes?
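
One rough way to check, assuming a Linux host and that you know the PID of the ceph-osd process for osd.58: sample the kernel's per-process counters in /proc twice and look at the difference. The sketch below is only an illustration; the function names (parse_cpu_ticks, parse_io_bytes, sample) are made up for this example and are not part of Ceph. Reading /proc/<pid>/io for a daemon normally requires root.

```python
import os
import sys
import time


def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # ')', so split on the LAST ')' before splitting the numeric fields.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def parse_io_bytes(io_text):
    """Return (read_bytes, write_bytes) from the text of /proc/<pid>/io."""
    counters = {}
    for line in io_text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters["read_bytes"], counters["write_bytes"]


def _read(path):
    with open(path) as f:
        return f.read()


def sample(pid, interval=10.0):
    """Return (cpu_seconds, bytes_read, bytes_written) used over `interval` seconds."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")

    def snapshot():
        cpu = parse_cpu_ticks(_read(f"/proc/{pid}/stat"))
        rd, wr = parse_io_bytes(_read(f"/proc/{pid}/io"))
        return cpu, rd, wr

    c0, r0, w0 = snapshot()
    time.sleep(interval)
    c1, r1, w1 = snapshot()
    return (c1 - c0) / ticks_per_sec, r1 - r0, w1 - w0


if __name__ == "__main__":
    # Usage (as root): python osd_activity.py <pid-of-ceph-osd>
    cpu_s, rd, wr = sample(int(sys.argv[1]))
    print(f"cpu={cpu_s:.2f}s read={rd}B written={wr}B")
```

If all three numbers stay near zero for the whole 15 minutes, the daemon is most likely blocked waiting on something rather than working through recovery.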

On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
christopher.thorjussen at onlinebackupcompany.com> wrote:

> I'm waiting for my cluster to recover from a crashed disk and a second osd
> that has been taken out (crushmap, rm, stopped).
>
> Now I'm stuck looking at this output ('ceph -w') while my osd.58 goes
> down every 15 minutes.
>
> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
> active, 23888 active+clean, 2 active+remapped+backfilling, 301
> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
> 148288/25473878 degraded (0.582%)
>
> Here is a log from when I restarted osd.58 and through the next reboot 15
> minutes later: http://pastebin.com/rt64vx9M
> In short, it just waits for 15 minutes without doing anything and then goes
> down, writing lots of lines like this to the log for that osd:
>
> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1 l=0
> c=0x35bcf1e0).fault with nothing to send, going to standby
> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1
> l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>
> Then I have to restart it. And it repeats.
>
> What should/can I do? Take it out?
>
> I've got 4 servers with 24 disks each. Details about servers:
> http://pastebin.com/XQeSh8gJ
> Running dumpling - 0.67.10
>
> Cheers,
> Christopher Thorjussen