osd going down every 15m blocking recovery from degraded state

christopher.thorjussen@xxxxxxxxxxxxxxxxxxxxxxx (Christopher Thorjussen) · Tue, 16 Sep 2014 23:51:36 +0200

I've got several osds that are spinning at 100%.

I've retained some professional services to have a look. It's beyond my
newbie reach.

/Christopher

On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis <clewis at centraldesktop.com>
wrote:

> Is it using any CPU or Disk I/O during the 15 minutes?
>
> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
> christopher.thorjussen at onlinebackupcompany.com> wrote:
>
>> I'm waiting for my cluster to recover from a crashed disk and a second
>> osd that has been taken out (crushmap, rm, stopped).
>>
>> Now I'm stuck looking at this output ('ceph -w') while my osd.58 goes
>> down every 15 minutes.
>>
>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>> 148288/25473878 degraded (0.582%)
>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>> 148288/25473878 degraded (0.582%)
>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>> 148288/25473878 degraded (0.582%)
>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>> 148288/25473878 degraded (0.582%)
>>
>> Here is a log from when I restarted osd.58 through the next restart 15
>> minutes later: http://pastebin.com/rt64vx9M
>> In short, it just waits for 15 minutes without doing anything and then
>> goes down, putting lots of lines like this in the log for that osd:
>>
>> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
>> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1
>> l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
>> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1
>> l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>
>> Then I have to restart it. And it repeats.
>>
>> What should/can I do? Take it out?
>>
>> I've got 4 servers with 24 disks each. Details about servers:
>> http://pastebin.com/XQeSh8gJ
>> Running dumpling - 0.67.10
>>
>> Cheers,
>> Christopher Thorjussen
>>
>
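
A minimal Python sketch for checking whether recovery is actually making
progress, based on the pgmap lines quoted above: it pulls the degraded
object counts out of 'ceph -w' output and reports whether they change
between samples. The regular expression and the function names
(parse_pgmap, recovery_stalled) are assumptions fitted to the
dumpling-era line format shown in this thread; other Ceph releases may
format these lines differently, and each log entry must be joined onto a
single line first.

```python
import re

# Assumption: pattern fitted to the pgmap lines quoted in this thread
# (dumpling 0.67.x); it is not an official Ceph log format definition.
PGMAP_RE = re.compile(
    r"pgmap v(?P<version>\d+):.*?"
    r"\b(?P<degraded>\d+)/(?P<total>\d+) degraded"
)


def parse_pgmap(line):
    """Return (version, degraded, total), or None if the line does not match."""
    m = PGMAP_RE.search(line)
    if m is None:
        return None
    return (
        int(m.group("version")),
        int(m.group("degraded")),
        int(m.group("total")),
    )


def recovery_stalled(lines):
    """True if at least two pgmap samples were seen and the degraded
    object count is identical in all of them."""
    counts = [p[1] for p in map(parse_pgmap, lines) if p is not None]
    return len(counts) >= 2 and len(set(counts)) == 1


if __name__ == "__main__":
    import sys

    # Example: ceph -w | head -n 200 | python this_script.py
    samples = [line.rstrip("\n") for line in sys.stdin]
    print("stalled" if recovery_stalled(samples) else "progressing or too few samples")
```

On the four samples quoted above the degraded count stays at 148288 of
25473878 objects, so this check would flag recovery as stalled over that
interval. Four samples one second apart are a short window; sampling over
several minutes gives a more meaningful answer.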