osd going down every 15m blocking recovery from degraded state

I ran into a similar issue before.  I was having a lot of OSD crashes
caused by XFS memory allocation deadlocks.  My OSDs crashed so many times
that they couldn't replay the OSD maps before being marked unresponsive.

See if this sounds familiar:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html

If so, Sage's procedure to apply the osdmaps fixed my cluster:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
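
The heavy lifting is in Sage's post, but for reference the map-fetching part
looks roughly like this.  This is only a sketch: FIRST and LAST are
placeholders for the epoch range you'd read out of the failing OSD's log, and
it assumes a reachable mon quorum plus the ceph CLI with admin credentials.

  # current map epoch according to the cluster (printed as eNNNN)
  ceph osd stat

  # pull a range of historical osdmap epochs into local files
  for e in $(seq $FIRST $LAST); do
      ceph osd getmap $e -o osdmap.$e
  done

Where those files then need to go so that the stuck OSD can pick them up is
version-specific, so follow Sage's instructions rather than this sketch for
that part.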





On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen <
christopher.thorjussen at onlinebackupcompany.com> wrote:

> I've got several osds that are spinning at 100%.
>
> I've retained some professional services to have a look. It's out of my
> newbie reach.
>
> /Christopher
>
> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis <clewis at centraldesktop.com>
> wrote:
>
>> Is it using any CPU or Disk I/O during the 15 minutes?
>>
>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
>> christopher.thorjussen at onlinebackupcompany.com> wrote:
>>
>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>> osd that has been taken out (crushmap, rm, stopped).
>>>
>>> Now I'm stuck looking at this output ('ceph -w') while my osd.58 goes
>>> down every 15 minutes.
>>>
>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>> 148288/25473878 degraded (0.582%)
>>>
>>> Here is a log from when I restarted osd.58 and through the next reboot
>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>> In short, it just waits for 15 minutes not doing anything and then goes
>>> down, putting lots of lines like this in the log for that osd:
>>>
>>> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234 >>
>>> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159 cs=1
>>> l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>>> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234 >>
>>> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370 cs=1
>>> l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>>
>>> Then I have to restart it. And it repeats.
>>>
>>> What should/can I do? Take it out?
>>>
>>> I've got 4 servers with 24 disks each. Details about servers:
>>> http://pastebin.com/XQeSh8gJ
>>> Running dumpling - 0.67.10
>>>
>>> Cheers,
>>> Christopher Thorjussen
>>>
>>
>
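
On the CPU / disk I/O question in the quoted thread above: a quick way to
watch osd.58 through one of those 15-minute windows would be something like
the following (it assumes the sysstat tools are installed and the usual
sysvinit pid file path; adjust for your layout):

  # pid of the osd.58 daemon (pid file path assumed; ps/pgrep works too)
  pid=$(cat /var/run/ceph/osd.58.pid)

  # per-process CPU and disk I/O, sampled every 5 seconds
  pidstat -u -d -p "$pid" 5

  # per-device utilization of the disk backing osd.58
  iostat -x 5

If the daemon sits completely idle for the whole window before being marked
down, that is more consistent with it being stuck (for example behind on
maps, as described at the top of this mail) than with a slow or failing disk.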