I've been through your post many times (Google likes it ;). I've been trying all of the noout/nodown/noup flags. But I will look into the XFS issue you are talking about, and read the whole post one more time..
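For reference, the flags I've been toggling are just the standard cluster-wide ones (dumpling syntax), along the lines of:

    ceph osd set noout     # don't mark stopped OSDs "out", so no rebalancing kicks in
    ceph osd set nodown    # don't mark unresponsive OSDs "down"
    ceph osd set noup      # keep booting OSDs from being marked "up"
    ceph osd unset noout   # ...and "ceph osd unset <flag>" to clear each one again

So far none of that has changed the 15-minute pattern on osd.58.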
/C

On Wed, Sep 17, 2014 at 12:01 AM, Craig Lewis <clewis at centraldesktop.com> wrote:

> I ran into a similar issue before. I was having a lot of OSD crashes
> caused by XFS memory allocation deadlocks. My OSDs crashed so many times
> that they couldn't replay the OSD map before they would be marked
> unresponsive.
>
> See if this sounds familiar:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html
>
> If so, Sage's procedure to apply the osdmaps fixed my cluster:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
>
>
> On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen
> <christopher.thorjussen at onlinebackupcompany.com> wrote:
>
>> I've got several OSDs that are spinning at 100%.
>>
>> I've retained some professional services to have a look. It's out of my
>> newbie reach..
>>
>> /Christopher
>>
>> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis <clewis at centraldesktop.com>
>> wrote:
>>
>>> Is it using any CPU or disk I/O during the 15 minutes?
>>>
>>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen
>>> <christopher.thorjussen at onlinebackupcompany.com> wrote:
>>>
>>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>>> OSD that has been taken out (crushmap, rm, stopped).
>>>>
>>>> Now I'm stuck looking at this output ('ceph -w') while my osd.58
>>>> goes down every 15 minutes:
>>>>
>>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>>
>>>> Here is a log from when I restarted osd.58 through to the next restart
>>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>>> In short, it just waits for 15 minutes not doing anything and then goes
>>>> down, putting lots of lines like this in the log for that OSD:
>>>>
>>>> 2014-09-14 20:02:08.517727 7fbd3909a700 0 -- 10.47.18.33:6812/27234
>>>> >> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159
>>>> cs=1 l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>>>> 2014-09-14 20:02:08.519312 7fbd37b85700 0 -- 10.47.18.33:6812/27234
>>>> >> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370
>>>> cs=1 l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>>>
>>>> Then I have to restart it. And it repeats.
>>>>
>>>> What should/can I do? Take it out?
>>>>
>>>> I've got 4 servers with 24 disks each.
>>>> Details about servers: http://pastebin.com/XQeSh8gJ
>>>> Running dumpling - 0.67.10
>>>>
>>>> Cheers,
>>>> Christopher Thorjussen
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
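(Bottom-posting one note for the archives: this is my rough reading of the map catch-up approach from Sage's thread linked above, not a quote of his procedure, and the admin socket path below is the default one and may differ on other setups or versions.)

    # keep OSDs from being marked "up" while they chew through old osdmaps
    ceph osd set noup

    # on the node hosting osd.58, see how far behind it is on maps;
    # "oldest_map" / "newest_map" in the output should climb toward the
    # cluster's current epoch (compare with "ceph osd stat")
    ceph --admin-daemon /var/run/ceph/ceph-osd.58.asok status

    # once it has caught up, let it join again
    ceph osd unset noup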