osd going down every 15m blocking recovery from degraded state

clewis@xxxxxxxxxxxxxxxxxx (Craig Lewis) · Thu, 18 Sep 2014 18:07:36 -0700

The magic in Sage's steps was really setting noup.  That gives the OSD time
to apply the osdmap changes without starting the timeout.  Set noup,
nodown, and noout, restart the OSD, and wait until its CPU usage drops to
zero.  Some of mine took 5 minutes.  Once it's done, unset noup and restart
again.  The OSD should then join the cluster instead of spinning the CPU
forever.  Repeat for every OSD.
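
The steps above can be sketched as a small script.  This is only an
illustration, not Sage's exact procedure: the function names are made up,
and the restart command (sysvinit-style "service ceph restart osd.N") is an
assumption that depends on your distro and init system.  How you decide
that the CPU has gone idle is left to a callable you supply.

```python
import subprocess

WAIT_IDLE = "wait-idle"  # marker step: pause until the OSD stops using CPU


def noup_restart_steps(osd_id, restart_prefix=("service", "ceph", "restart")):
    """Return the ordered steps for one OSD.

    Each step is either an argv list to execute or the WAIT_IDLE marker.
    restart_prefix is an assumption (sysvinit-style); change it to match
    however OSDs are restarted on your hosts.
    """
    restart = list(restart_prefix) + ["osd.%d" % osd_id]
    return [
        ["ceph", "osd", "set", "noup"],
        ["ceph", "osd", "set", "nodown"],
        ["ceph", "osd", "set", "noout"],
        restart,                      # first restart: OSD catches up on osdmaps
        WAIT_IDLE,                    # took about 5 minutes on some OSDs
        ["ceph", "osd", "unset", "noup"],
        restart,                      # second restart: OSD is allowed to come up
    ]


def apply_steps(steps, wait_idle, run=subprocess.check_call):
    """Execute the steps in order, calling wait_idle() at the marker.

    nodown and noout are left set on purpose; unset them yourself once
    every OSD has been through the procedure.
    """
    for step in steps:
        if step == WAIT_IDLE:
            wait_idle()
        else:
            run(step)
```

For example, apply_steps(noup_restart_steps(58), wait_idle=lambda:
input("Press Enter when the CPU is idle ")) would walk osd.58 through the
sequence and pause for the operator between the two restarts.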

The XFS params caused my OSDs to crash often enough to cause the big osdmap
backlog.  I was seeing "XFS: possible memory allocation deadlock in
kmem_alloc" in dmesg.  ceph.conf had
[osd]
   "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s size=4096"

I fixed the problem by changing the config to
[osd]
   "osd mkfs options xfs": "-s size=4096"

Then I reformatted every OSD in my cluster (one at a time).  The -n size=64k
was the problem.  It looks like the 3.14 kernels have a fix:
http://tracker.ceph.com/issues/6301.  Upgrading the kernel might be less
painful than reformatting everything.
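
If you want to check whether your own mkfs options set the directory block
size, something like the following would do.  It is a rough sketch: the
function names are made up, it only understands the "-n size=..." form with
a space after -n, and it only handles the k/m/g suffixes.

```python
import shlex


def parse_size(text):
    """Convert a size such as '64k' or '4096' to bytes (k/m/g suffixes only)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def xfs_dir_block_size(mkfs_options):
    """Return the '-n size=' value in bytes, or None if it is not set."""
    tokens = shlex.split(mkfs_options)
    # Walk (flag, value) pairs and look for the naming option group.
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-n":
            for sub in value.split(","):
                key, _, val = sub.partition("=")
                if key == "size":
                    return parse_size(val)
    return None


old_opts = "-l size=1024m -n size=64k -i size=2048 -s size=4096"
new_opts = "-s size=4096"
print(xfs_dir_block_size(old_opts))  # 65536
print(xfs_dir_block_size(new_opts))  # None
```

The two strings are the before and after values from my ceph.conf.  Keep in
mind this only reads the config; OSDs that were already formatted with the
old options keep their on-disk directory block size until they are
reformatted.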

On Tue, Sep 16, 2014 at 3:19 PM, Christopher Thorjussen <
christopher.thorjussen at onlinebackupcompany.com> wrote:

> I've been through your post many times (Google likes it ;)
> I've been trying all the noout/nodown/noup.
> But I will look into the XFS issue you are talking about. And read all of
> the post one more time..
>
> /C
>
>
> On Wed, Sep 17, 2014 at 12:01 AM, Craig Lewis <clewis at centraldesktop.com>
> wrote:
>
>> I ran into a similar issue before.  I was having a lot of OSD crashes
>> caused by XFS memory allocation deadlocks.  My OSDs crashed so many times
>> that they couldn't replay the OSD Map before they would be marked
>> unresponsive.
>>
>> See if this sounds familiar:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html
>>
>> If so, Sage's procedure to apply the osdmaps fixed my cluster:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
>>
>>
>>
>>
>>
>> On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen <
>> christopher.thorjussen at onlinebackupcompany.com> wrote:
>>
>>> I've got several osds that are spinning at 100%.
>>>
>>> I've retained some professional services to have a look. It's out of my
>>> newbie reach.
>>>
>>> /Christopher
>>>
>>> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis <clewis at centraldesktop.com>
>>> wrote:
>>>
>>>> Is it using any CPU or Disk I/O during the 15 minutes?
>>>>
>>>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen <
>>>> christopher.thorjussen at onlinebackupcompany.com> wrote:
>>>>
>>>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>>>> osd that has been taken out (crushmap, rm, stopped).
>>>>>
>>>>> Now I'm stuck at looking at this output ('ceph -w') while my osd.58
>>>>> goes down every 15 minutes.
>>>>>
>>>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>>> 148288/25473878 degraded (0.582%)
>>>>>
>>>>> Here is a log from when I restarted osd.58 and through the next reboot
>>>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>>>> In short, it just waits for 15 minutes without doing anything and then
>>>>> goes down, putting lots of lines like this in the log for that osd:
>>>>>
>>>>> 2014-09-14 20:02:08.517727 7fbd3909a700  0 -- 10.47.18.33:6812/27234
>>>>> >> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159
>>>>> cs=1 l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>>>>> 2014-09-14 20:02:08.519312 7fbd37b85700  0 -- 10.47.18.33:6812/27234
>>>>> >> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370
>>>>> cs=1 l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>>>>
>>>>> Then I have to restart it. And it repeats.
>>>>>
>>>>> What should/can I do? Take it out?
>>>>>
>>>>> I've got 4 servers with 24 disks each. Details about servers:
>>>>> http://pastebin.com/XQeSh8gJ
>>>>> Running dumpling - 0.67.10
>>>>>
>>>>> Cheers,
>>>>> Christopher Thorjussen