osd going down every 15m blocking recovery from degraded state

christopher.thorjussen@xxxxxxxxxxxxxxxxxxxxxxx (Christopher Thorjussen) · Fri, 19 Sep 2014 09:28:46 +0200

On Fri, Sep 19, 2014 at 3:07 AM, Craig Lewis <clewis at centraldesktop.com>
wrote:

> The magic in Sage's steps was really setting noup.  That gives the OSD
> time to apply the osdmap changes, without starting the timeout.  Set noup,
> nodown, noout, restart the OSD, and wait until the CPU usage goes to zero.
> Some of mine took 5 minutes.  Once it's done, unset noup, and restart
> again.  The OSD should join the cluster, and not spin the CPU forever.
> Repeat for every OSD.
>
>
> The XFS params caused my OSDs to crash often enough to cause the big
> osdmap backlog.  I was seeing "XFS: possible memory allocation deadlock in
> kmem_alloc" in dmesg.  ceph.conf had
> [osd]
>    "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s
> size=4096"
>
> I fixed the problem by changing the config to
> [osd]
>    "osd mkfs options xfs": "-s size=4096"
>
> Then reformatted every OSD in my cluster (one at a time).  The -n size=64k
> was the problem.  It looks like the 3.14 kernels have a fix:
> http://tracker.ceph.com/issues/6301.  Upgrading the kernel might be less
> painful than reformatting everything.
>
>
> Thanks.
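
For anyone following along, here is a rough sketch of the flag handling in Craig's procedure above. It only builds (and by default just prints) the `ceph osd set` / `ceph osd unset` commands. The helper names are made up for illustration, and the OSD restart and the "wait until CPU usage goes to zero" step are left as manual steps because they depend on the init system and host.

```python
import subprocess

# Flags named in the quoted procedure: set them before restarting an OSD.
FLAGS = ("noup", "nodown", "noout")


def set_flags_commands(flags=FLAGS):
    """Build one 'ceph osd set <flag>' command per flag (hypothetical helper)."""
    return [["ceph", "osd", "set", flag] for flag in flags]


def rejoin_commands():
    """Build the command that lets the OSD be marked up again."""
    return [["ceph", "osd", "unset", "noup"]]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(set_flags_commands())
    # Manual step: restart the OSD and wait until its CPU usage drops to zero.
    run(rejoin_commands())
    # Manual step: restart the OSD again so it joins the cluster.
```

With the default `dry_run=True` nothing is executed, so the output can be reviewed before being run against a degraded cluster.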

It might also be that we hit a new bug:
https://github.com/ceph/ceph/commit/23876d73e30521ad4f1230e9533295660bc47f2d
Thanks to some consulting hours from Florian we've dug deeper, and this is
where we are at the moment. I'm waiting for the patch so we can verify.

Also, we tossed a 1000-second (!) snap trim sleep at the OSD during the
last night, along with a few restarts of OSDs that were spinning, and for
the first time in many days our system reports all PGs as active+clean. We
are still not there yet: OSDs are spinning, and
noout/noup/nodown/noscrub/nodeep-scrub are still in place. I'm waiting for
the patch and some feedback from Florian before we take out all the
"no" settings.
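
In case it helps someone in the same spot, here is a minimal sketch of the two operations mentioned above. It assumes the snap trim sleep is applied at runtime with `ceph tell ... injectargs` (the mail does not say how the value was set), and the helper names are hypothetical. The flags are listed in the same order as in the text; that order is not a recommendation.

```python
# Flags that are still set on the cluster, as listed above.
FLAGS_IN_PLACE = ["noout", "noup", "nodown", "noscrub", "nodeep-scrub"]


def snap_trim_sleep_command(seconds, target="osd.*"):
    """Build a runtime injectargs command for osd_snap_trim_sleep.

    Assumption: the value is injected at runtime rather than written to
    ceph.conf followed by a restart.
    """
    return ["ceph", "tell", target, "injectargs",
            f"--osd_snap_trim_sleep {seconds}"]


def unset_commands(flags=FLAGS_IN_PLACE):
    """Build one 'ceph osd unset <flag>' command per flag still in place."""
    return [["ceph", "osd", "unset", flag] for flag in flags]


if __name__ == "__main__":
    # Print only; nothing is executed against the cluster.
    print(" ".join(snap_trim_sleep_command(1000)))
    for cmd in unset_commands():
        print(" ".join(cmd))
```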

/Christopher