On Fri, Sep 19, 2014 at 3:07 AM, Craig Lewis <clewis at centraldesktop.com> wrote:

> The magic in Sage's steps was really setting noup. That gives the OSD
> time to apply the osdmap changes, without starting the timeout. Set noup,
> nodown, noout, restart the OSD, and wait until the CPU usage goes to zero.
> Some of mine took 5 minutes. Once it's done, unset noup, and restart
> again. The OSD should join the cluster, and not spin the CPU forever.
> Repeat for every OSD.
>
> The XFS params caused my OSDs to crash often enough to cause the big
> osdmap backlog. I was seeing "XFS: possible memory allocation deadlock in
> kmem_alloc" in dmesg. ceph.conf had
> [osd]
> "osd mkfs options xfs": "-l size=1024m -n size=64k -i size=2048 -s size=4096"
>
> I fixed the problem by changing the config to
> [osd]
> "osd mkfs options xfs": "-s size=4096"
>
> Then reformatted every OSD in my cluster (one at a time). The -n size=64k
> was the problem. It looks like the 3.14 kernels have a fix:
> http://tracker.ceph.com/issues/6301. Upgrading the kernel might be less
> painful than reformatting everything.

Thanks. It might also be that we hit a new bug:
https://github.com/ceph/ceph/commit/23876d73e30521ad4f1230e9533295660bc47f2d

Thanks to some consulting hours from Florian we've dug deeper, and this is
where we are at the moment; I'm waiting for the patch to verify.

Also, we tossed a 1000-second (!) snap trim sleep at the OSD last night,
with a few restarts of OSDs that were spinning, and for the first time in
many days our system reports all PGs as active+clean.

We are not there yet: OSDs are still spinning, and
noout/noup/nodown/noscrub/nodeep-scrub are still in place. I'm waiting for
the patch and some feedback from Florian before we unset all of the "no"
flags.

/Christopher
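
For reference, the restart procedure Craig describes in the quoted text can be sketched as a small Python helper that only builds the list of commands, so it can be reviewed before anything is run. This is a rough sketch, not a tested recovery tool: the function names are made up for illustration, the `service ceph restart osd.N` form is an assumption about the init system (substitute whatever your distribution uses), and the "wait" step is left as a comment because how you watch CPU usage is site-specific.

```python
from typing import Callable, Iterable, List


def sysvinit_restart(osd_id: int) -> str:
    """Hypothetical restart command (assumes a sysvinit-style ceph service)."""
    return f"service ceph restart osd.{osd_id}"


def osd_catchup_plan(
    osd_ids: Iterable[int],
    restart: Callable[[int], str] = sysvinit_restart,
) -> List[str]:
    """Build the per-OSD command sequence from the quoted steps.

    Nothing is executed; the result is a list of shell lines to review.
    """
    plan: List[str] = []
    for osd_id in osd_ids:
        # Set the flags so the OSD can replay osdmaps without being marked up.
        # They are re-issued for each OSD, following the steps as written.
        for flag in ("noup", "nodown", "noout"):
            plan.append(f"ceph osd set {flag}")
        plan.append(restart(osd_id))
        # Manual step: the quoted message reports up to 5 minutes per OSD.
        plan.append(f"# wait until the ceph-osd process for osd.{osd_id} is at ~0% CPU")
        # Let the OSD rejoin the cluster on the second restart.
        plan.append("ceph osd unset noup")
        plan.append(restart(osd_id))
    return plan


if __name__ == "__main__":
    print("\n".join(osd_catchup_plan([0, 1])))
```

Note that the plan leaves nodown and noout set at the end; the quoted steps do not say when to clear them, so that decision is left to the operator.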