Hi,

Thanks for the tip. I applied these configuration settings and they do
lower the load during rebuilding a bit. Are there settings like these
that also tune Ceph down a bit during regular operations? The slow
requests, timeouts and OSD suicides are killing me. (A consolidated
sketch of the throttling options I've found so far is at the bottom of
this mail.)

If I allow the cluster to regain consciousness and stay idle for a bit,
it all seems to settle down nicely, but as soon as I apply some load it
immediately starts to overstress and complain like crazy.

I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844

This was reported by Dmitry Smirnov 26 days ago, but the report has had
no response yet. Any ideas?

In my experience, OSDs are quite unstable in Giant and very easily
stressed, which causes chain effects that worsen the issues further. It
would be nice to know whether other users are seeing this too.

Thanks,

Erik.

On 11/10/2014 08:40 PM, Craig Lewis wrote:
> Have you tuned any of the recovery or backfill parameters? My ceph.conf
> has:
>
> [osd]
> osd max backfills = 1
> osd recovery max active = 1
> osd recovery op priority = 1
>
> Still, if it's running for a few hours and then failing, it sounds like
> there might be something else at play. OSDs use a lot of RAM during
> recovery. How much RAM and how many OSDs do you have in these nodes?
> What does memory usage look like after a fresh restart, and what does
> it look like when the problems start? Even better if you know what it
> looks like 5 minutes before the problems start.
>
> Is there anything interesting in the kernel logs? OOM killers, or
> memory deadlocks?
>
>
> On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg <erik@xxxxxxxxxxxxx> wrote:
>
>     Hi,
>
>     I have some OSDs that keep committing suicide. My cluster has ~1.3M
>     misplaced objects, and it can't really recover, because OSDs keep
>     failing before recovery finishes. The load on the hosts is quite
>     high, but the cluster currently has no tasks other than the
>     backfilling/recovering.
>
>     I attached the log file from a failed OSD. It shows the suicide,
>     the recent events, and also me starting the OSD again after some
>     time.
>
>     It'll keep running for a couple of hours and then fail again, for
>     the same reason.
>
>     I noticed a lot of timeouts. Apparently Ceph stresses the hosts to
>     the limit with the recovery tasks, so much that they time out and
>     can't finish. I don't understand why. Can I somehow throttle Ceph
>     a bit so that it doesn't keep overrunning itself? I kinda feel like
>     it should chill out a bit and simply recover one step at a time
>     instead of going full force and then failing.
>
>     Thanks,
>
>     Erik.
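P.S. For reference, combining Craig's settings with the other throttling
knobs I could find gives an [osd] section like the sketch below. This is
a sketch only -- the extra options are ones I believe exist in Giant,
but the names and defaults should be verified against the configuration
reference for your exact version before applying anything:

[osd]
# Craig's suggestions: limit concurrent backfill/recovery work per OSD
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

# Additional knobs (assumed present in Giant -- please verify):
# prefer client I/O over recovery I/O (client default 63, recovery 10)
osd client op priority = 63
# keep recovery on a single thread
osd recovery threads = 1
# at most one scrub per OSD, and skip scrubbing when the load is high
osd max scrubs = 1
osd scrub load threshold = 0.5
# give heartbeats more slack before an OSD is marked down (default 20s)
osd heartbeat grace = 30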
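Note that none of this raises the internal suicide timeout itself; it
just tries to keep the OSDs below the load level where the heartbeat
checks start failing. The recovery settings can also be changed at
runtime, without restarting the OSDs, e.g.:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'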