Hi,

Thanks for the tip. I applied these configuration settings and they do
lower the load during rebuilding a bit. Are there settings like these
that also tune Ceph down a bit during regular operations? The slow
requests, timeouts and OSD suicides are killing me. (A consolidated
sketch of the throttling options I've found so far is at the bottom of
this mail.)

If I allow the cluster to regain consciousness and stay idle for a bit,
it all seems to settle down nicely, but as soon as I apply some load it
immediately starts to overstress and complain like crazy.

I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844

This was reported by Dmitry Smirnov 26 days ago, but the report has had
no response yet. Any ideas?

In my experience, OSDs are quite unstable in Giant and very easily
stressed, which causes chain effects that worsen the issues further. It
would be nice to know whether other users are seeing this too.

Thanks,

Erik.

On 11/10/2014 08:40 PM, Craig Lewis wrote:
> Have you tuned any of the recovery or backfill parameters? My ceph.conf
> has:
>
> [osd]
> osd max backfills = 1
> osd recovery max active = 1
> osd recovery op priority = 1
>
> Still, if it's running for a few hours and then failing, it sounds like
> there might be something else at play. OSDs use a lot of RAM during
> recovery. How much RAM and how many OSDs do you have in these nodes?
> What does memory usage look like after a fresh restart, and what does
> it look like when the problems start? Even better if you know what it
> looks like 5 minutes before the problems start.
>
> Is there anything interesting in the kernel logs? OOM killers, or
> memory deadlocks?
>
>
> On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg <erik@xxxxxxxxxxxxx> wrote:
>
>     Hi,
>
>     I have some OSDs that keep committing suicide. My cluster has ~1.3M
>     misplaced objects, and it can't really recover, because OSDs keep
>     failing before recovery finishes. The load on the hosts is quite
>     high, but the cluster currently has no tasks other than the
>     backfilling/recovering.
>
>     I attached the log file from a failed OSD. It shows the suicide,
>     the recent events, and also me starting the OSD again after some
>     time.
>
>     It'll keep running for a couple of hours and then fail again, for
>     the same reason.
>
>     I noticed a lot of timeouts. Apparently Ceph stresses the hosts to
>     the limit with the recovery tasks, so much that they time out and
>     can't finish. I don't understand why. Can I somehow throttle Ceph
>     a bit so that it doesn't keep overrunning itself? I kinda feel like
>     it should chill out a bit and simply recover one step at a time
>     instead of going full force and then failing.
>
>     Thanks,
>
>     Erik.
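P.S. For reference, combining Craig's settings with the other throttling
knobs I could find gives an [osd] section like the sketch below. This is
a sketch only -- the extra options are ones I believe exist in Giant,
but the names and defaults should be verified against the configuration
reference for your exact version before applying anything:

[osd]
# Craig's suggestions: limit concurrent backfill/recovery work per OSD
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

# Additional knobs (assumed present in Giant -- please verify):
# prefer client I/O over recovery I/O (client default 63, recovery 10)
osd client op priority = 63
# keep recovery on a single thread
osd recovery threads = 1
# at most one scrub per OSD, and skip scrubbing when the load is high
osd max scrubs = 1
osd scrub load threshold = 0.5
# give heartbeats more slack before an OSD is marked down (default 20s)
osd heartbeat grace = 30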
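Note that none of this raises the internal suicide timeout itself; it
just tries to keep the OSDs below the load level where the heartbeat
checks start failing. The recovery settings can also be changed at
runtime, without restarting the OSDs, e.g.:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'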