Re: Cascading Failure of OSDs


It looks like you may be able to work around the issue for the moment with

 ceph osd set nodeep-scrub

as it appears to be the scrub that is getting stuck?
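
For completeness, the flag can be verified and later cleared as follows
(ceph osd set noscrub would additionally pause regular scrubs, if those
turn out to be implicated as well):

 ceph osd dump | grep flags       # should now include nodeep-scrub
 ceph osd unset nodeep-scrub      # re-enable deep scrub once things settle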

sage


On Fri, 6 Mar 2015, Quentin Hartman wrote:

> Ceph health detail - http://pastebin.com/5URX9SsQ
> pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
> An OSD crash log (in a GitHub gist because it was too big for pastebin) -
> https://gist.github.com/qhartman/cb0e290df373d284cfb5
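> 
> (For reference, those pastes came from roughly the following; the grep
> filter is an approximation:)
> 
>     ceph health detail
>     ceph pg dump | grep -v 'active+clean'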
> 
> And now I've got four OSDs that are looping...
> 
> On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>       So I'm in the middle of trying to triage a problem with my ceph
>       cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>       The cluster has been running happily for about a year. This past
>       weekend, something caused the box running the MDS to seize hard,
>       and when we came in on Monday, several OSDs were down or
>       unresponsive. I brought the MDS and the OSDs back online, and
>       managed to get things running again with minimal data loss. I had
>       to mark a few objects as lost, but things were apparently
>       running fine by the end of the day on Monday.
> This afternoon, I noticed that one of the OSDs was apparently stuck in
> a crash/restart loop, and the cluster was unhappy. Performance was in
> the tank, and "ceph status" was reporting all manner of problems, as
> one would expect if an OSD were misbehaving. I marked the offending
> OSD out, and the cluster started rebalancing as expected. However, a
> short while later I noticed that another OSD had started into a
> crash/restart loop. So I repeated the process, and it happened again.
> At this point I noticed that there were actually two OSDs at a time in
> this state.
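> 
> (The mark-out cycle was essentially the following, with osd.12 as a
> placeholder for the actual id:)
> 
>     ceph osd out 12      # mark the crashing OSD out
>     ceph -w              # watch the rebalance / recovery progress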
> 
> It's as if there's some toxic chunk of data being passed around, and
> when it lands on an OSD it kills it. Contrary to that theory, however,
> I tried simply stopping an OSD while it was in a bad state, and once
> the cluster started trying to rebalance with that OSD down (and not
> previously marked out), another OSD started crash-looping.
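> 
> (A possible way to break the cycle, untested here, would be to freeze
> data movement while triaging; whether these flags actually help in
> this situation is an assumption:)
> 
>     ceph osd set noout        # don't auto-out OSDs that go down
>     ceph osd set nobackfill   # pause backfill
>     ceph osd set norecover    # pause recovery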
> 
> I've investigated the disk of the first OSD that hit this problem,
> and its file system shows no apparent corruption.
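> 
> (The check amounted to something like the following, assuming an XFS
> data disk; the device and mount point are placeholders:)
> 
>     umount /var/lib/ceph/osd/ceph-12
>     xfs_repair -n /dev/sdb1      # -n = inspect only, change nothing
>     smartctl -a /dev/sdb         # check for reallocated/pending sectors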
> 
> I'll follow up shortly with links to pastes of log snippets. Any input
> would be appreciated. This is turning into a real cascading failure,
> and I have no idea how to stop it.
> 
> QH
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
