Re: Cascading Failure of OSDs

Finally found an error that seems to provide some direction:

-1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does not match object info size (4120576) ajusted for ondisk to (4120576)

I'm diving into google now and hoping for something useful. If anyone has a suggestion, I'm all ears!
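
For anyone following along, a rough sketch of how the object in that error could be checked; the filestore paths and the repair step are assumptions based on a stock layout, nothing verified on this cluster:

 ceph pg map 3.18e         # shows which OSDs hold pg 3.18e
 # on each of those OSDs, compare the on-disk file size against 4120576
 find /var/lib/ceph/osd/ceph-*/current/3.18e_head -name '*data.3f7a2ae8944a*16c8*' -ls
 # repair resyncs the copies, but it tends to trust the primary, so confirm which replica is good first
 ceph pg repair 3.18e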

QH

On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
Thanks for the suggestion, but that doesn't seem to have made a difference.

I've shut the entire cluster down and brought it back up, and my config management system appears to have upgraded Ceph to 0.80.8 during the reboot. Everything came back up, but I am still seeing the crash loops, which suggests this is something persistent, probably tied to the OSD data rather than some weird transient state.


On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
It looks like you may be able to work around the issue for the moment with

 ceph osd set nodeep-scrub

as it looks like it is scrub that is getting stuck?
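
For completeness, the matching flags and how to clear them again later (stock ceph CLI, noted just as a reference):

 ceph osd set noscrub          # optionally pause regular scrubs as well
 ceph osd unset nodeep-scrub   # re-enable deep scrubs once things settle
 ceph osd unset noscrub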

sage


On Fri, 6 Mar 2015, Quentin Hartman wrote:

> Ceph health detail - http://pastebin.com/5URX9SsQ
> pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
> an OSD crash log (in a GitHub gist because it was too big for pastebin) -
> https://gist.github.com/qhartman/cb0e290df373d284cfb5
>
> And now I've got four OSDs that are looping.....
>
> On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>       So I'm in the middle of trying to triage a problem with my Ceph
>       cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>       The cluster has been running happily for about a year. This last
>       weekend, something caused the box running the MDS to seize hard,
>       and when we came in on Monday, several OSDs were down or
>       unresponsive. I brought the MDS and the OSDs back online, and
>       managed to get things running again with minimal data loss. I had
>       to mark a few objects as lost, but things were apparently
>       running fine at the end of the day on Monday.
> This afternoon, I noticed that one of the OSDs was apparently stuck in
> a crash/restart loop, and the cluster was unhappy. Performance was in
> the tank and "ceph status" was reporting all manner of problems, as one
> would expect if an OSD is misbehaving. I marked the offending OSD out,
> and the cluster started rebalancing as expected. However, a short while
> later I noticed that another OSD had started into a crash/restart loop.
> So I repeated the process, and it happened again. At this point I
> noticed that there were actually two OSDs at a time in this state.
>
> It's as if there's some toxic chunk of data that is getting passed
> around, and when it lands on an OSD it kills it. Complicating that
> theory, however, I tried just stopping an OSD while it was in a bad
> state, and once the cluster started rebalancing with that OSD down and
> not previously marked out, another OSD started crash-looping.
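>
> For reference, the mark-out / stop cycle above maps onto stock commands
> roughly like this; the OSD id is a placeholder and the upstart syntax is
> an assumption about how these nodes are deployed:
>
>  ceph osd out 12                        # let the cluster rebalance away from osd.12
>  stop ceph-osd id=12                    # Ubuntu/upstart; sysvinit: /etc/init.d/ceph stop osd.12
>  tail -f /var/log/ceph/ceph-osd.12.log  # watch for the crash/restart pattern
>  ceph -w                                # cluster-wide view while it rebalances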
>
> I've investigated the disk of the first OSD I found with this problem,
> and it has no apparent corruption on the file system.
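>
> A minimal sketch of that check, assuming XFS under the OSDs and a
> placeholder device name, run with the OSD stopped and the filesystem
> unmounted:
>
>  umount /var/lib/ceph/osd/ceph-0     # osd.0 and /dev/sdb1 are placeholders
>  xfs_repair -n /dev/sdb1             # dry run, report problems only
>  dmesg | grep -iE 'xfs|i/o error'    # look for kernel-level disk complaints
>  smartctl -a /dev/sdb                # drive health, if smartmontools is installed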
>
> I'll follow up to this shortly with links to pastes of log snippets.
> Any input would be appreciated. This is turning into a real cascade
> failure, and I haven't any idea how to stop it.
>
> QH


