Re: Cascading Failure of OSDs

Here's more information I have been able to glean:

pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last acting [24]
pg 3.690 is stuck inactive for 11991.281739, current state incomplete, last acting [24]
pg 4.ca is stuck inactive for 15905.499058, current state incomplete, last acting [24]
pg 3.5d3 is stuck unclean for 917.471550, current state incomplete, last acting [24]
pg 3.690 is stuck unclean for 11991.281843, current state incomplete, last acting [24]
pg 4.ca is stuck unclean for 15905.499162, current state incomplete, last acting [24]
pg 3.19c is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.ca is incomplete, acting [24] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.7a is incomplete, acting [24] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.6b is incomplete, acting [24] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.6bf is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.690 is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5d3 is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
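
(If I understand the min_size hint, it's suggesting something along the lines of:

    ceph osd pool set volumes min_size 1

for each affected pool, and then raising it back to 2 once the PGs go active again. I haven't tried that yet.)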


However, that list of incomplete pgs keeps changing each time I run "ceph health detail | grep incomplete". For example, here is the output regenerated moments after I captured the list above:

HEALTH_ERR 34 pgs incomplete; 2 pgs inconsistent; 37 pgs peering; 470 pgs stale; 13 pgs stuck inactive; 13 pgs stuck unclean; 4 scrub errors; 1/24 in osds are down; noout,nodeep-scrub flag(s) set
pg 3.da is stuck inactive for 7977.699449, current state incomplete, last acting [19]
pg 3.1a4 is stuck inactive for 6364.787502, current state incomplete, last acting [14]
pg 4.c4 is stuck inactive for 8759.642771, current state incomplete, last acting [14]
pg 3.4fa is stuck inactive for 8173.078486, current state incomplete, last acting [14]
pg 3.372 is stuck inactive for 6706.018758, current state incomplete, last acting [14]
pg 3.4ca is stuck inactive for 7121.446109, current state incomplete, last acting [14]
pg 0.6 is stuck inactive for 8759.591368, current state incomplete, last acting [14]
pg 3.343 is stuck inactive for 7996.560271, current state incomplete, last acting [14]
pg 3.453 is stuck inactive for 6420.686656, current state incomplete, last acting [14]
pg 3.4c1 is stuck inactive for 7049.443221, current state incomplete, last acting [14]
pg 3.80 is stuck inactive for 7587.105164, current state incomplete, last acting [14]
pg 3.4a7 is stuck inactive for 5506.691333, current state incomplete, last acting [14]
pg 3.5ce is stuck inactive for 7153.943506, current state incomplete, last acting [14]
pg 3.da is stuck unclean for 11816.026865, current state incomplete, last acting [19]
pg 3.1a4 is stuck unclean for 8759.633093, current state incomplete, last acting [14]
pg 3.4fa is stuck unclean for 8759.658848, current state incomplete, last acting [14]
pg 4.c4 is stuck unclean for 8759.642866, current state incomplete, last acting [14]
pg 3.372 is stuck unclean for 8759.662338, current state incomplete, last acting [14]
pg 3.4ca is stuck unclean for 8759.603350, current state incomplete, last acting [14]
pg 0.6 is stuck unclean for 8759.591459, current state incomplete, last acting [14]
pg 3.343 is stuck unclean for 8759.645236, current state incomplete, last acting [14]
pg 3.453 is stuck unclean for 8759.643875, current state incomplete, last acting [14]
pg 3.4c1 is stuck unclean for 8759.606092, current state incomplete, last acting [14]
pg 3.80 is stuck unclean for 8759.644522, current state incomplete, last acting [14]
pg 3.4a7 is stuck unclean for 12723.462164, current state incomplete, last acting [14]
pg 3.5ce is stuck unclean for 10024.882545, current state incomplete, last acting [14]
pg 3.1a4 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.1a1 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.138 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.da is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.c4 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.80 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.70 is incomplete, acting [19] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.57 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4c is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.18 is incomplete, acting [19] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.13 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 0.6 is incomplete, acting [14] (reducing pool data min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.7dc is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.6b4 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.692 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5fc is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5ce is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4fa is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4ca is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4c1 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4a7 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.460 is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.453 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.394 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.372 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.343 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.337 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.321 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.2c0 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.27c is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.27e is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.244 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.207 is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')


Why would this keep changing? It seems like it must be the OSDs cycling through their crash loops and only reporting accurately some of the time, which makes it difficult to get an accurate view of the extent of the damage.
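
To try to pin down whether the reporting itself is flapping, I may start capturing timestamped snapshots and diffing them, something like:

    ceph health detail | grep incomplete | sort > /tmp/incomplete.$(date +%s)
    diff /tmp/incomplete.<earlier> /tmp/incomplete.<later>

and querying individual PGs directly (e.g. "ceph pg 3.5d3 query") to see what they report rather than relying on the summary.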


On Fri, Mar 6, 2015 at 8:30 PM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
Thanks for the response. Is this the post you are referring to? http://ceph.com/community/incomplete-pgs-oh-my/

For what it's worth, this cluster was running happily for the better part of a year until the event from this weekend that I described in my first post, so I doubt it's a configuration issue. I suppose it could be some edge-casey thing that only came up just now, but that seems unlikely. Our usage of this cluster has been much heavier in the past than it has been recently.

And yes, I have what looks to be about 8 pg shards on several OSDs that seem to be in this state, but it's hard to say for sure. It seems like each time I look at this more problems are popping up.

On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
This might be related to the backtrace assert, but that's the problem
you need to focus on. In particular, both of these errors are caused
by the scrub code, which Sage suggested temporarily disabling — if
you're still getting these messages, you clearly haven't done so
successfully.
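
For reference, disabling scrubbing entirely means setting both flags,
something like:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

and then confirming they both show up in the flags line of "ceph -s".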

That said, it looks like the problem is that the object and/or object
info specified here are just totally busted. You probably want to
figure out what happened there, since these errors are normally the
result of a misconfiguration somewhere (e.g., setting nobarrier on the
fs mount and then losing power). I'm not sure if there's a good way to
repair the object, but if you can afford to lose the data, I'd grab
ceph-objectstore-tool and use it to remove the object from each OSD
holding it. (There's a walkthrough of using it for a similar situation
in a recent Ceph blog post.)
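
Roughly, with the OSD stopped, that would look something like the
following (the exact tool name and flags vary by release, so check its
--help on your build; paths assume the default OSD layout):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
        --journal-path /var/lib/ceph/osd/ceph-22/journal \
        --op list | grep 3f7a2ae8944a

to find the object's JSON spec, then the same invocation with that spec
and "remove" in place of "--op list". The blog post walks through it in
more detail.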

On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> Alright, I've tried a few suggestions for repairing this state, but I don't
> seem to have any PG replicas with good copies of the missing / zero-length
> shards. What do I do now? Telling the PGs to repair doesn't seem to help.
> I can deal with data loss if I can figure out which images might be
> damaged; I just need to get the cluster consistent enough that the things
> that aren't damaged are usable.
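>
> (To work out which images are affected, I'm assuming I can match the
> rbd_data prefix from the errors against the block_name_prefix that
> "rbd info" reports, something like:
>
>     for img in $(rbd ls volumes); do
>         rbd info volumes/"$img" | grep -q 3f7a2ae8944a && echo "$img"
>     done
>
> with the pool name adjusted as needed.)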
>
> Also, I'm seeing these similar, but not quite identical, error messages as
> well. I assume they are referring to the same root problem:
>
> -1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
> soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 != known
> size 4194304

Mmm, unfortunately that's a different object than the one referenced
in the earlier crash. Maybe it's repairable, or it might be the same
issue — looks like maybe you've got some widespread data loss.
-Greg

>
>
>
> On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Finally found an error that seems to provide some direction:
>>
>> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
>> e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does
>> not match object info size (4120576) ajusted for ondisk to (4120576)
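>>
>> To sanity-check that, I'm guessing I can stat the object directly and see
>> which OSDs hold it with something like:
>>
>>     ceph osd map volumes rbd_data.3f7a2ae8944a.00000000000016c8
>>     rados -p volumes stat rbd_data.3f7a2ae8944a.00000000000016c8
>>
>> though I haven't confirmed exactly what those report yet.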
>>
>> I'm diving into google now and hoping for something useful. If anyone has
>> a suggestion, I'm all ears!
>>
>> QH
>>
>> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
>> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Thanks for the suggestion, but that doesn't seem to have made a
>>> difference.
>>>
>>> I've shut the entire cluster down and brought it back up, and my config
>>> management system seems to have upgraded ceph to 0.80.8 during the reboot.
>>> Everything seems to have come back up, but I am still seeing the crash
>>> loops, so that seems to indicate that this is definitely something
>>> persistent, probably tied to the OSD data, rather than some weird transient
>>> state.
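>>>
>>> (To confirm everything actually came back on the same version after the
>>> reboot, I figure I can check each OSD with something like
>>> "ceph tell osd.0 version" and make sure they all report 0.80.8.)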
>>>
>>>
>>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>
>>>> It looks like you may be able to work around the issue for the moment
>>>> with
>>>>
>>>>  ceph osd set nodeep-scrub
>>>>
>>>> as it looks like it is scrub that is getting stuck?
>>>>
>>>> sage
>>>>
>>>>
>>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>>
>>>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>>>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>>> > an osd crash log (in github gist because it was too big for pastebin)
>>>> > -
>>>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>>> >
>>>> > And now I've got four OSDs that are looping.....
>>>> >
>>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>>> > <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> >       So I'm in the middle of trying to triage a problem with my ceph
>>>> >       cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>>>> >       The cluster has been running happily for about a year. This last
>>>> >       weekend, something caused the box running the MDS to seize hard,
>>>> >       and when we came in on monday, several OSDs were down or
>>>> >       unresponsive. I brought the MDS and the OSDs back online, and
>>>> >       managed to get things running again with minimal data loss. Had
>>>> >       to mark a few objects as lost, but things were apparently
>>>> >       running fine at the end of the day on Monday.
>>>> > This afternoon, I noticed that one of the OSDs was apparently stuck in
>>>> > a crash/restart loop, and the cluster was unhappy. Performance was in
>>>> > the tank and "ceph status" was reporting all manner of problems, as one
>>>> > would expect if an OSD is misbehaving. I marked the offending OSD out,
>>>> > and the cluster started rebalancing as expected. However, a short while
>>>> > later I noticed that another OSD had started into a crash/restart loop.
>>>> > So I repeated the process, and it happened again. At this point I
>>>> > noticed that there are actually two at a time in this state.
>>>> >
>>>> > It's as if there's some toxic chunk of data that is getting passed
>>>> > around, and when it lands on an OSD it kills it. Arguing against that,
>>>> > however, I tried just stopping an OSD while it was in a bad state, and
>>>> > once the cluster started rebalancing with that OSD down (and not
>>>> > previously marked out), another OSD started crash-looping.
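>>>> >
>>>> > (As a rough check on that theory, I plan to grep the crashing OSDs' logs
>>>> > for the assert and see whether the same pg or object shows up each time,
>>>> > e.g.:
>>>> >
>>>> >     grep -B 20 'FAILED assert' /var/log/ceph/ceph-osd.*.log
>>>> >
>>>> > assuming the default log locations.)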
>>>> >
>>>> > I've investigated the disk of the first OSD I found with this problem,
>>>> > and it has no apparent corruption on the file system.
>>>> >
>>>> > I'll follow up to this shortly with links to pastes of log snippets.
>>>> > Any input would be appreciated. This is turning into a real cascade
>>>> > failure, and I haven't any idea how to stop it.
>>>> >
>>>> > QH
>>>> >
>>>> >
>>>> >
>>>> >
>>>
>>>
>>
>
>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
