Being in ceph-land today reminded me that I needed to close the loop on this. I was finally able to isolate this problem down to a faulty NIC on the ceph cluster network. It "worked", but it was accumulating a huge number of Rx errors. My best guess is that some of its receive buffer memory failed. In any case, having a NIC go flaky like that is consistent with all the weird problems I was seeing: the corrupted PGs and the cluster's inability to settle down.
As a result, we've added NIC error rates to the cluster's monitoring suite, so hopefully we'll see this coming if it ever happens again.
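
A rough sketch of the kind of check we wired in (the interface name and threshold are placeholders, a real check would track the delta between runs rather than the raw counter, and the alerting hook depends on whatever monitoring stack you use):

    #!/bin/sh
    # Report the cumulative Rx error counter for a NIC (Linux sysfs).
    # Usage: check_rx_errors.sh <iface> [threshold]
    IFACE="${1:?usage: check_rx_errors.sh <iface> [threshold]}"
    THRESHOLD="${2:-100}"   # placeholder threshold
    ERRORS=$(cat "/sys/class/net/${IFACE}/statistics/rx_errors")
    if [ "$ERRORS" -gt "$THRESHOLD" ]; then
        echo "CRITICAL: ${IFACE} rx_errors=${ERRORS} (threshold ${THRESHOLD})"
        exit 2
    fi
    echo "OK: ${IFACE} rx_errors=${ERRORS}"
    exit 0
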
QH
On Sat, Mar 7, 2015 at 11:36 AM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
So I'm not sure what has changed, but in the last 30 minutes the errors, which were all over the place, have finally settled down to this: http://pastebin.com/VuCKwLDp

The only thing I can think of is that I also set the noscrub flag in addition to nodeep-scrub when I first got here, and that finally "took". Anyway, they've been stable there for some time now, and I've been able to get a couple of VMs to come up and behave reasonably well. At this point I'm prepared to wipe the entire cluster and start over if I have to in order to get it truly consistent again, since my efforts to zap pg 3.75b haven't borne fruit. However, if anyone has a less nuclear option they'd like to suggest, I'm all ears.

I've tried to export / re-import the pg and do a force_create. The import failed, and the force_create just reverted back to being incomplete after "creating" for a few minutes.

QH

On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:

Now that I have a better understanding of what's happening, I threw together a little one-liner to create a report of the errors that the OSDs are seeing. Lots of missing / corrupted pg shards: https://gist.github.com/qhartman/174cc567525060cb462e

I've experimented with exporting / importing the broken pgs with ceph_objectstore_tool, and while they seem to export correctly, the tool crashes when trying to import:

root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import --data-path /var/lib/ceph/osd/ceph-19/ --journal-path /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
Importing pgid 3.75b
Write 2672075b/rbd_data.2bce2ae8944a.0000000000001509/head//3
Write 3473075b/rbd_data.1d6172ae8944a.000000000001636a/head//3
Write f2e4075b/rbd_data.c816f2ae8944a.0000000000000208/head//3
Write f215075b/rbd_data.c4a892ae8944a.0000000000000b6b/head//3
Write c086075b/rbd_data.42a742ae8944a.00000000000002fb/head//3
Write 6f9d075b/rbd_data.1d6172ae8944a.0000000000005ac3/head//3
Write dd9f075b/rbd_data.1d6172ae8944a.000000000001127d/head//3
Write f9f075b/rbd_data.c4a892ae8944a.000000000000f056/head//3
Write 4d71175b/rbd_data.c4a892ae8944a.0000000000009e51/head//3
Write bcc3175b/rbd_data.2bce2ae8944a.000000000000133f/head//3
Write 1356175b/rbd_data.3f862ae8944a.00000000000005d6/head//3
Write d327175b/rbd_data.1d6172ae8944a.000000000001af85/head//3
Write 7388175b/rbd_data.2bce2ae8944a.0000000000001353/head//3
Write 8cda175b/rbd_data.c4a892ae8944a.000000000000b585/head//3
Write 6b3c175b/rbd_data.c4a892ae8944a.0000000000018e91/head//3
Write d37f175b/rbd_data.1d6172ae8944a.0000000000003a90/head//3
Write 4590275b/rbd_data.2bce2ae8944a.0000000000001f67/head//3
Write fe51275b/rbd_data.c4a892ae8944a.000000000000e917/head//3
Write 3402275b/rbd_data.3f5c2ae8944a.0000000000001252/6//3
osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&, const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
osd/SnapMapper.cc: 228: FAILED assert(r == -2)
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb94fbb]
2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&, SnapMapper&)+0x67c) [0x661a1c]
4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5) [0x661f85]
5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
6: (main()+0x2208) [0x63f178]
7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
8: ceph_objectstore_tool() [0x659577]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7fba67ff3900
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
1: ceph_objectstore_tool() [0xab1cea]
2: (()+0x10340) [0x7fba66a95340]
3: (gsignal()+0x39) [0x7fba627c7cc9]
4: (abort()+0x148) [0x7fba627cb0d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
6: (()+0x5e836) [0x7fba630d0836]
7: (()+0x5e863) [0x7fba630d0863]
8: (()+0x5eaa2) [0x7fba630d0aa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb951a8]
10: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
11: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&, SnapMapper&)+0x67c) [0x661a1c]
12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5) [0x661f85]
13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
14: (main()+0x2208) [0x63f178]
15: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
16: ceph_objectstore_tool() [0x659577]
Aborted (core dumped)

Which I suppose is expected if it's importing from bad pg data. At this point I'm really most interested in what I can do to get this cluster consistent as quickly as possible, so I can start coping with the data loss in the VMs and restoring from backups where needed. Any guidance in that direction would be appreciated. Something along the lines of "give up on that busted pg" is what I'm thinking of, but I haven't noticed anything that seems to approximate that yet.

Thanks

QH

On Fri, Mar 6, 2015 at 8:47 PM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:

Here's more information I have been able to glean:

pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last acting [24]
pg 3.690 is stuck inactive for 11991.281739, current state incomplete, last acting [24]
pg 4.ca is stuck inactive for 15905.499058, current state incomplete, last acting [24]
pg 3.5d3 is stuck unclean for 917.471550, current state incomplete, last acting [24]
pg 3.690 is stuck unclean for 11991.281843, current state incomplete, last acting [24]
pg 4.ca is stuck unclean for 15905.499162, current state incomplete, last acting [24]
pg 3.19c is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.ca is incomplete, acting [24] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.7a is incomplete, acting [24] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.6b is incomplete, acting [24] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.6bf is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.690 is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5d3 is incomplete, acting [24] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')

However, that list of incomplete pgs keeps changing each time I run "ceph health detail | grep incomplete".
For example, here is the output regenerated moments after I created the above:

HEALTH_ERR 34 pgs incomplete; 2 pgs inconsistent; 37 pgs peering; 470 pgs stale; 13 pgs stuck inactive; 13 pgs stuck unclean; 4 scrub errors; 1/24 in osds are down; noout,nodeep-scrub flag(s) set
pg 3.da is stuck inactive for 7977.699449, current state incomplete, last acting [19]
pg 3.1a4 is stuck inactive for 6364.787502, current state incomplete, last acting [14]
pg 4.c4 is stuck inactive for 8759.642771, current state incomplete, last acting [14]
pg 3.4fa is stuck inactive for 8173.078486, current state incomplete, last acting [14]
pg 3.372 is stuck inactive for 6706.018758, current state incomplete, last acting [14]
pg 3.4ca is stuck inactive for 7121.446109, current state incomplete, last acting [14]
pg 0.6 is stuck inactive for 8759.591368, current state incomplete, last acting [14]
pg 3.343 is stuck inactive for 7996.560271, current state incomplete, last acting [14]
pg 3.453 is stuck inactive for 6420.686656, current state incomplete, last acting [14]
pg 3.4c1 is stuck inactive for 7049.443221, current state incomplete, last acting [14]
pg 3.80 is stuck inactive for 7587.105164, current state incomplete, last acting [14]
pg 3.4a7 is stuck inactive for 5506.691333, current state incomplete, last acting [14]
pg 3.5ce is stuck inactive for 7153.943506, current state incomplete, last acting [14]
pg 3.da is stuck unclean for 11816.026865, current state incomplete, last acting [19]
pg 3.1a4 is stuck unclean for 8759.633093, current state incomplete, last acting [14]
pg 3.4fa is stuck unclean for 8759.658848, current state incomplete, last acting [14]
pg 4.c4 is stuck unclean for 8759.642866, current state incomplete, last acting [14]
pg 3.372 is stuck unclean for 8759.662338, current state incomplete, last acting [14]
pg 3.4ca is stuck unclean for 8759.603350, current state incomplete, last acting [14]
pg 0.6 is stuck unclean for 8759.591459, current state incomplete, last acting [14]
pg 3.343 is stuck unclean for 8759.645236, current state incomplete, last acting [14]
pg 3.453 is stuck unclean for 8759.643875, current state incomplete, last acting [14]
pg 3.4c1 is stuck unclean for 8759.606092, current state incomplete, last acting [14]
pg 3.80 is stuck unclean for 8759.644522, current state incomplete, last acting [14]
pg 3.4a7 is stuck unclean for 12723.462164, current state incomplete, last acting [14]
pg 3.5ce is stuck unclean for 10024.882545, current state incomplete, last acting [14]
pg 3.1a4 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.1a1 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.138 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.da is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.c4 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.80 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.70 is incomplete, acting [19] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.57 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4c is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.18 is incomplete, acting [19] (reducing pool backups min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 4.13 is incomplete, acting [14] (reducing pool images min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 0.6 is incomplete, acting [14] (reducing pool data min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.7dc is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.6b4 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.692 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5fc is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.5ce is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4fa is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4ca is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4c1 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.4a7 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.460 is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.453 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.394 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.372 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.343 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.337 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.321 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.2c0 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.27c is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.27e is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.244 is incomplete, acting [14] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 3.207 is incomplete, acting [19] (reducing pool volumes min_size from 2 may help; search ceph.com/docs for 'incomplete')

Why would this keep changing? It seems like it would have to be because of the OSDs running through their crash loops, only accurately reporting from time to time, making it difficult to get an accurate view of the extent of the damage.

On Fri, Mar 6, 2015 at 8:30 PM, Quentin Hartman <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:

Thanks for the response. Is this the post you are referring to?
http://ceph.com/community/incomplete-pgs-oh-my/

For what it's worth, this cluster was running happily for the better part of a year until the event from this weekend that I described in my first post, so I doubt it's a configuration issue. I suppose it could be some edge-casey thing that only came up just now, but that seems unlikely. Our usage of this cluster has been much heavier in the past than it has been recently.

And yes, I have what looks to be about 8 pg shards on several OSDs that seem to be in this state, but it's hard to say for sure. It seems like each time I look at this, more problems are popping up.

On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:

This might be related to the backtrace assert, but that's the problem
you need to focus on. In particular, both of these errors are caused
by the scrub code, which Sage suggested temporarily disabling — if
you're still getting these messages, you clearly haven't done so
successfully.
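
(For reference, the flags in question, plus a quick way to confirm they actually took -- the commands are the standard CLI, and the grep is just illustrative:)

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph osd dump | grep flags   # the flags line should now list noscrub,nodeep-scrub
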
That said, it looks like the problem is that the object and/or object
info specified here are just totally busted. You probably want to
figure out what happened there, since these errors are normally caused by a
misconfiguration somewhere (e.g., setting nobarrier on an fs mount and
then losing power). I'm not sure if there's a good way to repair the
object, but if you can lose the data I'd grab the ceph-objectstore
tool and remove the object from each OSD holding it that way. (There's
a walkthrough of using it for a similar situation in a recent Ceph
blog post.)
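
(Roughly what that removal looks like, as a sketch only -- the OSD has to be
stopped first, the paths, pgid, and object spec below are placeholders, and the
list/remove syntax should be checked against ceph_objectstore_tool --help for
your release:)

    # with the OSD stopped:
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-19 \
        --journal-path /var/lib/ceph/osd/ceph-19/journal \
        --op list --pgid 3.75b
    # then remove the bad object using the spec reported by the list op:
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-19 \
        --journal-path /var/lib/ceph/osd/ceph-19/journal \
        '<object-spec-from-list>' remove
    # restart the OSD and repeat on each OSD that holds a copy
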
On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> Alright, I've tried a few suggestions for repairing this state, but I don't
> seem to have any PG replicas that hold good copies of the missing / zero-length
> shards. What do I do now? Telling the PGs to repair doesn't seem to help
> anything. I can deal with data loss if I can figure out which images might
> be damaged; I just need to get the cluster consistent enough that the things
> which aren't damaged are usable.
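>
> (For reference, "telling the PGs to repair" here means something along these
> lines -- the pgid is a placeholder to be taken from the health output:)
>
>     ceph health detail | grep inconsistent   # list the pgs reporting scrub errors
>     ceph pg repair <pgid>                    # ask the primary OSD to repair that pg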
>
> Also, I'm seeing these similar, but not quite identical, error messages as
> well. I assume they are referring to the same root problem:
>
> -1> 2015-03-07 03:12:49.217295 7fc8ab343700 0 log [ERR] : 3.69d shard 22:
> soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 != known
> size 4194304
Mmm, unfortunately that's a different object than the one referenced
in the earlier crash. Maybe it's repairable, or it might be the same
issue — looks like maybe you've got some widespread data loss.
-Greg
>
>
>
> On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Finally found an error that seems to provide some direction:
>>
>> -1> 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e
>> e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does
>> not match object info size (4120576) ajusted for ondisk to (4120576)
>>
>> I'm diving into google now and hoping for something useful. If anyone has
>> a suggestion, I'm all ears!
>>
>> QH
>>
>> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
>> <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Thanks for the suggestion, but that doesn't seem to have made a
>>> difference.
>>>
>>> I've shut the entire cluster down and brought it back up, and my config
>>> management system seems to have upgraded ceph to 0.80.8 during the reboot.
>>> Everything seems to have come back up, but I am still seeing the crash
>>> loops, so that seems to indicate that this is definitely something
>>> persistent, probably tied to the OSD data, rather than some weird transient
>>> state.
>>>
>>>
>>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>
>>>> It looks like you may be able to work around the issue for the moment
>>>> with
>>>>
>>>> ceph osd set nodeep-scrub
>>>>
>>>> as it looks like it is scrub that is getting stuck?
>>>>
>>>> sage
>>>>
>>>>
>>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>>
>>>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>>>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>>> > an osd crash log (in github gist because it was too big for pastebin) -
>>>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>>> >
>>>> > And now I've got four OSDs that are looping.....
>>>> >
>>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>>> > <qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> > So I'm in the middle of trying to triage a problem with my ceph
>>>> > cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>>>> > The cluster has been running happily for about a year. This last
>>>> > weekend, something caused the box running the MDS to seize hard,
>>>> > and when we came in on Monday, several OSDs were down or
>>>> > unresponsive. I brought the MDS and the OSDs back online, and
>>>> > managed to get things running again with minimal data loss. Had
>>>> > to mark a few objects as lost, but things were apparently
>>>> > running fine at the end of the day on Monday.
>>>> > This afternoon, I noticed that one of the OSDs was apparently stuck in
>>>> > a crash/restart loop, and the cluster was unhappy. Performance was in
>>>> > the tank and "ceph status" was reporting all manner of problems, as one
>>>> > would expect if an OSD is misbehaving. I marked the offending OSD out,
>>>> > and the cluster started rebalancing as expected. However, a short while
>>>> > later I noticed that another OSD had started into a crash/restart loop.
>>>> > So I repeated the process, and it happened again. At this point I
>>>> > noticed that there were actually two at a time in this state.
>>>> >
>>>> > It's as if there's some toxic chunk of data that is getting passed
>>>> > around, and when it lands on an OSD it kills it. Contradicting that,
>>>> > however, when I tried just stopping an OSD while it was in a bad state,
>>>> > once the cluster started trying to rebalance with that OSD down (and not
>>>> > previously marked out), another OSD would start crash-looping.
>>>> >
>>>> > I've investigated the disk of the first OSD I found with this problem,
>>>> > and it has no apparent corruption on the file system.
>>>> >
>>>> > I'll follow up to this shortly with links to pastes of log snippets.
>>>> > Any input would be appreciated. This is turning into a real cascade
>>>> > failure, and I haven't any idea how to stop it.
>>>> >
>>>> > QH
>>>> >
>>>> >
>>>> >
>>>> >
>>>
>>>
>>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com