Re: Troubleshooting Incomplete PGs

Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> · Wed, 22 Oct 2014 17:22:59 -0700

Shot in the dark: try manually deep-scrubbing the PG.  You could also try marking various OSDs OUT, in an attempt to get the acting set to include osd.25 again, then run the deep-scrub again.  I was going to say that probably won't help, because the pg query says it already probed osd.25, but actually it doesn't: osd.25 is in "probing_osds", not "probed_osds".  The deep-scrub might move things along.
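For anyone who wants to make the same probing/probed comparison without eyeballing the query output, here is a rough Python sketch. It assumes you have already extracted the relevant peering section of the pg query into a dict; the helper name `unprobed_osds` is made up, the "probing_osds" list is copied from the query quoted in this thread, and the "probed_osds" values are invented purely for illustration.

```python
import json


def unprobed_osds(peering_state):
    """Return OSD ids listed in probing_osds but absent from probed_osds."""
    probing = peering_state.get("probing_osds", [])
    probed = set(peering_state.get("probed_osds", []))
    # Preserve the order in which the OSDs appear in probing_osds.
    return [osd for osd in probing if osd not in probed]


# Hypothetical excerpt: probing_osds comes from the query quoted in this
# thread; the probed_osds values are an assumption made for the example.
sample = json.loads("""
{
  "probing_osds": ["10", "13", "15", "25"],
  "probed_osds": ["10", "13", "15"],
  "down_osds_we_would_probe": []
}
""")

print(unprobed_osds(sample))
```

Any OSD this returns is one the PG still wants to hear from, which is why getting osd.25 back into the picture might help.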

Re-reading your original post, if you marked the slow osds OUT, but left them running, you should not have lost data.

If the scrubs don't help, it's probably time to hop on IRC.

On Wed, Oct 22, 2014 at 5:08 PM, Chris Kitzmiller <ckitzmiller@xxxxxxxxxxxxx> wrote:
On Oct 22, 2014, at 7:51 PM, Craig Lewis wrote:

> On Wed, Oct 22, 2014 at 3:09 PM, Chris Kitzmiller <ckitzmiller@xxxxxxxxxxxxx> wrote:

>> On Oct 22, 2014, at 1:50 PM, Craig Lewis wrote:

>>> Incomplete means "Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information".

>>>

>>> In the PG query, it lists some OSDs that it's trying to probe:

>>>           "probing_osds": [
>>>                 "10",
>>>                 "13",
>>>                 "15",
>>>                 "25"],
>>>           "down_osds_we_would_probe": [],

>>>

>>> Is one of those the OSD you replaced?  If so, you might try ceph pg {pg-id} mark_unfound_lost revert|delete.  That command will lose data; it tells Ceph to give up looking for data that it can't find, so you might want to wait a bit.

>>

>> Yes. osd.10 was the OSD I replaced. :( I suspect that I didn't actually have any writes during this time and that a revert might leave me in an OK place.

>>

>> Looking at the query more closely, I see that all of the peers except osd.10 have the same values for last_update/last_complete/last_scrub/last_deep_scrub, while the peer entry on osd.10 has 0 values for everything. It's as if all my OSDs are believing in the ghost of this PG on osd.10. I'd like to revert; I just want to make sure that I'm going to revert to the sane value and not the 0 value.

>

> I've never (successfully) used mark_unfound_lost, so I can't say exactly what'll happen.  revert should be what you need, but I don't know if it's going to revert to the point in time before whatever hole in the history happened, or if it will just give up on the portions of history that it doesn't have.

Huh. So I tried `ceph pg 3.222 mark_unfound_lost revert` and it told me "pg has no unfound objects" and indeed: "num_objects_unfound": 0,

On one of the peers, osd.25 (which isn't in the acting set now and was up+in the whole time) it reports:

        "stat_sum": { "num_bytes": 7080120320,
                "num_objects": 1697,
                "num_object_clones": 0,
                "num_object_copies": 3394,
                "num_objects_missing_on_primary": 0,
                "num_objects_degraded": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 1697,
                "num_whiteouts": 0,
                "num_read": 72828,
                "num_read_kb": 8794722,
                "num_write": 32405,
                "num_write_kb": 11424120,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 1687,
                "num_bytes_recovered": 7038177280,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0},

So, is it the 10 objects which are dirty but not recovered which are giving me trouble? What can be done to correct these PGs?
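For reference, the "10 objects" figure is just the difference between two counters in the stat_sum above. A small Python sketch of that arithmetic follows, using values copied from the output; the function name `recovery_gap` is made up, and treating the recovered counters as directly comparable to the current object and byte counts is an assumption, not something the query output guarantees.

```python
# Values copied from the stat_sum reported by osd.25 above.
stat_sum = {
    "num_bytes": 7080120320,
    "num_objects": 1697,
    "num_objects_dirty": 1697,
    "num_objects_recovered": 1687,
    "num_bytes_recovered": 7038177280,
}


def recovery_gap(stats):
    """Difference between current counts and the recovered counters.

    Assumption: the *_recovered counters are comparable to the current
    totals, which may not hold if they accumulate across recoveries.
    """
    return {
        "objects": stats["num_objects_dirty"] - stats["num_objects_recovered"],
        "bytes": stats["num_bytes"] - stats["num_bytes_recovered"],
    }


gap = recovery_gap(stat_sum)
print(gap["objects"], gap["bytes"], gap["bytes"] // gap["objects"])
```

The byte gap is 41943040, which divides evenly into 4194304 bytes (4 MiB) per object across the 10 objects, so the two counters at least tell a consistent story.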

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com