'incomplete' PGs: what does it mean?

john@xxxxxxxxxxx (John Morris) · Fri, 29 Aug 2014 02:45:04 -0400 (EDT)

Greg, thanks for the tips in both this and the BTRFS_IOC_SNAP_CREATE
thread.  They were enough to resolve the PGs that were 'incomplete' due
to 'not enough OSDs hosting': rolling back to a btrfs snapshot fixed
them.  I promise to write a full post-mortem (embarrassing as it will
be) after the cluster is fully healthy.

As luck would have it, the cluster *also* has eight 'incomplete' PGs of
the *other*, 'missing log' sort:

2014-08-28 23:42:03.350612 7f1cc9d82700 20 osd.1 pg_epoch: 13085
  pg[2.5c( v 10143'300715 (9173'297714,10143'300715] local-les=11890
  n=3404 ec=1 les/c 11890/10167 12809/12809/12809) [7,3,0,4]
  r=-1 lpr=12809 pi=8453-12808/138 lcod 0'0
  inactive NOTIFY]
  handle_activate_map: Not dirtying info:
  last_persisted is 13010 while current is 13085
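
For anyone reading along: the version fields in that log line use the
epoch'version notation, and the bracketed pair is the log range (tail,
head].  A minimal Python sketch of pulling them apart, using only the
values from the line above.  The names parse_eversion and version_span
are illustrative and not part of any Ceph tool, and the comment on what
"epoch" means reflects my understanding rather than anything stated in
this thread.

```python
from typing import NamedTuple


class EVersion(NamedTuple):
    epoch: int    # map epoch in which the entry was written (assumed meaning)
    version: int  # per-PG version counter


def parse_eversion(text: str) -> EVersion:
    """Split an "epoch'version" string such as "10143'300715"."""
    epoch, version = text.split("'")
    return EVersion(int(epoch), int(version))


def version_span(tail: EVersion, head: EVersion) -> int:
    """Number of version counters in the half-open log range (tail, head]."""
    return head.version - tail.version


# Values copied from the osd.1 log line for pg 2.5c.
head = parse_eversion("10143'300715")
tail = parse_eversion("9173'297714")
span = version_span(tail, head)
print(f"log spans {span} versions, epochs {tail.epoch}..{head.epoch}")
```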

The data clearly exists in the renegade OSD's data directory.  There
are no reported 'unfound' objects, so 'mark_unfound_lost revert' doesn't
apply.  'ceph pg ... query' shows nothing obviously useful, but an
example is pasted below.

Since the beginning of the cluster rebuild, all ceph clients have been
turned off, so I believe there is no risk of losing data by reverting to
these PGs, and besides, there are always the backup tapes.

How can Ceph be told to accept the versions it sees on osd.1 as the
most current version, and forget the missing log history?

       John

$ ceph pg 2.5c query
{ "state": "incomplete",
  "epoch": 13241,
  "up": [
        7,
        1,
        0,
        4],
  "acting": [
        7,
        1,
        0,
        4],
  "info": { "pgid": "2.5c",
      "last_update": "10143'300715",
      "last_complete": "10143'300715",
      "log_tail": "9173'297715",
      "last_backfill": "0\/\/0\/\/-1",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 11890,
          "last_epoch_clean": 10167,
          "last_epoch_split": 0,
          "same_up_since": 13229,
          "same_interval_since": 13229,
          "same_primary_since": 13118,
          "last_scrub": "10029'298459",
          "last_scrub_stamp": "2014-08-18 17:36:01.079649",
          "last_deep_scrub": "8323'284793",
          "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
          "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649"},
      "stats": { "version": "10143'300715",
          "reported_seq": "1764",
          "reported_epoch": "13241",
          "state": "incomplete",
          "last_fresh": "2014-08-29 01:35:44.196909",
          "last_change": "2014-08-29 01:22:50.298880",
          "last_active": "0.000000",
          "last_clean": "0.000000",
          "last_became_active": "0.000000",
          "last_unstale": "2014-08-29 01:35:44.196909",
          "mapping_epoch": 13223,
          "log_start": "9173'297715",
          "ondisk_log_start": "9173'297715",
          "created": 1,
          "last_epoch_clean": 10167,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "10029'298459",
          "last_scrub_stamp": "2014-08-18 17:36:01.079649",
          "last_deep_scrub": "8323'284793",
          "last_deep_scrub_stamp": "2014-08-15 17:38:06.229106",
          "last_clean_scrub_stamp": "2014-08-18 17:36:01.079649",
          "log_size": 3000,
          "ondisk_log_size": 3000,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                7,
                1,
                0,
                4],
          "acting": [
                7,
                1,
                0,
                4]},
      "empty": 0,
      "dne": 0,
      "incomplete": 1,
      "last_epoch_started": 10278},
  "recovery_state": [
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2014-08-29 01:22:50.132826",
          "past_intervals": [
[...]
                { "first": 12809,
                  "last": 13101,
                  "maybe_went_rw": 1,
                  "up": [
                        7,
                        3,
                        0,
                        4],
                  "acting": [
                        7,
                        3,
                        0,
                        4]},
[...]
],
          "probing_osds": [
                0,
                1,
                2,
                3,
                4,
                5,
                7],
          "down_osds_we_would_probe": [],
          "peering_blocked_by": []},
        { "name": "Started",
          "enter_time": "2014-08-29 01:22:50.132784"}]}
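
To make the handful of relevant fields easier to compare across the
eight PGs, here is a rough Python sketch that pulls them out of the
query JSON.  The embedded document is a hand-trimmed copy of the output
above (the elided past_intervals are left out, and the state name is
written with plain slashes); the function name summarize and the choice
of fields are assumptions about what is worth looking at, not an
official diagnostic.

```python
import json

# Hand-trimmed copy of the 'ceph pg 2.5c query' output; only the fields
# read by summarize() are kept.
QUERY_JSON = """
{ "state": "incomplete",
  "acting": [7, 1, 0, 4],
  "info": {
    "last_update": "10143'300715",
    "log_tail": "9173'297715",
    "history": { "last_epoch_started": 11890, "last_epoch_clean": 10167 },
    "stats": { "stat_sum": { "num_objects_unfound": 0 } },
    "last_epoch_started": 10278 },
  "recovery_state": [
    { "name": "Started/Primary/Peering",
      "probing_osds": [0, 1, 2, 3, 4, 5, 7],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": [] } ] }
"""


def summarize(query: dict) -> dict:
    """Collect the fields discussed in this thread into one flat dict."""
    info = query["info"]
    peering = query["recovery_state"][0]
    return {
        "state": query["state"],
        "acting": query["acting"],
        "unfound": info["stats"]["stat_sum"]["num_objects_unfound"],
        "history_last_epoch_started": info["history"]["last_epoch_started"],
        "info_last_epoch_started": info["last_epoch_started"],
        "down_osds_we_would_probe": peering["down_osds_we_would_probe"],
        "peering_blocked_by": peering["peering_blocked_by"],
    }


summary = summarize(json.loads(QUERY_JSON))
for key, value in summary.items():
    print(f"{key}: {value}")
```

On a live cluster the same function could be fed the full output of
'ceph pg <pgid> query' instead of the trimmed copy.  It only reports the
values; what the difference between the two last_epoch_started numbers
implies is exactly the open question here.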

On Wed, Aug 27, 2014 at 12:40 PM, Gregory Farnum <greg at inktank.com>
wrote:

> On Tue, Aug 26, 2014 at 10:46 PM, John Morris <john at zultron.com>
> wrote:
> > In the docs [1], 'incomplete' is defined thusly:
> >
> >   Ceph detects that a placement group is missing a necessary period
> >   of history from its log. If you see this state, report a bug, and
> >   try to start any failed OSDs that may contain the needed
> >   information.
> >
> > However, during an extensive review of list postings related to
> > incomplete PGs, an alternate and oft-repeated definition is
> > something like 'the number of existing replicas is less than the
> > min_size of the pool'.  In no list posting was there any
> > acknowledgement of the definition from the docs.
> >
> > While trying to understand what 'incomplete' PGs are, I simply set
> > min_size = 1 on this cluster with incomplete PGs, and they continue
> > to be 'incomplete'.  Does this mean that definition #2 is incorrect?
> >
> > In case #1 is correct, how can the cluster be told to forget the
> > lapse in history?  In our case, there was nothing writing to the
> > cluster during the OSD reorganization that could have caused this
> > lapse.
> 
> Yeah, these two meanings can (unfortunately) both lead to the
> INCOMPLETE state being reported. I think that's going to be fixed in
> our next major release (so that INCOMPLETE means "not enough OSDs
> hosting" and "missing log" will translate into something else), but
> for now the "not enough OSDs" is by far the more common. In your case
> you probably are missing history, but you don't want to recover from
> it using any of the cluster tools because they're likely to lose more
> data than necessary. (Hopefully, you can just roll back to a slightly
> older btrfs snapshot, but we'll see).
> -Greg
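
The min_size test described in the quoted message amounts to a simple
check, sketched here in Python.  This only mirrors the reasoning in the
thread (if the acting set already meets min_size, the replica-count
meaning cannot explain the state); it is not how Ceph itself decides,
and the function name is made up for illustration.

```python
from typing import Sequence


def replica_count_explains_incomplete(acting: Sequence[int], min_size: int) -> bool:
    """True if the acting set is smaller than the pool's min_size."""
    return len(acting) < min_size


acting = [7, 1, 0, 4]  # acting set from the 'ceph pg 2.5c query' output
min_size = 1           # value set on the pool during the test

if replica_count_explains_incomplete(acting, min_size):
    verdict = "not enough OSDs hosting"
else:
    verdict = "replica count is fine; missing log history is the remaining candidate"
print(verdict)
```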