Hi Paul,

Many thanks for your helpful suggestions.

Yes, we have 13 pgs with "might_have_unfound" entries.
(also 1 pg without "might_have_unfound" stuck in the
active+recovery_unfound+degraded+repair state)

Taking one pg with unfound objects:

[root@ceph1 ~]# ceph health detail | grep 5.5c9
    pg 5.5c9 has 2 unfound objects
    pg 5.5c9 is active+recovery_unfound+degraded, acting [347,442,381,215,91,260,31,94,178,302], 2 unfound
    pg 5.5c9 is active+recovery_unfound+degraded, acting [347,442,381,215,91,260,31,94,178,302], 2 unfound
    pg 5.5c9 not deep-scrubbed since 2020-01-16 08:05:43.119336
    pg 5.5c9 not scrubbed since 2020-01-16 08:05:43.119336

Checking the state:

[root@ceph1 ~]# ceph pg 5.5c9 query | jq .recovery_state
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2020-02-03 09:57:30.982038",
    "might_have_unfound": [
      {
        "osd": "31(6)",
        "status": "already probed"
      },
      {
        "osd": "91(4)",
        "status": "already probed"
      },
      {
        "osd": "94(7)",
        "status": "already probed"
      },
      {
        "osd": "178(8)",
        "status": "already probed"
      },
      {
        "osd": "215(3)",
        "status": "already probed"
      },
      {
        "osd": "260(5)",
        "status": "already probed"
      },
      {
        "osd": "302(9)",
        "status": "already probed"
      },
      {
        "osd": "381(2)",
        "status": "already probed"
      },
      {
        "osd": "442(1)",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "recovery_ops": [],
        "read_ops": []
      }
    },
    "scrub": {
      "scrubber.epoch_start": "0",
      "scrubber.active": false,
      "scrubber.state": "INACTIVE",
      "scrubber.start": "MIN",
      "scrubber.end": "MIN",
      "scrubber.max_end": "MIN",
      "scrubber.subset_last_update": "0'0",
      "scrubber.deep": false,
      "scrubber.waiting_on_whom": []
    }
  },
  {
    "name": "Started",
    "enter_time": "2020-02-03 09:57:29.788310"
  }
]

-----------------------------------------------------

Taking your advice, I restarted the primary OSD for this pg:

[root@ceph1 ~]# ceph osd down 347

This doesn't change the output of "ceph pg 5.5c9 query", apart from
updating the Started time, and ceph health still shows unfound objects.

To fix this, do we need to issue a scrub (or deep scrub) so that the
objects can be found?

Just in case, I've issued a manual scrub:

[root@ceph1 ~]# ceph pg scrub 5.5c9
instructing pg 5.5c9s0 on osd.347 to scrub

The cluster is currently busy deleting snapshots, so it may take a
while before the scrub starts.

best regards,

Jake

On 2/3/20 6:31 PM, Paul Emmerich wrote:
> This might be related to recent problems with OSDs not being queried
> for unfound objects properly in some cases (which I think was fixed in
> master?)
>
> Anyways: run ceph pg <pg> query on the affected PGs, check for "might
> have unfound" and try restarting the OSDs mentioned there. Probably
> also sufficient to just run "ceph osd down" on the primaries of the
> affected PGs to get them to re-check.
>
>
> Paul

--
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
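
To work through all 13 affected PGs in one pass, a small loop along the
following lines could be used. This is only a rough sketch, not taken from
the thread above: it assumes jq is installed and simply scrapes the PG ids
from the "pg ... has ... unfound objects" lines of "ceph health detail",
then prints each PG's "might_have_unfound" list from "ceph pg query".

    # Sketch (untested): for every PG reporting unfound objects, show
    # which OSDs it thinks might still hold the missing copies.
    for pg in $(ceph health detail | awk '/ has .* unfound object/ {print $2}'); do
        echo "== ${pg} =="
        ceph pg "${pg}" query | \
            jq '.recovery_state[] | select(.might_have_unfound != null) | .might_have_unfound'
    done

The OSDs listed there (or just the primary of each PG, as Paul suggests)
could then be marked down with "ceph osd down" to trigger a re-probe.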