This might be related to recent problems with OSDs not being queried
for unfound objects properly in some cases (which I think was fixed in
master?)

Anyways: run "ceph pg <pg> query" on the affected PGs, check for
"might_have_unfound" and try restarting the OSDs mentioned there.

Probably also sufficient to just run "ceph osd down" on the primaries
of the affected PGs to get them to re-check (rough sketch of both
approaches at the end of this mail).

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Feb 3, 2020 at 4:27 PM Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>
> Dear All,
>
> Due to a mistake in my "rolling restart" script, one of our ceph
> clusters now has a number of unfound objects.
>
> There is an 8+2 erasure-coded data pool and a 3x replicated metadata
> pool; all data is stored as cephfs.
>
> [root@ceph7 ceph-archive]# ceph health
> HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage:
> 14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects
> degraded (0.000%), 14 pgs degraded
>
> "ceph health detail" gives me a handle on which pgs are affected, e.g.:
>
> pg 5.f2f has 2 unfound objects
> pg 5.5c9 has 2 unfound objects
> pg 5.4c1 has 1 unfound objects
>
> and so on, plus more entries of this type:
>
> pg 5.6d is active+recovery_unfound+degraded, acting
> [295,104,57,442,240,338,219,33,150,382], 1 unfound
> pg 5.3fa is active+recovery_unfound+degraded, acting
> [343,147,21,131,315,63,214,365,264,437], 2 unfound
> pg 5.41d is active+recovery_unfound+degraded, acting
> [20,104,190,377,52,141,418,358,240,289], 1 unfound
>
> Digging deeper into one of the bad pgs, we see the oids of the two
> unfound objects:
>
> [root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound
> {
>     "num_missing": 4,
>     "num_unfound": 2,
>     "objects": [
>         {
>             "oid": {
>                 "oid": "1000ba25e49.00000207",
>                 "key": "",
>                 "snapid": -2,
>                 "hash": 854007599,
>                 "max": 0,
>                 "pool": 5,
>                 "namespace": ""
>             },
>             "need": "22541'3088478",
>             "have": "0'0",
>             "flags": "none",
>             "locations": [
>                 "189(8)",
>                 "263(9)"
>             ]
>         },
>         {
>             "oid": {
>                 "oid": "1000bb25a5b.00000091",
>                 "key": "",
>                 "snapid": -2,
>                 "hash": 3637976879,
>                 "max": 0,
>                 "pool": 5,
>                 "namespace": ""
>             },
>             "need": "22541'3088476",
>             "have": "0'0",
>             "flags": "none",
>             "locations": [
>                 "189(8)",
>                 "263(9)"
>             ]
>         }
>     ],
>     "more": false
> }
>
> While it would be nice to recover the data, this cluster is only used
> for storing backups.
>
> As all OSDs are up and running, presumably the data blocks are
> permanently lost?
>
> If it's hard / impossible to recover the data, presumably we should now
> consider using "ceph pg 5.f2f mark_unfound_lost delete" on each
> affected pg?
>
> Finally, can we use the oid to identify the affected files?
>
> best regards,
>
> Jake
>
> --
> Jake Grimmett
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
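
Untested sketch of both approaches, to loop over all affected PGs at
once. It assumes jq is installed on the admin node, that "ceph health
detail" prints lines like "pg 5.f2f has 2 unfound objects" as in your
output, and that your release exposes "might_have_unfound" in the
recovery_state section of "ceph pg query" and "acting_primary" in the
JSON output of "ceph pg map" -- check one PG by hand before looping:

    # bash: PG ids with unfound objects, parsed from "ceph health detail"
    pgs=$(ceph health detail | awk '/has [0-9]+ unfound object/ {print $2}')

    # 1) show which OSDs each PG still wants to probe for the missing
    #    objects; restarting those OSDs may be enough
    for pg in $pgs; do
        echo "=== pg $pg ==="
        ceph pg "$pg" query | \
            jq '.recovery_state[] | select(.might_have_unfound) | .might_have_unfound'
    done

    # 2) or simply mark the acting primary of each PG down, so the PG
    #    re-peers and re-probes all shards for the unfound objects
    for pg in $pgs; do
        primary=$(ceph pg map "$pg" -f json | jq '.acting_primary')
        ceph osd down "$primary"
    done

Marking a primary down should only cause a short re-peering of the PGs
on that OSD, so option 2) is usually the quicker thing to try.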
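
On the last question -- mapping an oid back to a file: for CephFS data
objects the name is "<inode number in hex>.<object index in hex>", so
"1000ba25e49.00000207" belongs to the file with inode 0x1000ba25e49.
Assuming the filesystem is mounted somewhere (the mount path below is a
placeholder), something like this should locate the path, though a full
find over a large tree can take a while:

    # bash: convert the hex inode from the oid and search the mounted cephfs
    ino_hex=1000ba25e49        # part of the oid before the dot
    find /path/to/cephfs/mount -inum "$((16#$ino_hex))"

With the default file layout (4 MiB objects) the index after the dot
also tells you roughly where in the file the lost data sits.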