PG_DAMAGED: Possible data damage: 4 pgs recovery_unfound

Hi everyone,

It seems like I hit Bug #44286, "Cache tiering shows unfound objects after
OSD reboots" <https://tracker.ceph.com/issues/44286>.

I stopped some OSDs to compact the RocksDB on them; noout was set during
this time.
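Roughly, the steps looked something like the following sketch (OSD 42 is
just an example ID, and the exact compaction commands may have differed;
offline compaction via ceph-kvstore-tool is shown here):

ceph osd set noout                           # keep data in place while OSDs are down
systemctl stop ceph-osd@42                   # stop one OSD at a time
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact   # offline RocksDB compaction
systemctl start ceph-osd@42
ceph osd unset noout                         # after all OSDs were done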
Soon after that I got:

[ERR] PG_DAMAGED: Possible data damage: 4 pgs recovery_unfound
    pg 8.8 is active+recovery_unfound+degraded, acting [42,43,39], 1 unfound
    pg 8.14 is active+recovery_unfound+degraded, acting [43,40,42], 1 unfound
    pg 8.3b is active+recovery_unfound+degraded, acting [36,40,43], 1 unfound
    pg 8.50 is active+recovery_unfound+degraded, acting [39,38,36], 1 unfound

ceph pg 8.8 list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "hit_set_8.8_archive_2022-08-12
12:12:06.515941Z_2022-08-12 12:18:16.186156Z",
                "key": "",
                "snapid": -2,
                "hash": 8,
                "max": 0,
                "pool": 8,
                "namespace": ".ceph-internal"
            },
            "need": "118438'7610615",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object:
1",
            "locations": []
        }
    ],
    "state": "NotRecovering",
    "available_might_have_unfound": true,
    "might_have_unfound": [],
    "more": false
}

The other missing objects look the same, and the oid is always hit_set_*,
so I guess no actual data is affected. The question is how to get rid of
the error.
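For reference, a quick way to confirm that all four PGs only have hit set
objects unfound (assuming jq is available; the field path matches the
list_unfound output above):

for pg in 8.8 8.14 8.3b 8.50; do
    ceph pg $pg list_unfound | jq -r '.objects[].oid.oid'
done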

This is a cache pool with 3x replication in front of a CephFS data pool
with EC 6+2; the affected objects are the hit set objects from the cache
pool. Everything seems to work so far, but the cluster is stuck in
"HEALTH_ERR: Possible data damage: 4 pgs recovery_unfound".
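For context, the tier is set up in the usual cache-tiering way, roughly
like this (pool names and cache mode here are placeholders, not the real
settings):

ceph osd tier add cephfs_data cache-pool          # cache-pool is the 3x replicated pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay cephfs_data cache-pool
ceph osd pool set cache-pool hit_set_type bloom   # these bloom hit sets are the unfound objects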

I could not get the PGs to deep scrub to find the missing objects. It also
did not work when I disabled scrubbing on all OSDs except the affected
ones. Repairing the PGs does not start either, since repair is a scrub
operation as well. They are just queued for deep scrub, but nothing
happens.
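The attempt to disable scrubbing on all but the affected OSDs was roughly
the following (the exact mechanism may have differed; 36, 38, 39, 40, 42
and 43 are the OSDs from the acting sets above):

ceph tell 'osd.*' injectargs '--osd_max_scrubs 0'        # disable scrubs everywhere
for osd in 36 38 39 40 42 43; do
    ceph tell osd.$osd injectargs '--osd_max_scrubs 1'   # keep scrubbing possible on affected OSDs
done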

I did try
ceph pg deep-scrub 8.8
ceph pg repair 8.8
I also tried to set one of the primary OSDs out, but the affected PG
stayed on that OSD.
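That attempt was essentially the following (osd.42 as the example, the
primary of pg 8.8 according to the acting set above):

ceph osd out 42      # mark the primary out
ceph pg map 8.8      # the PG still showed osd.42 in the acting set
ceph osd in 42       # reverted afterwards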

What's the best course of action to get the cluster back to a healthy state?

Should I run

ceph pg 8.8 mark_unfound_lost revert
or
ceph pg 8.8 mark_unfound_lost delete

Or is there another way?
Would the cache pool still work after that?
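If one of those is the way to go, my plan would be to verify afterwards
with something like:

ceph pg 8.8 mark_unfound_lost revert     # or delete, depending on the advice
ceph health detail                       # the recovery_unfound error should clear
ceph pg 8.8 list_unfound                 # should report num_unfound: 0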

Thanks,
Eric
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


