Re: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent


 



Just in case, maybe this blog post contains some useful hints:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

It's about a rather old Ceph version, but the object-level operations might still be relevant. Note that it requires at least one OSD to have a valid copy of the object.

You should try to find out which file/image this object belongs to from the user's perspective. If you have a backup/snapshot, you could mark the object as lost and restore a copy of the file/image from backup/snapshot. That's what others did in this situation.
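If it does come to that, the mark-lost step looks roughly like this (a hedged sketch only; the PG id 32.15c is taken from your health output, and the choice between "revert" and "delete" depends on whether an older version of the object is acceptable):

```shell
# Sketch only - run this after all recovery attempts have failed.
pg="32.15c"

# List the unfound objects first to confirm what will be affected:
#   ceph pg "$pg" list_unfound
#
# Then mark them lost. "revert" rolls back to a previous version if one
# exists; "delete" forgets the object entirely:
#   ceph pg "$pg" mark_unfound_lost revert
#   ceph pg "$pg" mark_unfound_lost delete
echo "ceph pg $pg mark_unfound_lost revert"
```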

You will need to search this list for how to find that information. I believe there was something with ceph-dencoder and low-level rados commands. Search for recovery_unfound and "unfound object"; there should be many posts.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, June 26, 2023 12:18 PM
To: Jorge JP; Stefan Kooman; ceph-users@xxxxxxx
Subject:  Re: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent

Hi Jorge,

neither do I. You will need to wait for help on the list or try to figure something out with the docs.

Please be patient; mark_unfound_lost is only needed once everything else has been tried and failed. Until then, clients that don't access the broken object should work fine.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jorge JP <jorgejp@xxxxxxxxxx>
Sent: Monday, June 26, 2023 11:56 AM
To: Frank Schilder; Stefan Kooman; ceph-users@xxxxxxx
Subject: RE:  Re: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent

Hello Frank,

Thank you. I ran the following command: ceph pg 32.15c list_unfound

I located the object, but I don't know how to solve this problem.

{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.aedf52e8a44410.000000000000021f",
                "key": "",
                "snapid": -2,
                "hash": 358991196,
                "max": 0,
                "pool": 32,
                "namespace": ""
            },
            "need": "49128'125646582",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": []
        }
    ],
    "more": false
}

Thank you.
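The oid above encodes which RBD image it belongs to: rbd_data.<id>.<index> objects carry the image's block_name_prefix. A hedged sketch of the mapping (the pool and image names in the comments are placeholders; compare the extracted prefix against the block_name_prefix shown by "rbd info" for each image in pool 32):

```shell
# Extract the image prefix from the unfound object's name.
oid="rbd_data.aedf52e8a44410.000000000000021f"
prefix="${oid%.*}"   # strips the trailing block index
echo "$prefix"       # rbd_data.aedf52e8a44410

# Then match it against each image's block_name_prefix, e.g.:
#   for img in $(rbd ls <pool>); do
#     rbd info <pool>/"$img" | grep -q "$prefix" && echo "$img"
#   done
```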

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, June 26, 2023 11:43 AM
To: Jorge JP <jorgejp@xxxxxxxxxx>; Stefan Kooman <stefan@xxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re:  Re: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent

I don't think pg repair will work. It looks like a size 2, min_size 1 replicated pool where each OSD seems to have accepted writes while the other was down, and now the PG can't decide which is the true latest version.

Using size 2, min_size 1 comes with manual labor. As far as I can tell, you will need to figure out which files/objects are affected and either update the missing copy or delete the object manually.
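One way to inspect the copies directly is ceph-objectstore-tool on each OSD that hosts the PG (a hedged sketch; it assumes the default package-install data path, and the OSD must be stopped before the tool touches its store):

```shell
# Sketch only - stop the OSD before running ceph-objectstore-tool against it.
osd=49
pg="32.15c"

#   systemctl stop ceph-osd@$osd
#   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$osd \
#       --pgid $pg --op list
# lists the objects this OSD holds for the PG; a copy found here can be
# exported (--op export) and examined before deciding what to keep.
echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$osd --pgid $pg --op list"
```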

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jorge JP <jorgejp@xxxxxxxxxx>
Sent: Monday, June 26, 2023 11:34 AM
To: Stefan Kooman; ceph-users@xxxxxxx
Subject:  Re: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent

Hello Stefan,

I ran this command yesterday, but the status hasn't changed. Other PGs with status "inconsistent" were repaired after a day, but in this case it doesn't work.

instructing pg 32.15c on osd.49 to repair

Normally the PG would change to a repairing state, but it hasn't.
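To see what the scrub actually flagged, rather than waiting for the repair state, the inconsistency details can be dumped directly (hedged sketch):

```shell
pg="32.15c"
# rados list-inconsistent-obj "$pg" --format=json-pretty
# shows, per object, which shard/OSD disagrees and why (read error, size
# mismatch, omap digest, ...). "ceph pg $pg query" shows whether the repair
# was scheduled or is blocked behind the unfound object.
echo "rados list-inconsistent-obj $pg --format=json-pretty"
```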

________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: Monday, June 26, 2023 11:27 AM
To: Jorge JP <jorgejp@xxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re:  Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent

On 6/26/23 08:38, Jorge JP wrote:
> Hello,
>
> After deep-scrub my cluster shown this error:
>
> HEALTH_ERR 1/38578006 objects unfound (0.000%); 1 scrub errors; Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent; Degraded data redundancy: 2/77158878 objects degraded (0.000%), 1 pg degraded
> [WRN] OBJECT_UNFOUND: 1/38578006 objects unfound (0.000%)
>      pg 32.15c has 1 unfound objects
> [ERR] OSD_SCRUB_ERRORS: 1 scrub errors
> [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 1 pg inconsistent
>      pg 32.15c is active+recovery_unfound+degraded+inconsistent, acting [49,47], 1 unfound
> [WRN] PG_DEGRADED: Degraded data redundancy: 2/77158878 objects degraded (0.000%), 1 pg degraded
>      pg 32.15c is active+recovery_unfound+degraded+inconsistent, acting [49,47], 1 unfound
>
>
> I have been searching the internet for how to solve this, but I'm confused.
>
> Anyone can help me?

Does "ceph pg repair 32.15c" work for you?

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



