Hi,

I have a 3x-replicated pool with Ceph 12.2.7. One HDD broke, its OSD "2" was automatically marked "out", the disk was physically replaced with a new one, and the new OSD was added back in. Now `ceph health detail` permanently shows:

    [ERR] OSD_SCRUB_ERRORS: 1 scrub errors
    [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
        pg 2.87 is active+clean+inconsistent, acting [33,2,20]

What exactly is wrong here, and why can Ceph not fix the issue itself? With BlueStore I have checksums on the two unbroken disks, so what remaining inconsistency can there be?

The command suggested in https://docs.ceph.com/en/pacific/rados/operations/pg-repair/#commands-for-diagnosing-pg-problems does not work:

    # rados list-inconsistent-obj 2.87
    No scrub information available for pg 2.87
    error 2: (2) No such file or directory
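My guess is that the primary simply has no recent scrub results for this PG to report, and that a fresh deep scrub would regenerate them, roughly like this (I have not tried it yet, so please correct me if this is the wrong first step):

    # ceph pg deep-scrub 2.87
    # ... wait for the deep scrub of 2.87 to finish, then:
    # rados list-inconsistent-obj 2.87 --format=json-pretty

Is that the intended way to get the inconsistency details back?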
Further, I find the documentation at https://docs.ceph.com/en/pacific/rados/operations/pg-repair/#more-information-on-pg-repair extremely unclear. It says

    In the case of replicated pools, recovery is beyond the scope of pg repair.

while many people on the Internet suggest that `ceph pg repair` might fix the issue, and still others claim that Ceph will eventually fix it by itself. I am hesitant to run `ceph pg repair` without understanding what the problem actually is and what exactly that command will do.
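If repair does turn out to be the right answer, I assume the invocation would simply be

    # ceph pg repair 2.87

but before running it I would like to know how it decides which copy is authoritative in a replicated pool, and whether it can make things worse if it picks the wrong one.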
I have already reported the "error 2" failure and the unclear documentation in issue https://tracker.ceph.com/issues/61739, but have not received a reply yet, and my cluster remains "inconsistent".

How can this be fixed? I would appreciate any help!

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx