Hi,
you don't need to stop the OSDs; just query the inconsistent object.
Here's a recent example (from an older cluster, though):
---snip---
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent
admin:~ # ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 7.17a is active+clean+inconsistent, acting [15,2,58,33,28,69]
admin:~ # rados -p cephfs_data list-inconsistent-obj 7.17a | jq
[...]
"shards": [
{
"osd": 2,
"primary": false,
"errors": [],
"size": 2780496,
"omap_digest": "0xffffffff",
"data_digest": "0x11e1764c"
},
{
"osd": 15,
"primary": true,
"errors": [],
"size": 2780496,
"omap_digest": "0xffffffff",
"data_digest": "0x11e1764c"
},
{
"osd": 28,
"primary": false,
"errors": [],
"size": 2780496,
"omap_digest": "0xffffffff",
"data_digest": "0x11e1764c"
},
{
"osd": 33,
"primary": false,
"errors": [
"read_error"
],
"size": 2780496
},
{
"osd": 58,
"primary": false,
"errors": [],
"size": 2780496,
"omap_digest": "0xffffffff",
"data_digest": "0x11e1764c"
},
{
"osd": 69,
"primary": false,
"errors": [],
"size": 2780496,
"omap_digest": "0xffffffff",
"data_digest": "0x11e1764c"
---snip---
Five of the six shards had identical omap_digest and data_digest values,
and only osd.33 reported a read_error, so it was safe to run
'ceph pg repair 7.17a'.
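For reference, the same workflow condensed into a few commands (a minimal
sketch; the pool name cephfs_data and PG id 7.17a are taken from the example
above, and the jq filter is just one way to single out the shards that
report errors):

# 1. Find the inconsistent PG
ceph health detail

# 2. Inspect the object and show only the shards reporting errors
rados -p cephfs_data list-inconsistent-obj 7.17a | \
  jq '.inconsistents[].shards[] | select(.errors != [])'

# 3. If the remaining copies agree (matching data_digest/omap_digest),
#    let Ceph repair the PG from a healthy copy
ceph pg repair 7.17a

# 4. Optionally verify afterwards with a deep scrub
ceph pg deep-scrub 7.17a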
Regards,
Eugen
Quoting E Taka <0etaka0@xxxxxxxxx>:
(17.2.4, 3 replicated, Container install)
Hello,
since much of the information found on the Web or in books is outdated, I want
to ask which procedure is recommended for repairing a damaged PG with status
active+clean+inconsistent on Ceph Quincy.
IMHO, the best process for a pool with 3 replicas would be to check whether
two of the replicas are identical and replace the third, differing one.
If I understand it correctly, ceph-objectstore-tool could be used for this
approach, but unfortunately it is difficult even to start, especially in a
Docker environment. (The OSD has to be marked "down", and the Ubuntu package
ceph-osd, which includes ceph-objectstore-tool, starts server processes
that confuse the dockerized environment.)
Is “ceph pg repair” safe to use, and is there a risk in enabling
osd_scrub_auto_repair and osd_repair_during_recovery?
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx