Hello,

I migrated my cluster from jewel to luminous 3 weeks ago (using the ceph-ansible playbook). A few days later, ceph status reported "PG_DAMAGED Possible data damage: 1 pg inconsistent". I tried to repair the PG, without success. I then stopped the OSD, flushed the journal and tried to restart the OSD, but it refused to start due to a bad journal, so I destroyed the OSD and recreated it from scratch. (The commands I used are sketched at the end of this mail.) After that everything seemed to be all right, but I just noticed that I now have exactly the same error again, on the same PG and the same OSD (78).

> $ ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 11.5f is active+clean+inconsistent, acting [78,154,170]

> $ ceph -s
>   cluster:
>     id:     f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>     health: HEALTH_ERR
>             3 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 3 daemons, quorum iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>     mgr: iccluster001(active), standbys: iccluster009, iccluster017
>     mds: cephfs-3/3/3 up {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
>     osd: 180 osds: 180 up, 180 in
>     rgw: 6 daemons active
>
>   data:
>     pools:   29 pools, 10432 pgs
>     objects: 82862k objects, 171 TB
>     usage:   515 TB used, 465 TB / 980 TB avail
>     pgs:     10425 active+clean
>              6     active+clean+scrubbing+deep
>              1     active+clean+inconsistent
>
>   io:
>     client:   21538 B/s wr, 0 op/s rd, 33 op/s wr

The cluster runs:

> ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Short log:

> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 11.5f repair starts
> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 78: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd ffffffff od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 154: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd ffffffff od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 170: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd ffffffff od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: failed to pick suitable auth object
> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f repair 3 errors, 0 fixed

I set "debug_osd 20/20" on osd.78 and started the repair again; the full log file is available via:

ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
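In case it is useful, this is roughly the sequence of commands I ran (reconstructed from memory, so treat the exact invocations as approximate; the journal flush is the FileStore-style call, which should apply here since these OSDs were deployed under jewel):

> # attempt an automatic repair of the inconsistent PG
> $ ceph pg repair 11.5f
>
> # stop the primary OSD, flush its journal, restart it
> # (this is the step that failed with a bad journal)
> $ systemctl stop ceph-osd@78
> $ ceph-osd -i 78 --flush-journal
> $ systemctl start ceph-osd@78
>
> # raise the OSD debug level before re-running the repair
> $ ceph tell osd.78 injectargs '--debug_osd 20/20'

I have not yet dumped the scrub details with "rados list-inconsistent-obj 11.5f --format=json-pretty"; I can post that output if it would help. If I read the repair log correctly, all three shards (78, 154, 170) agree on omap_digest 0x29fdd712, and it is only the digest recorded in the object info (0xd46bb5a1) that differs, which presumably is why the repair "failed to pick suitable auth object".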
What can I do in this situation?

Thanks for your help.

--
Yoann Moulin
EPFL IC-IT