Difficulty with fixing an inconsistent PG/object

Hello Ceph community,


I hope you can help me with an issue we are experiencing on our backup cluster.

The Ceph version we are running here is 10.2.10 (Jewel), and we are using Filestore.
The PG is part of a replicated pool with size=2.
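
For reference, the pool settings and the PG's current acting set can be double-checked with the standard ceph CLI (shown as a sketch only; output omitted):

```
# Show pool 37's settings (replicated, size 2, crush rule, ...)
ceph osd dump | grep '^pool 37 '

# Confirm which OSDs currently hold PG 37.189
ceph pg map 37.189
```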


We are getting the following error:
```
root@cephmon0:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 37.189 is active+clean+inconsistent, acting [144,170]
2 scrub errors
```
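
The same PG should also show up when listing inconsistent PGs per pool (the pool name below is a placeholder, since I have left it out of this mail):

```
# List PGs with recorded scrub inconsistencies in this pool (<pool-name> is a placeholder)
rados list-inconsistent-pg <pool-name>
```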

```
root@cephmon0:~# grep 37.189 /var/log/ceph/ceph.log
2022-06-29 11:11:27.782920 osd.144 10.129.160.22:6800/2810 7598 : cluster [INF] osd.144 pg 37.189 Deep scrub errors, upgrading scrub to deep-scrub
2022-06-29 11:11:27.884628 osd.144 10.129.160.22:6800/2810 7599 : cluster [INF] 37.189 deep-scrub starts
2022-06-29 11:13:07.124841 osd.144 10.129.160.22:6800/2810 7600 : cluster [ERR] 37.189 shard 144: soid 37:9193d307:::isqPpJMKYY4.000000000000001e:head data_digest 0x50007bd9 != data_digest 0x885fabcc from auth oi 37:9193d307:::isqPpJMKYY4.000000000000001e:head(7211'173457 osd.71.0:397191 dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od ffffffff alloc_hint [0 0])
2022-06-29 11:13:07.124849 osd.144 10.129.160.22:6800/2810 7601 : cluster [ERR] 37.189 shard 170: soid 37:9193d307:::isqPpJMKYY4.000000000000001e:head data_digest 0x50007bd9 != data_digest 0x885fabcc from auth oi 37:9193d307:::isqPpJMKYY4.000000000000001e:head(7211'173457 osd.71.0:397191 dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od ffffffff alloc_hint [0 0])
2022-06-29 11:13:07.124853 osd.144 10.129.160.22:6800/2810 7602 : cluster [ERR] 37.189 soid 37:9193d307:::isqPpJMKYY4.000000000000001e:head: failed to pick suitable auth object
2022-06-29 11:20:46.459906 osd.144 10.129.160.22:6800/2810 7603 : cluster [ERR] 37.189 deep-scrub 2 errors
```

The PG has already been transferred from two other OSDs. That is, the same error occurred when the PG was stored on two different OSDs, so it does not appear to be a disk issue. There seems to be something wrong with the object "isqPpJMKYY4.000000000000001e".
However, the md5sum of the object is identical on both OSDs.


```
root@ceph12:/var/lib/ceph/osd/ceph-144/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C# ls -l isqPpJMKYY4.000000000000001e__head_E0CBC989__25
-rw-r--r-- 1 ceph ceph 4194304 Jun  3 09:56 isqPpJMKYY4.000000000000001e__head_E0CBC989__25

root@ceph12:/var/lib/ceph/osd/ceph-144/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C# md5sum isqPpJMKYY4.000000000000001e__head_E0CBC989__25
96d702072cd441f2d0af60783e8db248  isqPpJMKYY4.000000000000001e__head_E0CBC989__25
```

```
root@ceph15:/var/lib/ceph/osd/ceph-170/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C# ls -l isqPpJMKYY4.000000000000001e__head_E0CBC989__25
-rw-r--r-- 1 ceph ceph 4194304 Jun 23 16:41 isqPpJMKYY4.000000000000001e__head_E0CBC989__25

root@ceph15:/var/lib/ceph/osd/ceph-170/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C# md5sum isqPpJMKYY4.000000000000001e__head_E0CBC989__25
96d702072cd441f2d0af60783e8db248  isqPpJMKYY4.000000000000001e__head_E0CBC989__25
```
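
In case it is relevant: as far as I understand, the scrub compares the crc32c data digest recorded in the object info, not an md5 of the file. To look at that stored digest directly, something like the following could be used with ceph-objectstore-tool (a sketch only; the OSD has to be stopped first, and the paths/unit names are assumptions based on our layout):

```
# Stop the OSD before using ceph-objectstore-tool on it (assumes systemd units)
ceph osd set noout
systemctl stop ceph-osd@144

# Locate the object within the PG; this prints a JSON object spec
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-144 \
    --journal-path /var/lib/ceph/osd/ceph-144/journal \
    --pgid 37.189 --op list isqPpJMKYY4.000000000000001e

# Dump the object info (including the stored data_digest) using that JSON spec
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-144 \
    --journal-path /var/lib/ceph/osd/ceph-144/journal \
    '<json-from-list-output>' dump

systemctl start ceph-osd@144
ceph osd unset noout
```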

```
root@cephmon0:~# rados list-inconsistent-obj 37.189 --format=json-pretty
{
    "epoch": 167653,
    "inconsistents": [
        {
            "object": {
                "name": "isqPpJMKYY4.000000000000001e",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 39699
            },
            "errors": [],
            "union_shard_errors": [
                "data_digest_mismatch_oi"
            ],
            "selected_object_info": "37:9193d307:::isqPpJMKYY4.000000000000001e:head(7211'173457 osd.71.0:397191 dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od ffffffff alloc_hint [0 0])",
            "shards": [
                {
                    "osd": 144,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x50007bd9"
                },
                {
                    "osd": 170,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x50007bd9"
                }
            ]
        }
    ]
}
```
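
Reading the output side by side: both shards report data_digest 0x50007bd9, while the selected object info records dd 885fabcc. If it helps, the relevant fields can be pulled out with jq (assuming jq is installed):

```
# Compare the digest each shard computed with the digest stored in the object info
rados list-inconsistent-obj 37.189 --format=json | \
    jq '.inconsistents[] | {oi: .selected_object_info, shards: [.shards[] | {osd, data_digest}]}'
```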

I don't understand why there is a "data_digest_mismatch_oi" error, since the checksums appear to match.

Does anyone have any idea on how to fix this?
Your input would be very much appreciated. Please let me know if you need additional info.
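
In case it helps the discussion, these are the commands we would normally try next; we have held off so far because we are not sure a repair can do anything useful when both shards agree with each other and only disagree with the object info:

```
# Re-run a deep scrub on the PG to reconfirm the current state
ceph pg deep-scrub 37.189

# Ask the primary to repair the PG (unsure whether this is safe/useful for data_digest_mismatch_oi)
ceph pg repair 37.189
```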

Thank you.

Best regards,
Lennart van Gijtenbeek
