Hi Pascal,

We just had a similar situation on our RBD and had found some bad data in
RADOS. Here is how we did it:

for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); do
    rados list-inconsistent-obj $i | jq -er .inconsistents[].object.name | awk -F'.' '{print $2}'
done

We then found inconsistent snaps on the object:

rados list-inconsistent-snapset $PG --format=json-pretty | jq .inconsistents[].name

List the data on the OSDs (ceph pg map $PG):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD}/ --op list ${OBJ} --pgid ${PG}

and finally remove the object, like this (the first command prints the JSON id
that the second one takes):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-459/ --op list rbd_data.762a94d768c04d.000000000036b7ac --pgid 2.704

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-459/ '["2.704",{"oid":"rbd_data.801e1d1d9c719d.0000000000044943","key":"","snapid":125458,"hash":4136961796,"max":0,"pool":2,"namespace":"","max":0}]' remove

We had to do it for all OSDs, one after the other (ceph-objectstore-tool only
works while the OSD in question is stopped). After this a 'pg repair' worked.

I hope it will help,
Ansgar
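PS: In case it is useful, here is roughly the same procedure rolled into one
script. This is only an untested sketch of what we did by hand: the pool name
is taken as $1, the object names and $OSD are placeholders, and because the
removal has to happen on each OSD host with that OSD stopped, the script only
prints the ceph-objectstore-tool commands instead of running them.

#!/bin/bash
# Sketch only: enumerate inconsistent PGs/objects for one pool and print the
# ceph-objectstore-tool commands to run by hand on each OSD of the acting set
# (with that OSD stopped). Usage: ./find-inconsistent.sh <pool>
set -u

POOL=$1

for PG in $(rados list-inconsistent-pg "$POOL" | jq -er '.[]'); do
    echo "== PG $PG =="
    ceph pg map "$PG"    # shows the up/acting OSDs for this PG

    # Objects flagged inconsistent in this PG
    rados list-inconsistent-obj "$PG" --format=json | \
        jq -r '.inconsistents[].object.name' | sort -u | while read -r OBJ; do
        echo "object: $OBJ"
        echo "  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-\$OSD --op list $OBJ --pgid $PG"
        echo "  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-\$OSD '<JSON from --op list>' remove"
    done

    # Inconsistent snapshots, as with the snapset check above
    rados list-inconsistent-snapset "$PG" --format=json-pretty | jq '.inconsistents[].name'
done

# afterwards, once every OSD has been cleaned up:
#   ceph pg repair $PG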
On Thu, Jun 23, 2022 at 3:02 PM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> Hi Pascal,
>
> It's not clear to me how the upgrade procedure you described would
> lead to inconsistent PGs.
>
> Even if you didn't record every step, do you have the ceph.log, the
> mds logs, perhaps some osd logs from this time?
> And which versions did you upgrade from / to ?
>
> Cheers, Dan
>
> On Wed, Jun 22, 2022 at 7:41 PM Pascal Ehlert <pascal@xxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > I am currently battling inconsistent PGs after a far-reaching mistake
> > during the upgrade from Octopus to Pacific.
> > While otherwise following the guide, I restarted the Ceph MDS daemons
> > (and this started the Pacific daemons) without previously reducing the
> > ranks to 1 (from 2).
> >
> > This resulted in daemons not coming up and reporting inconsistencies.
> > After later reducing the ranks and bringing the MDS back up (I did not
> > record every step as this was an emergency situation), we started seeing
> > health errors on every scrub.
> >
> > Now after three weeks, while our CephFS is still working fine and we
> > haven't noticed any data damage, we realized that every single PG of the
> > cephfs metadata pool is affected.
> > Below you can find some information on the actual status and a detailed
> > inspection of one of the affected pgs. I am happy to provide any other
> > information that could be useful of course.
> >
> > A repair of the affected PGs does not resolve the issue.
> > Does anyone else here have an idea what we could try apart from copying
> > all the data to a new CephFS pool?
> >
> > Thank you!
> >
> > Pascal
> >
> > root@srv02:~# ceph status
> >   cluster:
> >     id:     f0d6d4d0-8c17-471a-9f95-ebc80f1fee78
> >     health: HEALTH_ERR
> >             insufficient standby MDS daemons available
> >             69262 scrub errors
> >             Too many repaired reads on 2 OSDs
> >             Possible data damage: 64 pgs inconsistent
> >
> >   services:
> >     mon: 3 daemons, quorum srv02,srv03,srv01 (age 3w)
> >     mgr: srv03(active, since 3w), standbys: srv01, srv02
> >     mds: 2/2 daemons up, 1 hot standby
> >     osd: 44 osds: 44 up (since 3w), 44 in (since 10M)
> >
> >   data:
> >     volumes: 2/2 healthy
> >     pools:   13 pools, 1217 pgs
> >     objects: 75.72M objects, 26 TiB
> >     usage:   80 TiB used, 42 TiB / 122 TiB avail
> >     pgs:     1153 active+clean
> >              55   active+clean+inconsistent
> >              9    active+clean+inconsistent+failed_repair
> >
> >   io:
> >     client: 2.0 MiB/s rd, 21 MiB/s wr, 240 op/s rd, 1.75k op/s wr
> >
> > {
> >     "epoch": 4962617,
> >     "inconsistents": [
> >         {
> >             "object": {
> >                 "name": "1000000cc8e.00000000",
> >                 "nspace": "",
> >                 "locator": "",
> >                 "snap": 1,
> >                 "version": 4253817
> >             },
> >             "errors": [],
> >             "union_shard_errors": [
> >                 "omap_digest_mismatch_info"
> >             ],
> >             "selected_object_info": {
> >                 "oid": {
> >                     "oid": "1000000cc8e.00000000",
> >                     "key": "",
> >                     "snapid": 1,
> >                     "hash": 1369745244,
> >                     "max": 0,
> >                     "pool": 7,
> >                     "namespace": ""
> >                 },
> >                 "version": "4962847'6209730",
> >                 "prior_version": "3916665'4306116",
> >                 "last_reqid": "osd.27.0:757107407",
> >                 "user_version": 4253817,
> >                 "size": 0,
> >                 "mtime": "2022-02-26T12:56:55.612420+0100",
> >                 "local_mtime": "2022-02-26T12:56:55.614429+0100",
> >                 "lost": 0,
> >                 "flags": [
> >                     "dirty",
> >                     "omap",
> >                     "data_digest",
> >                     "omap_digest"
> >                 ],
> >                 "truncate_seq": 0,
> >                 "truncate_size": 0,
> >                 "data_digest": "0xffffffff",
> >                 "omap_digest": "0xe5211a9e",
> >                 "expected_object_size": 0,
> >                 "expected_write_size": 0,
> >                 "alloc_hint_flags": 0,
> >                 "manifest": {
> >                     "type": 0
> >                 },
> >                 "watchers": {}
> >             },
> >             "shards": [
> >                 {
> >                     "osd": 20,
> >                     "primary": false,
> >                     "errors": [
> >                         "omap_digest_mismatch_info"
> >                     ],
> >                     "size": 0,
> >                     "omap_digest": "0xffffffff",
> >                     "data_digest": "0xffffffff"
> >                 },
> >                 {
> >                     "osd": 27,
> >                     "primary": true,
> >                     "errors": [
> >                         "omap_digest_mismatch_info"
> >                     ],
> >                     "size": 0,
> >                     "omap_digest": "0xffffffff",
> >                     "data_digest": "0xffffffff"
> >                 },
> >                 {
> >                     "osd": 43,
> >                     "primary": false,
> >                     "errors": [
> >                         "omap_digest_mismatch_info"
> >                     ],
> >                     "size": 0,
> >                     "omap_digest": "0xffffffff",
> >                     "data_digest": "0xffffffff"
> >                 }
> >             ]
> >         },

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx