Followup. This is what's written in the logs when I try to fix one PG:

ceph pg repair 3.60

primary osd log:
2021-06-25 01:07:32.146 7fc006339700 -1 log_channel(cluster) log [ERR] : repair 3.53 3:cb4336ff:::rbd_data.e2d302dd699130.00000000000069b3:6aa5 : is an unexpected clone
2021-06-25 01:07:32.146 7fc006339700 -1 osd.6 pg_epoch: 210926 pg[3.53( v 210926'64271902 (210920'64268839,210926'64271902] local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f 210883/210883/5620 210811/210882/210882) [6,22,12] r=0 lpr=210882 luod=210926'64271899 crt=210926'64271902 lcod 210926'64271898 mlcod 210926'64271898 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no clone_snaps for 3:cb4336ff:::rbd_data.e2d302dd699130.00000000000069b3:6aa5 in 6aa5=[6aa5]:{}

secondary osd 1:
2021-06-25 01:07:31.934 7f9eae8fa700 -1 osd.22 pg_epoch: 210926 pg[3.53( v 210926'64271899 (210920'64268839,210926'64271899] local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f 210883/210883/5620 210811/210882/210882) [6,22,12] r=1 lpr=210882 luod=0'0 lua=210881'64265352 crt=210926'64271899 lcod 210926'64271898 active+inconsistent mbc={}] _scan_snaps no clone_snaps for 3:cb4336ff:::rbd_data.e2d302dd699130.00000000000069b3:6aa5 in 6aa5=[6aa5]:{}

secondary osd 2:
2021-06-25 01:07:30.828 7f94d6e61700 -1 osd.12 pg_epoch: 210926 pg[3.53( v 210926'64271899 (210920'64268839,210926'64271899] local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f 210883/210883/5620 210811/210882/210882) [6,22,12] r=2 lpr=210882 luod=0'0 lua=210881'64265352 crt=210926'64271899 lcod 210926'64271898 active+inconsistent mbc={}] _scan_snaps no clone_snaps for 3:cb4336ff:::rbd_data.e2d302dd699130.00000000000069b3:6aa5 in 6aa5=[6aa5]:{}

And nothing happens - it's still in a failed_repair state.

Fri, 25 Jun 2021 at 00:36, Vladimir Prokofev <v@xxxxxxxxxxx>:

> Hello.
>
> Today we've experienced a complete CEPH cluster outage - total loss of
> power in the whole infrastructure.
> 6 osd nodes and 3 monitors went down at the same time. CEPH 14.2.10.
>
> This resulted in unfound objects, which were "reverted" in a hurry with
> ceph pg <pg_id> mark_unfound_lost revert
> In retrospect that was probably a mistake, as the "have" part stated 0'0.
>
> But then deep-scrubs started and they found inconsistent PGs. We tried
> repairing them, but they just switched to failed_repair.
>
> Here's a log example:
> 2021-06-25 00:08:07.693645 osd.0 [ERR] 3.c shard 6 3:3163e703:::rbd_data.be08c566ef438d.0000000000002445:head : missing
> 2021-06-25 00:08:07.693710 osd.0 [ERR] repair 3.c 3:3163e2ee:::rbd_data.efa86358d15f4a.000000000000004b:6ab1 : is an unexpected clone
> 2021-06-25 00:11:55.128951 osd.0 [ERR] 3.c repair 1 missing, 0 inconsistent objects
> 2021-06-25 00:11:55.128969 osd.0 [ERR] 3.c repair 2 errors, 1 fixed
>
> I tried manually deleting the conflicting objects from the secondary OSDs
> with ceph-objectstore-tool, like this:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c rbd_data.efa86358d15f4a.000000000000004b:6ab1 remove
> It removes the object, but without any positive impact. I'm pretty sure I
> don't understand the concept.
>
> So currently I have the following thoughts:
> - is there any doc on the object placement specifics and what all of the
> numbers in their names mean? I've seen objects with a similar prefix/mid
> but a different suffix, and I have no idea what that means;
> - I'm actually not sure what the production impact is at this point,
> because everything seems to work so far.
> So I'm wondering: is it possible to kill the replicas on the secondary
> OSDs with ceph-objectstore-tool and just let CEPH create a replica from
> the primary PG?
>
> I have 8 scrub errors and 4 inconsistent+failed_repair PGs, and I'm afraid
> that further deep scrubs will reveal more errors.
> Any thoughts appreciated.
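
P.S. To make that last point a bit more concrete, the rough sequence I have
in mind for one secondary replica is below. This is only a sketch - it
assumes systemd OSD units, reuses the object name from the repair log above,
and I honestly don't know yet whether it's safe to do:

ceph osd set noout                   # avoid rebalancing while the OSD is down
systemctl stop ceph-osd@22           # ceph-objectstore-tool needs the OSD stopped
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c rbd_data.efa86358d15f4a.000000000000004b:6ab1 remove
systemctl start ceph-osd@22
ceph osd unset noout
ceph pg repair 3.c                   # re-run repair and see if the PG goes clean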
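
P.P.S. If the detailed inconsistency listing would help, I can post the
output of the rados commands for the affected PGs - assuming these are
still the right way to dump scrub error details on 14.2:

rados list-inconsistent-obj 3.c --format=json-pretty
rados list-inconsistent-snapset 3.c --format=json-pretty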