Hi all,

The cluster here is running v14.2.20 and is used for RBD images.

We have a PG in recovery_unfound state and, since this is the first time we've had this occur, we wanted to get your advice on the best course of action.

PG 4.1904 went into state active+recovery_unfound+degraded+repair [1] during normal scrubbing (but note that we have `osd scrub auto repair = true`):

2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error

The scrub detected a read error on the primary of this PG, and tried to repair it by reading from the other two OSDs:

Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531

But it seems that the other two OSDs could not repair this failed read on the primary, because they don't have the expected version of the object:

2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing, 19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)

I don't understand how the versions of the objects would get out of sync -- there have been no other recent failures on these disks, AFAICT. So my best guess is that the IO error on osd.951 confused the repair process: osd.951 tried to recover a non-latest version of the object. (This would imply that the object versions on osds 14 and 344 are in fact the correct, newest versions.)

We have a few ideas how to fix this:

* osd.951 is sick, so drain it by setting `ceph osd primary-affinity 951 0` and `ceph osd out 951`.
* osd.951 is really sick, so just stop it now and backfill its PGs to other OSDs.
* Don't stop osd.951 yet: restart all three relevant OSDs and see if that fixes the object versions.
* Don't drain osd.951 yet: make osd.14 or osd.344 the primary for this PG (e.g. `ceph osd primary-affinity 951 0`), then run `ceph pg repair 4.1904` so that the version from osds 14/344 can be pushed. (A rough command sequence is sketched below the list.)
* Use mark_unfound_lost revert, or delete (and tell the user to fsck their image).

Does anyone have some recent experience or advice on this issue?
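For reference, here is roughly the sequence behind the last two options, plus a couple of standard checks (`ceph pg ... list_unfound` and `rados list-inconsistent-obj`) that aren't mentioned above. This is just a sketch using the ids from this thread; we'd re-check the PG state between steps:

# see which object is unfound and what each shard reports
ceph pg 4.1904 list_unfound
rados list-inconsistent-obj 4.1904 --format=json-pretty

# demote the sick primary so osd.14 or osd.344 leads the PG, then retry the repair
ceph osd primary-affinity 951 0
ceph pg repair 4.1904

# last resort only, if the copies on 14/344 really can't be pushed:
# roll the object back to its previous version (or delete it)
ceph pg 4.1904 mark_unfound_lost revert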
Best Regards,

Dan

[1]
# ceph pg 4.1904 query
{
    "state": "active+recovery_unfound+degraded+repair",
    "snap_trimq": "[1c7fd~1,1c7ff~1,1c801~1,1c803~1,1c805~1]",
    "snap_trimq_len": 5,
    "epoch": 3593586,
    "up": [
        951,
        344,
        14
    ],
    "acting": [
        951,
        344,
        14
    ],
    "acting_recovery_backfill": [
        "14",
        "344",
        "951"
    ],
...