Hi again,

We haven't taken any actions yet, but this seems like it might be a bug. We compared the version numbers with the osdmap epoch at the time the object went unfound -- indeed the osdmap was e3593555 when this PG was marked recovery_unfound:

2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
...
2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
...
2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in

Could this be a race? Something like:

1. Repair starts; osd.951 gets a read error on the object while it is at v 3592634'367863320.
2. Later, while the repair is still running, the object is updated by a client, bumping it to v 3593555'368312656.
3. Repair completes a few minutes later; osd.951 asks its peers to recover the object at the older v 3592634'367863320 --> they don't have that version anymore, so the object is marked lost.

I opened a tracker here: https://tracker.ceph.com/issues/51194

Would be happy to hear any input -- I think we need a way to tell the cluster to use the object at v 3593555'368312656.
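
For reference, here is roughly how we are inspecting the unfound object and the versions each peer reports -- just a sketch: `ceph pg ... list_unfound` and `rados list-inconsistent-obj` are standard commands in Nautilus, but the exact output fields may differ a bit from what the comments below suggest, and `list-inconsistent-obj` only shows what the most recent scrub recorded:

# ceph health detail | grep 4.1904                          # confirm which PG holds the unfound object
# ceph pg 4.1904 list_unfound                               # per-object "need" version and which peers might still have it
# rados list-inconsistent-obj 4.1904 --format=json-pretty   # the read error recorded on shard 951 by the last scrub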
Cheers,

Dan

On Sun, Jun 13, 2021 at 8:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> The cluster here is running v14.2.20 and is used for RBD images.
>
> We have a PG in recovery_unfound state, and since this is the first time we've had this occur, we wanted to get your advice on the best course of action.
>
> PG 4.1904 went into state active+recovery_unfound+degraded+repair [1] during normal scrubbing (but note that we have `osd scrub auto repair = true`).
>
> 2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
> 2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
>
> The scrub detected a read error on the primary of this PG and tried to repair it by reading from the other 2 OSDs:
>
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
> Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531
>
> But it seems that the other 2 OSDs could not repair this failed read on the primary, because they don't have the correct version of the object:
>
> 2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
> 2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
> 2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
> 2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
> 2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing, 19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)
>
> I don't understand how the versions of the objects would get out of sync -- there have been no other recent failures on these disks, AFAICT. So my best guess is that the IO error on 951 confused the repair process -- osd.951 tried to recover the non-latest version of the object. (This would imply that the object versions on osds 14 and 344 are in fact the correct, newest versions.)
>
> We have a few ideas how to fix this:
>
> * osd 951 is sick, so drain it by setting `ceph osd primary-affinity 951 0` and `ceph osd out 951`.
> * osd 951 is really sick, so just stop it now and backfill its PGs to other OSDs.
> * Don't stop osd 951 yet: restart all three relevant OSDs and see if that fixes the object versions.
> * Don't drain osd 951 yet: make osd 14 or 344 the primary for this PG (e.g. `ceph osd primary-affinity 951 0`), then run `ceph pg repair 4.1904` so that the version from osds 14/344 can be pushed.
> * Use mark_unfound_lost revert, or delete (and inform the user to fsck their image).
>
> Does anyone have some recent experience or advice on this issue?
>
> Best Regards,
>
> Dan
>
> [1]
> # ceph pg 4.1904 query
> {
>     "state": "active+recovery_unfound+degraded+repair",
>     "snap_trimq": "[1c7fd~1,1c7ff~1,1c801~1,1c803~1,1c805~1]",
>     "snap_trimq_len": 5,
>     "epoch": 3593586,
>     "up": [
>         951,
>         344,
>         14
>     ],
>     "acting": [
>         951,
>         344,
>         14
>     ],
>     "acting_recovery_backfill": [
>         "14",
>         "344",
>         "951"
>     ],
>
> ...
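
For completeness, the "make osd 14 or 344 the primary, then re-run the repair" option from the quoted mail above would look roughly like the following. This is only a sketch using the pg/osd ids from this thread, and note that primary-affinity only influences which OSD is chosen as primary -- it does not move any data:

# ceph osd primary-affinity 951 0           # make osd.951 very unlikely to be chosen as primary
# ceph pg 4.1904 query | grep -A4 '"up"'    # check that 14 or 344 is now listed first, i.e. is the new primary
# ceph pg repair 4.1904                     # re-run the repair so the newer copy on osds 14/344 can be used

If that doesn't help, `ceph pg 4.1904 mark_unfound_lost revert` (or `delete`) remains the documented last resort, at the cost of losing the latest writes to that object -- hence the suggestion to have the user fsck their RBD image afterwards.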