Re: recovery_unfound during scrub with auto repair = true

Last update of the day:

We decided to stop osd.951 and mark it out. The PG went active, the
object was no longer unfound, and backfilling just completed
successfully.
Now we're deep scrubbing one more time, but it looks good!
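
For reference, a rough sketch of the commands involved (assuming
systemd-managed OSDs; adjust unit names and paths to your deployment):

    # on the host carrying osd.951
    systemctl stop ceph-osd@951
    ceph osd out 951

    # after backfill completed, re-verify the PG
    ceph pg deep-scrub 4.1904
    ceph pg 4.1904 query | grep state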

We'll follow up in the tracker about trying to reproduce this bug and
hopefully find a fix so it doesn't happen again.

Cheers, Dan


On Sun, Jun 13, 2021 at 3:36 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi again,
>
> We haven't taken any actions yet, but this seems like it might be a bug.
> We compared the object version numbers (which are of the form
> epoch'version) with the osdmap epoch at the time the object went
> unfound -- indeed the osdmap was e3593555 when this PG was marked
> recovery_unfound:
>
> 2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0)
> 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
> ...
> 2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR]
> 4.1904 shard 951 soid
> 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate
> had a read error
> ...
> 2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0)
> 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
> ...
> 2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904
> push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v
> 3592634'367863320 failed because local copy is 3593555'368312656
> 2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR]
> 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head
> v 3592634'367863320 failed because local copy is 3593555'368312656
> ...
> 2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0)
> 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in
>
>
> Could this be a race? Something like this:
>
> 1. Repair starts; osd.951 hits a read error on the object while it is
> at v 3592634'367863320.
> 2. Later on during the repair, the client updates that object, bumping
> it to v 3593555'368312656.
> 3. Repair completes a few minutes later; osd.951 asks its peers to
> recover the object at the older v 3592634'367863320 --> they no longer
> have that version, so the object is marked lost.
>
> I opened a tracker here: https://tracker.ceph.com/issues/51194
>
> Would be happy to hear any input -- I think we need a way to tell the
> cluster to use the object at v 3593555'368312656.
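>
> In case it's useful for anyone hitting the same thing, a minimal sketch
> of how to see what the cluster thinks is missing (output fields quoted
> from memory, so treat as approximate):
>
>     ceph pg 4.1904 list_unfound
>     # lists each unfound object with the version we need vs. have
>     ceph pg 4.1904 query
>     # the recovery_state section shows which peers might have it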
>
> Cheers, Dan
>
> On Sun, Jun 13, 2021 at 8:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > The cluster here is running v14.2.20 and is used for RBD images.
> >
> > We have a PG in recovery_unfound state and since this is the first
> > time we've had this occur, we wanted to get your advice on the best
> > course of action.
> >
> > PG 4.1904 went into state active+recovery_unfound+degraded+repair [1]
> > during normal scrubbing (but note that we have `osd scrub auto repair
> > = true`).
> >
> > 2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG]
> > 4.1904 repair starts
> > 2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR]
> > 4.1904 shard 951 soid
> > 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate
> > had a read error
> >
> > The scrub detected a read error on the primary of this PG, and tried
> > to repair it by reading from the other 2 osds:
> >
> > Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result:
> > hostbyte=DID_OK driverbyte=DR
> > Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key :
> > Medium Error [current] [descript
> > Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense:
> > Unrecovered read error
> > Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88
> > 00 00 00 00 02 ba 8c 0b 00
> > Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error,
> > dev sdp, sector 1171967531
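> >
> > As a side note, a minimal way to dump the per-shard scrub errors
> > (error field names quoted from memory):
> >
> >     rados list-inconsistent-obj 4.1904 --format=json-pretty
> >     # each inconsistent object is reported with per-shard errors,
> >     # e.g. a read_error on shard 951 in this case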
> >
> > But it seems the other 2 osds could not repair this failed read on
> > the primary, because they no longer hold the object at the version
> > being requested:
> >
> > 2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR]
> > 4.1904 repair 0 missing, 1 inconsistent objects
> > 2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR]
> > 4.1904 repair 1 errors, 1 fixed
> > 2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904
> > push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v
> > 3592634'367863320 failed because local copy is 3593555'368312656
> > 2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR]
> > 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head
> > v 3592634'367863320 failed because local copy is 3593555'368312656
> > 2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster
> > [DBG] pgmap v557097: 19456 pgs: 1
> > active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing,
> > 19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB
> > data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s
> > wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141
> > objects unfound (0.000%)
> >
> > I don't understand how the object versions would get out of sync --
> > there have been no other recent failures on these disks, AFAICT.
> > So my best guess is that the IO error on osd.951 confused the repair
> > process -- osd.951 tried to recover a non-latest version of the
> > object.
> > (This would imply that the copies on osds 14 and 344 are in fact the
> > correct, newest versions.)
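> >
> > If we wanted to double-check that, one option (untested here; it
> > requires taking the OSD down briefly) would be to read the object
> > info straight off a replica with ceph-objectstore-tool, roughly:
> >
> >     systemctl stop ceph-osd@14
> >     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 \
> >         --pgid 4.1904 --op list | grep 3a4ff12d847b61
> >     # then dump that object's info (which includes its version),
> >     # passing the JSON object spec printed by the list above:
> >     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 \
> >         '<object-json-from-list>' dump
> >     systemctl start ceph-osd@14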
> >
> > We have a few ideas for how to fix this (rough command sketches below):
> >
> > * osd 951 is sick, so drain it by setting `ceph osd primary-affinity
> > 951 0` and `ceph osd out 951`
> > * osd 951 is really sick, so just stop it now and backfill its PGs to
> > other OSDs.
> > * Don't stop osd 951 yet: Restart all three relevant OSDs and see if
> > that fixes the object versions.
> > * Don't drain osd 951 yet: Make OSD 14 or 344 the primary for this PG
> > (e.g. `ceph osd primary-affinity 951 0`), then run `ceph pg repair
> > 4.1904` so that the version from osds 14/344 can be pushed.
> > * Use mark_unfound_lost revert or delete (and ask the user to fsck
> > their image).
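> >
> > The rough commands for those options would look something like this
> > (a sketch only -- nothing has been run yet):
> >
> >     # drain / deprioritize the sick primary
> >     ceph osd primary-affinity 951 0
> >     ceph osd out 951
> >
> >     # or, once 14/344 lead the PG, retry the repair
> >     ceph pg repair 4.1904
> >
> >     # last resort: give up on the unfound copy
> >     ceph pg 4.1904 mark_unfound_lost revert   # roll back to a prior version
> >     ceph pg 4.1904 mark_unfound_lost delete   # or forget the object entirely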
> >
> > Does anyone have some recent experience or advice on this issue?
> >
> > Best Regards,
> >
> > Dan
> >
> > [1]
> > # ceph pg 4.1904 query
> > {
> >     "state": "active+recovery_unfound+degraded+repair",
> >     "snap_trimq": "[1c7fd~1,1c7ff~1,1c801~1,1c803~1,1c805~1]",
> >     "snap_trimq_len": 5,
> >     "epoch": 3593586,
> >     "up": [
> >         951,
> >         344,
> >         14
> >     ],
> >     "acting": [
> >         951,
> >         344,
> >         14
> >     ],
> >     "acting_recovery_backfill": [
> >         "14",
> >         "344",
> >         "951"
> >     ],
> >
> > ...
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


