Hi again,

We haven't taken any actions yet, but this seems like it might be a bug. We compared the version numbers with the osdmap epoch at the time the object went unfound -- indeed the osdmap was e3593555 when this PG was marked recovery_unfound:

2021-06-13 03:50:13.808204 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232629 : cluster [DBG] osdmap e3593554: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
...
2021-06-13 04:04:50.359786 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1232802 : cluster [DBG] osdmap e3593555: 1224 total, 1222 up, 1222 in
...
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
...
2021-06-13 04:31:17.106383 mon.cephbeesly-mon-2a00f134e5 (mon.0) 1233177 : cluster [DBG] osdmap e3593556: 1224 total, 1222 up, 1222 in

Could this be a race? Something like:

1. Repair starts; osd.951 gets a read error on the object while it is at v 3592634'367863320.
2. Later, while the repair is still running, the object is updated by a client, bumping it to v 3593555'368312656.
3. Repair completes a few minutes later; osd.951 asks its peers to recover the object at the older v 3592634'367863320 --> they don't have that version anymore, so the object is marked lost.

I opened a tracker here: https://tracker.ceph.com/issues/51194

Would be happy to hear any input -- I think we need a way to tell the cluster to use the object at v 3593555'368312656.
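
For reference, here is roughly how we are inspecting the unfound object and the versions each peer reports -- just a sketch: `ceph pg ... list_unfound` and `rados list-inconsistent-obj` are standard commands in Nautilus, but the exact output fields may differ a bit from what the comments below suggest, and `list-inconsistent-obj` only shows what the most recent scrub recorded:

# ceph health detail | grep 4.1904                          # confirm which PG holds the unfound object
# ceph pg 4.1904 list_unfound                               # per-object "need" version and which peers might still have it
# rados list-inconsistent-obj 4.1904 --format=json-pretty   # the read error recorded on shard 951 by the last scrub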
Cheers,

Dan

On Sun, Jun 13, 2021 at 8:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> The cluster here is running v14.2.20 and is used for RBD images.
>
> We have a PG in recovery_unfound state, and since this is the first time we've had this occur, we wanted to get your advice on the best course of action.
>
> PG 4.1904 went into state active+recovery_unfound+degraded+repair [1] during normal scrubbing (but note that we have `osd scrub auto repair = true`).
>
> 2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG] 4.1904 repair starts
> 2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR] 4.1904 shard 951 soid 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate had a read error
>
> The scrub detected a read error on the primary of this PG and tried to repair it by reading from the other 2 OSDs:
>
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DR
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key : Medium Error [current] [descript
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense: Unrecovered read error
> Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88 00 00 00 00 02 ba 8c 0b 00
> Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error, dev sdp, sector 1171967531
>
> But it seems that the other 2 OSDs could not repair this failed read on the primary, because they don't have the correct version of the object:
>
> 2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR] 4.1904 repair 0 missing, 1 inconsistent objects
> 2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR] 4.1904 repair 1 errors, 1 fixed
> 2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
> 2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR] 4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v 3592634'367863320 failed because local copy is 3593555'368312656
> 2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster [DBG] pgmap v557097: 19456 pgs: 1 active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing, 19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141 objects unfound (0.000%)
>
> I don't understand how the versions of the objects would get out of sync -- there have been no other recent failures on these disks, AFAICT. So my best guess is that the IO error on 951 confused the repair process -- osd.951 tried to recover the non-latest version of the object. (This would imply that the object versions on osds 14 and 344 are in fact the correct, newest versions.)
>
> We have a few ideas how to fix this:
>
> * osd 951 is sick, so drain it by setting `ceph osd primary-affinity 951 0` and `ceph osd out 951`.
> * osd 951 is really sick, so just stop it now and backfill its PGs to other OSDs.
> * Don't stop osd 951 yet: restart all three relevant OSDs and see if that fixes the object versions.
> * Don't drain osd 951 yet: make osd 14 or 344 the primary for this PG (e.g. `ceph osd primary-affinity 951 0`), then run `ceph pg repair 4.1904` so that the version from osds 14/344 can be pushed.
> * Use mark_unfound_lost revert, or delete (and inform the user to fsck their image).
>
> Does anyone have some recent experience or advice on this issue?
>
> Best Regards,
>
> Dan
>
> [1]
> # ceph pg 4.1904 query
> {
>     "state": "active+recovery_unfound+degraded+repair",
>     "snap_trimq": "[1c7fd~1,1c7ff~1,1c801~1,1c803~1,1c805~1]",
>     "snap_trimq_len": 5,
>     "epoch": 3593586,
>     "up": [
>         951,
>         344,
>         14
>     ],
>     "acting": [
>         951,
>         344,
>         14
>     ],
>     "acting_recovery_backfill": [
>         "14",
>         "344",
>         "951"
>     ],
>
> ...
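
For completeness, the "make osd 14 or 344 the primary, then re-run the repair" option from the quoted mail above would look roughly like the following. This is only a sketch using the pg/osd ids from this thread, and note that primary-affinity only influences which OSD is chosen as primary -- it does not move any data:

# ceph osd primary-affinity 951 0           # make osd.951 very unlikely to be chosen as primary
# ceph pg 4.1904 query | grep -A4 '"up"'    # check that 14 or 344 is now listed first, i.e. is the new primary
# ceph pg repair 4.1904                     # re-run the repair so the newer copy on osds 14/344 can be used

If that doesn't help, `ceph pg 4.1904 mark_unfound_lost revert` (or `delete`) remains the documented last resort, at the cost of losing the latest writes to that object -- hence the suggestion to have the user fsck their RBD image afterwards.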