Re: backfill_unfound state reset to clean after osd restart

"Jin Hase" <hase.jin@xxxxxxxxxxxxxx> · Wed, 19 May 2021 09:39:08 -0000

> Suppose we have a 2+1 EC pool, and an object is missing its shards on
> both non-primary osds. We initiate backfill by setting a non-primary
> osd out. During the backfill the primary osd detects the missing
> shards and the pg enters the "backfill_unfound" state; the last_backfill
> position is properly set to the object before the "unfound" one (in
> post-nautilus; for nautilus I opened [1] to make it work). If
> re-peering occurs because a non-primary osd is restarted, the backfill
> is restarted from the last_backfill position and the "unfound" object
> is detected again. But if re-peering occurs because the primary osd is
> temporarily stopped (restarted), another non-primary osd becomes
> primary and "drives" the backfill from the last_backfill position.
> As the shard is missing on that osd, the object is simply skipped by
> the backfill, the missing object is not detected, and the pg enters
> the clean state.
> 
> Is there something that can/should be improved here? It is rather
> unfortunate that the information about the missing object is lost on
> the restart (until scrub or the next backfill). On the other hand, a
> situation where many shards of an object are missing is rather
> unlikely. Also, if for example the shard happened to be missing on
> the primary, it would not even be detected on backfill.
> 
> [1] https://github.com/ceph/ceph/pull/41293

In the case where the primary osd is restarted, is there a case where the user actually wants the state to be reset (out of the unfound state)?
If we fix this behavior, would that cause another problem because the state can then no longer be reset this way?
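
To make sure I understand the scenario correctly, here is a toy model of it in Python. It is only a sketch and not Ceph code: run_backfill, K, and the shards map are hypothetical names, and I assume that the osd driving the backfill only enumerates objects for which it holds a shard itself, as described in the quoted message.

```python
# Toy model only: NOT Ceph code. All names here are hypothetical.
K = 2  # data shards needed to reconstruct an object in a 2+1 EC pool


def run_backfill(objects, shards, primary, last_backfill=None):
    """Scan objects in sorted order after last_backfill, driven by `primary`.

    shards maps an osd name to the set of objects it holds a shard of.
    Returns (pg_state, last_backfill).
    """
    for obj in sorted(objects):
        if last_backfill is not None and obj <= last_backfill:
            continue  # already backfilled before re-peering
        if obj not in shards[primary]:
            # Assumption: the driving osd never sees an object it has
            # no shard of, so the object is silently skipped.
            continue
        holders = sum(obj in held for held in shards.values())
        if holders < K:
            # Not enough shards to reconstruct: stop before this object.
            return "backfill_unfound", last_backfill
        last_backfill = obj
    return "clean", last_backfill


objects = ["a", "b", "c"]
shards = {
    "osd.0": {"a", "b", "c"},  # original primary
    "osd.1": {"a", "c"},       # shard of "b" is missing
    "osd.2": {"a", "c"},       # shard of "b" is missing
}

# Backfill driven by the original primary detects the unfound object.
first = run_backfill(objects, shards, "osd.0")
# After the primary restarts, osd.1 drives from last_backfill and skips "b".
second = run_backfill(objects, shards, "osd.1", last_backfill=first[1])
print(first, second)
```

In this model the first run stops at "b" with last_backfill set to "a", and the second run ends in "clean", which matches the behavior described above.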

--
Jin