> Suppose we have a 2+1 EC pool, and an object is missing the 2 shards
> on the non-primary osds. We initiate backfill by setting a
> non-primary osd out. During the backfill the primary osd detects the
> missing shards and the pg enters the "backfill_unfound" state; the
> last_backfill position is properly set to the object before the
> "unfound" one (in post-nautilus; for nautilus I opened [1] to make it
> work). If re-peering occurs because a non-primary osd is restarted,
> the backfill is restarted from the last_backfill position and the
> "unfound" object is detected again. But if re-peering occurs because
> the primary osd is temporarily stopped (restarted), another
> non-primary osd becomes primary and "drives" the backfill from the
> last_backfill position. As the shard is missing on the new primary,
> the object is just skipped from the backfill, the missing object is
> not detected, and the pg enters the clean state.
>
> Is there something that can/should be improved here? It is rather
> unfortunate that the information about the missing object is lost on
> the restart (until scrub or the next backfill). On the other hand,
> the situation where many shards are missing for an object is rather
> unlikely. Also, if for example the shard happened to be missing on
> the primary, it would not even be detected on backfill.
>
> [1] https://github.com/ceph/ceph/pull/41293

In the primary osd case, is there a situation where the user wants to
reset the state (out of the unfound state)? If we fix this behavior,
would it cause another problem because the state can no longer be
reset?

--
Jin
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
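
To make the scenario in the quoted message concrete, here is a toy model in Python. It is not Ceph code: `drive_backfill`, the `shards` mapping, and the object names are hypothetical, and the only assumption it encodes is the one stated in the quoted message, namely that the acting primary walks the objects it holds locally, starting after last_backfill, and so never looks at an object whose shard it lacks.

```python
K = 2  # shards needed to reconstruct an object in a 2+1 EC pool


def drive_backfill(primary, shards, last_backfill):
    """Toy model of a backfill pass driven by `primary`.

    `shards` maps an osd id to the set of object names it holds a shard of.
    Returns (pg_state, last_backfill). Hypothetical helper, not a Ceph API.
    """
    # The primary only enumerates objects present in its own local store.
    for name in sorted(shards[primary]):
        if name <= last_backfill:
            continue  # already backfilled before the interruption
        available = sum(1 for held in shards.values() if name in held)
        if available < K:
            # Too few shards to reconstruct: stop before this object.
            return "backfill_unfound", last_backfill
        last_backfill = name
    return "clean", last_backfill


# Object "b" has lost its shards on both non-primary osds (1 and 2).
shards = {0: {"a", "b", "c"}, 1: {"a", "c"}, 2: {"a", "c"}}

first = drive_backfill(0, shards, "")              # osd 0 is primary
same_primary = drive_backfill(0, shards, first[1])  # non-primary restarted
new_primary = drive_backfill(1, shards, first[1])   # osd 0 restarted
```

In this model `first` and `same_primary` both come out as `("backfill_unfound", "a")`, while `new_primary` is `("clean", "c")`: osd 1 never lists "b", so the unfound object drops out of sight exactly as described above.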