Re: why degraded object must recover before accept new io

On Sun, Aug 5, 2018 at 10:04 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Sat, Aug 4, 2018 at 7:00 AM zengran zhang <z13121369189@xxxxxxxxx> wrote:
> >
> > >> On Wed, May 30, 2018 at 9:19 AM, zengran zhang <z13121369189@xxxxxxxxx> wrote:
> > >>> Hi
> > >>>     Let's say the acting set is [3, 1, 0] and obj1 was marked missing on
> > >>> osd.0 after peering; new io on obj1 will then wait until obj1 is recovered.
> > >>> So my question is: why can't we do the new io on [3, 1] and let osd.0
> > >>> keep missing obj1 without waiting on recovery, with osd.0 updating its
> > >>> pglog only, like backfill does? If the number of osds holding the newest
> > >>> object is more than min_size, do we need to wait for recovery?
> > >>
> > >> This isn't impossible to do, but it's *yet another* piece of metadata
> > >> we'd need to keep track of and account for in all the other recovery
> > >> and IO paths, so nobody's done it yet. Somebody would have to design
> > >> the algorithms, write the code, and persuade us the UX/performance
> > >> improvement is worth the ongoing maintenance burden. In particular,
> > >> note that any write-before-recovery needs to make sure it still obeys
> > >> the min_size rules, and I don't think any systems are set up to enable
> > >> that tracking separate from the acting set right now.
> > >>
> > >> I haven't looked closely at this code in a while so I don't know how
> > >> easy it would be to implement, or how high the bar for accepting such
> > >> a PR might be. It's not a bad thing to look at AFAIK, though! :)
> > >> -Greg
> >
> > Let's call a modification of a degraded object a degraded write; of course
> > the write needs to obey the min_size rules, and the object must not be
> > missing on the primary either.
> >
> > Normally/currently we carefully check the past intervals to find out whether
> > an object is unrecoverable. This works because an interval of OSDs is a
> > promise that any write in that interval was captured by all of that
> > interval's shards. But a degraded write breaks that rule, so there must be
> > a remedy.
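> >
> > To make that concrete, here is a rough sketch of the admission check I have
> > in mind (the names are made up for illustration, not the real PrimaryLogPG
> > API):
> >
> >   #include <set>
> >
> >   // Hypothetical check: may we accept a write on an object that is
> >   // missing on some shards of the acting set?
> >   bool can_do_degraded_write(const std::set<int>& acting,     // osds in the acting set
> >                              const std::set<int>& missing_on, // osds missing this object
> >                              int primary,
> >                              unsigned min_size)
> >   {
> >     if (missing_on.count(primary))
> >       return false;                  // the primary itself must hold the object
> >     unsigned complete = 0;
> >     for (int osd : acting)
> >       if (!missing_on.count(osd))
> >         ++complete;                  // shards that already have the newest object
> >     return complete >= min_size;     // keep the min_size promise for this write
> >   }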
> >
> > I think a possible approach is to record the location of each degraded write
> > that actually happened into the missing item of the degraded shard. The
> > primary shard could then gather the location sets of all degraded writes on
> > all shards; I think this would be done in the peering stage, GetMissing
> > (maybe this is not enough, because currently we only get missing info from
> > actingbackfill, so unfound items would need extra verification..). This could
> > let us turn the pg to incomplete when some shards still must be probed.
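> >
> > Roughly what I mean by recording the locations, with made-up structures
> > (this is not the real pg_missing_item, just a sketch of the extra field and
> > the gathering step):
> >
> >   #include <cstdint>
> >   #include <map>
> >   #include <set>
> >
> >   // Hypothetical missing item: besides the usual need/have versions,
> >   // remember which shards actually captured the degraded write.
> >   struct degraded_missing_item {
> >     uint64_t need = 0;             // version the shard needs
> >     uint64_t have = 0;             // version the shard has
> >     std::set<int> written_on;      // shards that received the degraded write
> >   };
> >
> >   using shard_missing = std::map<uint64_t /* object id */, degraded_missing_item>;
> >
> >   // During GetMissing the primary could union the recorded locations from
> >   // every shard's missing set; if none of the recorded locations can be
> >   // probed, the pg would have to go incomplete (or mark the object unfound).
> >   std::set<int> gather_write_locations(const std::map<int, shard_missing>& peer_missing,
> >                                        uint64_t oid)
> >   {
> >     std::set<int> locations;
> >     for (const auto& kv : peer_missing) {
> >       const shard_missing& missing = kv.second;
> >       auto it = missing.find(oid);
> >       if (it != missing.end())
> >         locations.insert(it->second.written_on.begin(),
> >                          it->second.written_on.end());
> >     }
> >     return locations;
> >   }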
> >
> > I know there is already an approach of async recovery, but I think we could
> > do better.
> >
> > Some discussion and a little code are at https://github.com/ceph/ceph/pull/22505.
> >
> > regards!
>
>
> [Re-send now that I finished the email and took the HTML off so vger is happy.]
>
> Copying the end of that discussion to the list.
> Zengran Zhang wrote:
> > sorry.. a bit more detail:
> > the choose_async_recovery_{replicated | ec} functions approximate the number of entries by which a candidate falls behind the auth shard,
> > approx_entries = auth_version - candidate_version;
> > then compare that approximate cost with an option value (osd_async_recovery_min_pg_log_entries=100) to choose candidates. so..
> > [1]. the approximate cost is not based on the real recovery cost; the worst case is an object with huge omap entries..
> > [2]. the default value of osd_async_recovery_min_pg_log_entries is arbitrary..
> > [3]. the number of async targets is restricted to no more than (size - min_size).
> >
> > in particular circumstances, like upgrading osds one by one, each shard of each pg may have some fallen-behind entries; then reason [3] becomes an upper-bound limit; consider the special case of an ec 8:3 or 4:2 pool..
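> >
> > Condensed into a sketch, the selection rule as I read it looks like this
> > (my paraphrase only, not the actual choose_async_recovery_* code; the real
> > code also orders the candidates, which I omit here):
> >
> >   #include <cstdint>
> >   #include <map>
> >   #include <set>
> >
> >   // Approximate each candidate's recovery cost by how many pg log entries
> >   // it is behind the auth shard ([1]), compare that with the
> >   // osd_async_recovery_min_pg_log_entries option ([2]), and cap the number
> >   // of async targets at (size - min_size) ([3]).
> >   std::set<int> choose_async_targets(uint64_t auth_version,
> >                                      const std::map<int, uint64_t>& candidate_versions,
> >                                      unsigned size, unsigned min_size,
> >                                      uint64_t min_pg_log_entries /* default 100 */)
> >   {
> >     std::set<int> async_targets;
> >     for (const auto& kv : candidate_versions) {
> >       if (async_targets.size() >= size - min_size)          // [3] upper bound
> >         break;
> >       uint64_t approx_entries = auth_version - kv.second;   // [1] approximate cost
> >       if (approx_entries > min_pg_log_entries)              // [2] threshold, default 100
> >         async_targets.insert(kv.first);
> >     }
> >     return async_targets;
> >   }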
>
> Sage wrote:
> > I think for [1], we can address that later when we have the partial recovery metadata in the log entries (e.g., which extent(s) in the object were modified) to estimate the recovery cost. Unfortunately we don't have a way to estimate the omap size currently.
> >
> > For [3], I'm not sure about the min_size and size relationship to the async targets.. @neha-ojha ?
> >
> > I'm going to close this PR since I don't think the code is directly relevant to our discussion any more :). Perhaps we should move this conversation to the ceph-devel email list?
>
> I thought the limit on async targets is merely that we maintain the
> required min_size (so if the PG moves a lot we could have 3 acting and
> 2 async on a size 3 pool), but I may be making that up based on a
> badly-remembered description. Neha?

That is correct.
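
To illustrate the bound (a toy sketch, not the actual selection code): the only
hard limit is that the shards left serving writes still reach min_size, so the
cap depends on how many candidate shards exist, not directly on the pool size.

  #include <cstddef>

  // Toy bound: how many candidate shards may be pushed to async recovery
  // while the remaining acting set still satisfies min_size?
  std::size_t max_async_targets(std::size_t candidates, std::size_t min_size)
  {
    return candidates > min_size ? candidates - min_size : 0;
  }
  // e.g. a size-3 pool (min_size 2) whose PG has moved around may have 5
  // candidate shards, allowing up to 3 async targets; keeping 3 acting and
  // 2 async, as in the example above, stays within this bound.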

- Neha

>
> -Greg


