I think all we need is at least one complete peer from the most recent
interval which served writes. Taking the max history.les over a set of
infos that contains at least one info from the acting set of each
interval back to some history.last_epoch_started value gives us the
most recent interval which could have accepted writes. Any complete
peer with info.les >= that value would thus fit the bill.

I think that if we receive an incomplete info from an OSD x in the
acting set in interval i, it proves that even if maybe_went_rw is true
(doesn't violate min_size or up_thru, etc.) we cannot have gone active
in interval i. Thus, filtering out incomplete infos in choose_acting
can't remove an info from an interval which might have updated
history.last_epoch_started, even if one of those infos happened to be
the only one from a particular interval.

I think the additional checks from
c7d92d1d3fe469f5e8e7c35185a670570c665029 might actually leave us down
in a recoverable case. Not entirely sure though, still looking into it.
-Sam

On Fri, Dec 9, 2016 at 4:42 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 9 Dec 2016, Samuel Just wrote:
>> Oh, I haven't gotten to that part yet. I was referring to
>> c7d92d1d3fe469f5e8e7c35185a670570c665029, which duplicates some of
>> the logic in the PriorSet constructor to verify, after we have the
>> infos, that every interval has at least one complete peer
>> (last_backfill=max). I think it mattered in early 2012 when backfill
>> peers existed in the acting set, but now it is either unnecessary or
>> better handled in choose_acting.
>
> Oh, I see.
>
> If we don't have this check, then *somewhere else* we have to be
> verifying that we have at least one info (that is, !incomplete) from
> the last interval, or else we can't be sure we aren't missing writes.
> ...right? My memory is hazy here. I can't remember if there is a
> secondary effect where we can infer that we didn't go active in the
> later interval because the OSDs in the earlier interval haven't
> advanced last_epoch_started...
>
> s
>
>> ATM, I'm trying to prune down the number of questions asked of
>> past_intervals to see how much I can leave out.
>> -Sam
>>
>> On Fri, Dec 9, 2016 at 4:04 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > I'm not quite following. The criterion I used before was that an
>> > interval goes into past_intervals iff maybe_went_rw=true. In this
>> > case, it would be true; we wouldn't know what infos we'd get until
>> > partway through peering anyway.
>> >
>> > (The other thing I did was, after the fact, filter out older
>> > intervals that have duplicate up[_primary]/acting[_primary].)
>> >
>> > sage
>> >
>> > On Fri, 9 Dec 2016, Samuel Just wrote:
>> >
>> >> Yeah, I don't think it does. If by some chance we only got an
>> >> incomplete info from such an interval, it doesn't matter whether
>> >> choose_acting skips it or not when determining the most recent
>> >> active interval, since it couldn't have changed last_epoch_started
>> >> or history.last_epoch_started. I think I can just kill that
>> >> section.
>> >> -Sam
>> >>
>> >> On Fri, Dec 9, 2016 at 3:41 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >> > Currently trying to figure out whether
>> >> > c7d92d1d3fe469f5e8e7c35185a670570c665029 matters (any interval
>> >> > where acting contains an incomplete peer could not have gone
>> >> > active, right?)
>> >> > -Sam
>> >> >
>> >> > On Fri, Dec 9, 2016 at 3:40 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >> >> Yeah, that's pretty much where I'm going.
>> >> >> -Sam
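
To make the selection rule in the message at the top of this thread
concrete, here is a rough C++ sketch of the idea. SimpleInfo,
most_recent_rw_les, and find_usable_peer are invented stand-ins for
Ceph's pg_info_t and peering code, not the actual implementation:

#include <algorithm>
#include <cstdint>
#include <vector>

using epoch_t = uint32_t;

// Simplified stand-in for the relevant fields of pg_info_t.
struct SimpleInfo {
  int osd;        // OSD this info came from
  epoch_t les;    // info.last_epoch_started
  bool complete;  // last_backfill == MAX
};

// Given infos covering the acting set of every interval back to
// history.last_epoch_started, the max les bounds the most recent
// interval that could have accepted writes. An interval whose acting
// set includes an incomplete peer cannot have gone active, so its
// infos cannot raise this bound.
epoch_t most_recent_rw_les(const std::vector<SimpleInfo>& infos) {
  epoch_t max_les = 0;
  for (const auto& i : infos)
    max_les = std::max(max_les, i.les);
  return max_les;
}

// Any complete peer whose les reaches that bound has seen every write
// and "fits the bill" as a usable authoritative peer.
const SimpleInfo* find_usable_peer(const std::vector<SimpleInfo>& infos) {
  const epoch_t bound = most_recent_rw_les(infos);
  for (const auto& i : infos)
    if (i.complete && i.les >= bound)
      return &i;
  return nullptr;  // none yet: the PG stays down until one shows up
}

The point of the bound is that incomplete infos can be filtered out
before it is computed without changing the answer, which is the
argument above for dropping the extra c7d92d1d check.
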
>> >> >> On Fri, Dec 9, 2016 at 3:24 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> >>> On Fri, 9 Dec 2016, Samuel Just wrote:
>> >> >>>> Currently, never. However, I'm thinking that we might want to
>> >> >>>> retain the freedom to not send the structure if it's really
>> >> >>>> big. And actually, we won't ever need to extend a received
>> >> >>>> past_intervals structure to the current epoch, since if the
>> >> >>>> interval changed, we'd throw out the whole message.
>> >> >>>
>> >> >>> I'd say throw out the incremental build code, then, and assert
>> >> >>> that past_intervals is present at notify time; we can re-add
>> >> >>> something to recalculate the whole past_intervals if it becomes
>> >> >>> necessary in the future.
>> >> >>>
>> >> >>> sage
>> >> >>>
>> >> >>>> -Sam
>> >> >>>>
>> >> >>>> On Fri, Dec 9, 2016 at 2:16 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> >>>> > On Fri, 9 Dec 2016, Samuel Just wrote:
>> >> >>>> >> In particular, we don't need PG::generate_past_intervals
>> >> >>>> >> duplicating the logic in build_past_intervals_parallel,
>> >> >>>> >> since constructed PG objects only ever need to maintain a
>> >> >>>> >> consistent past_intervals structure, never build it from
>> >> >>>> >> scratch.
>> >> >>>> >
>> >> >>>> > Sounds good to me.
>> >> >>>> >
>> >> >>>> > My main question is:
>> >> >>>> >
>> >> >>>> >> On Fri, Dec 9, 2016 at 11:59 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >> >>>> >> > There's code for dealing with some odd past_intervals
>> >> >>>> >> > configurations, including ones that don't go back far
>> >> >>>> >> > enough and ones that don't go forward far enough. I
>> >> >>>> >> > *think* we can simplify this as follows:
>> >> >>>> >> > 1) Once the PG object is constructed and in memory,
>> >> >>>> >> > past_intervals extends from history.last_epoch_started
>> >> >>>> >> > to the PG's current map
>> >> >>>> >> > 2) On disk, either 1) is true or the set is empty (after
>> >> >>>> >> > an import)
>> >> >>>> >> >
>> >> >>>> >> > On boot, the OSD generates past_intervals in parallel for
>> >> >>>> >> > any PGs without them (and perhaps verifies 1) for the
>> >> >>>> >> > rest). On receipt of a Notify creating a PG, the OSD
>> >> >>>> >> > generates the past_intervals structure before
>> >> >>>> >> > instantiating the PG, using the same process as on boot --
>> >> >>>> >> > starting with the included past_intervals if present (it
>> >> >>>> >> > may not extend to the current map, and so may need to be
>> >> >>>> >> > extended).
>> >> >>>> >
>> >> >>>> > When does this actually happen? If PGs are always in state
>> >> >>>> > 1), can we instead ensure that a PG notify will always
>> >> >>>> > include past_intervals, and that a received notify will
>> >> >>>> > never require us to go off and do the (slow) work of loading
>> >> >>>> > up old maps to generate old intervals?
>> >> >>>> >
>> >> >>>> > Then we can focus our efforts on making the past_intervals
>> >> >>>> > representation compact (e.g., by discarding redundant
>> >> >>>> > intervals)...
>> >> >>>> >
>> >> >>>> > sage
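
Invariant 1) and the notify-time assertion suggested above might look
roughly like the following sketch. PastIntervalsSketch, covers(), and
on_notify() are invented names, not Ceph's actual PastIntervals API;
for simplicity the sketch keeps every interval, whereas storing only
maybe_went_rw intervals (as discussed above) would make the gap check
subtler:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using epoch_t = uint32_t;

// One past interval; only the epoch span matters for this sketch.
struct IntervalSpan {
  epoch_t first, last;  // inclusive epoch range
};

struct PastIntervalsSketch {
  std::vector<IntervalSpan> intervals;  // oldest first

  // Invariant 1) from the thread: in memory, the structure must run
  // from history.last_epoch_started up to the start of the PG's
  // current interval, with no gaps.
  bool covers(epoch_t les, epoch_t current_interval_start) const {
    epoch_t next = les;
    for (const auto& iv : intervals) {
      if (iv.first > next)
        return false;  // a gap: some interval is missing
      next = std::max<epoch_t>(next, iv.last + 1);
    }
    return next >= current_interval_start;
  }
};

// The suggestion above: drop the incremental rebuild path entirely and
// assert that the sender (always in state 1) included a structure that
// already reaches the current interval.
void on_notify(const PastIntervalsSketch* received,
               epoch_t history_les, epoch_t current_interval_start) {
  assert(received != nullptr);
  assert(received->covers(history_les, current_interval_start));
  // ... instantiate the PG using *received ...
}
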
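As a footnote to the compaction question at the end of the thread, the
two pruning rules mentioned (keep an interval iff maybe_went_rw=true;
afterwards, drop older intervals whose up[_primary]/acting[_primary]
duplicate a newer one's) could look roughly like this. This is a
speculative sketch with invented names (IntervalRec, compact), not code
from the thread or from Ceph:

#include <cstdint>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

using epoch_t = uint32_t;

// Invented record type for one past interval.
struct IntervalRec {
  epoch_t first, last;
  std::vector<int> up, acting;
  int up_primary, acting_primary;
  bool maybe_went_rw;  // passed min_size/up_thru checks, etc.
};

// Input is newest first; returns the pruned list.
std::vector<IntervalRec> compact(std::vector<IntervalRec> in) {
  using Key = std::tuple<std::vector<int>, std::vector<int>, int, int>;
  std::set<Key> seen;
  std::vector<IntervalRec> out;
  for (auto& iv : in) {
    if (!iv.maybe_went_rw)
      continue;  // couldn't have served writes; nothing to learn here
    Key key{iv.up, iv.acting, iv.up_primary, iv.acting_primary};
    if (!seen.insert(key).second)
      continue;  // same up/acting as a newer interval: redundant,
                 // since it names no OSDs we aren't already asking
    out.push_back(std::move(iv));
  }
  return out;
}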