Re: creating pools/pgs vs split

On Thu, 5 Apr 2018, Gregory Farnum wrote:
> On Thu, Apr 5, 2018 at 6:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > I'm stuck on an annoying issue with wip-peering-fast-dispatch.  There is a
> > race between handling of pg_create messages from the mon (pool creation)
> > and pg splits.  It goes like this:
> >
> > - pool is created in epoch E1
> > - pg A is created somewhere
> > - it creates on an osd, peers, is happy, but does not report back to mon
> > that create succeeded.
> > - pool pg_num increases, pg A splits into A+B in E2
> > - layout changes, time passes, etc.  mon still doesn't realize A is
> > created.
> > - on some other osd, B is created and instantiated in E3
> > - mon sends create for A to same osd
> >
> > The sharding on the OSD makes it very difficult for the osd to realize
> > that it shouldn't actually instantiate A as of E1.  If it does, it will
> > walk forward through maps and split into A+B in E2 and crash because pg B
> > already exists.  I've gone around in circles several times and the best
> > I've come up with is something that creates A, goes to do the split prep
> > work, realizes B exists, and has to back up by removing A again.  It's
> > very delicate and I'm pretty sure it still leaves open races if subsequent
> > splits happen in the meantime (E4) or other children besides B try to peer
> > onto the same OSD at the same time.  Let's call this option A.  Best case:
> > very complex, hard to explain, hard to test, hard to understand.
> 
> Okay, no, the answer here is not for the OSD to somehow realize it
> doesn't create PG A in epoch E. That can't ever be an answer because
> that's a global knowledge problem. I don't remember when we started

It's a global knowledge problem in the sense that the pg create should 
logically check that split children don't exist on any other shard *and* 
prime those slots on those shards *and* instantiate the new pg slot, all 
atomically, but the per-shard locking model doesn't allow that.  
(And even if we did take all the shard locks at once, that approach won't 
translate to seastar.)
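
To make the locking problem concrete, here's a toy model (made-up names, 
not actual Ceph code or the real sharding structures) of what the "atomic" 
cross-shard create would have to do: take every relevant shard lock, peek 
at the other shards' slots for split children, and only then instantiate 
the parent.

  // Toy model only -- not Ceph code; names and structures are made up to
  // illustrate why "check the split children on other shards and create A"
  // cannot be done atomically under per-shard locking.
  #include <cstdint>
  #include <map>
  #include <mutex>
  #include <set>
  #include <vector>

  struct Slot { bool instantiated = false; };

  struct Shard {
    std::mutex lock;
    std::map<uint32_t, Slot> slots;   // simplified: pg seed -> slot
  };

  // The "atomic" version would have to hold every relevant shard lock at
  // once (in a fixed order to avoid deadlock), inspect other shards'
  // slots, and only then instantiate A.  That cross-shard locking is
  // exactly what the sharded OSD avoids, and it has no equivalent in a
  // share-nothing (seastar) model.
  bool create_pg_atomically(std::vector<Shard>& shards,
                            uint32_t parent,
                            const std::vector<uint32_t>& children) {
    std::set<size_t> shard_ids{parent % shards.size()};
    for (auto c : children)
      shard_ids.insert(c % shards.size());

    std::vector<std::unique_lock<std::mutex>> held;
    for (auto id : shard_ids)            // sorted set -> fixed lock order
      held.emplace_back(shards[id].lock);

    for (auto c : children)
      if (shards[c % shards.size()].slots[c].instantiated)
        return false;                    // a split child already exists

    shards[parent % shards.size()].slots[parent].instantiated = true;
    return true;
  }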

> the monitor sending out "create PG B" messages after split happened
> during create, but that's the problem — it is, as you've noticed,
> fundamentally racy unless the "create B" message includes enough
> information for the target OSD to know it has to do some kind of
> peering process with the host of A. :(

The create messages are tagged with the pool creation epoch.  The PG is 
instantiated with that epoch and allowed to walk forward in time (and 
split as needed), and the usual peering PriorSet machinery makes sure we 
discover and capture any previous instance of the pg that was created.  
The split child PGs are never created by the mon--only the PGs for the 
pool's pg_num as of pool creation.
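
Roughly, the "walk forward and split" part looks like the sketch below.  
This is a simplified illustration, not the real pg_t::is_split() / 
ceph_stable_mod() logic, and it only models power-of-two pg_num jumps; 
pg_num_at_epoch_fn stands in for the OSDMap lookups.

  // Simplified sketch, not Ceph code: walk a freshly instantiated pg
  // forward from the pool-creation epoch and collect the direct split
  // children it picks up along the way.
  #include <cstdint>
  #include <vector>

  // pg_num of the pool at a given epoch; stand-in for OSDMap lookups.
  using pg_num_at_epoch_fn = uint32_t (*)(uint32_t epoch);

  std::vector<uint32_t> walk_forward(uint32_t seed,            // this pg
                                     uint32_t create_epoch,
                                     uint32_t current_epoch,
                                     pg_num_at_epoch_fn pg_num_at) {
    std::vector<uint32_t> children;
    uint32_t pg_num = pg_num_at(create_epoch);
    for (uint32_t e = create_epoch + 1; e <= current_epoch; ++e) {
      uint32_t new_pg_num = pg_num_at(e);
      // power-of-two doubling: pg 'seed' gains direct child 'seed + pg_num'
      while (new_pg_num >= pg_num * 2) {
        children.push_back(seed + pg_num);
        pg_num *= 2;
      }
      pg_num = new_pg_num;
    }
    return children;
  }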

> (It is allowed to do the dual creates if it never sent out a create
> message for the PG in question, but never afterwards! It sounds like
> maybe this refactor and increased parallelism is exposing some issues
> we've had for a while without noticing.)

The difference between the mon triggering a PG create and OSD peering 
messages triggering PG creates is that old peering messages expire and can 
be dropped at interval boundaries, which makes it possible to filter out 
stale "racy" messages in the scenario I described above; the mon create 
messages can't be dropped that way.  

> > Greg, this is the 'can't increase pg_num while pool is creating' issue you
> > called out during review.  It triggers a bunch of test failures because
> > tests create pools and set pg_num without a step waiting for pool
> > creation to quiesce.
> >
> > Option B: keep my original workaround to just disallow pg_num changes
> > while pools create.  CLI would return EBUSY.  Tests would need to add a
> > step that waits for pool create to finish before trying to adjust pg_num.
> > Or, mon command could block (indefinitely!) waiting for it to happen (not
> > keen on this one since OSDs might not be up and it could take forever).
> >
> > Option C: make the mon pay attention to splits and send the pg_create for
> > A *and* B as of E2. This means the mon will provide the PastIntervals and
> > pg_history_t.  Shifts complexity to the mon because it now has to pay
> > close attention to splits and manage additional associated metadata.  The
> > pg_create message is already changing as part of this branch
> > (MOSDPGCreate2) so no compat issues; a luminous->mimic mixed cluster would
> > be susceptible to the weird race conditions with racing pool create +
> > split, though (not a concern, I think, since this is not something that
> > really happens outside of stress testing).
> >
> > Option D: reconsider the way pools are created.  Instead of creating pool
> > foo with 65536 PGs, create it with 1 pg.  Have the mon register a
> > pg_num_target value and have 'set pg_num ...' modify this.  Once it (or
> > the mgr) sees that the pg has successfully created itself, *then* increase
> > pg_num in a controlled way, probably by jumping by powers of 2.  This is
> > actually a more efficient way to create a large pool on a large
> > cluster, probably, since we replace a linear process of the mon sending
> > out pg create messages with an exponential osd<->osd operation that
> > should be more like O(lg n).  It also shifts stress to the pg split
> > path for all cases, which means (hopefully) more thorough testing/coverage
> > and fewer code paths to worry about.
> >
> > I'm leaning toward option D because I think it also fits neatly into what
> > the mgr will do with pg_num reductions/merge anyway.  It also neatly
> > separates out the intended/target pg_num from the actual pg_num the OSDs
> > are working to realize.
> 
> Advantage of option D: as noted, it stresses an otherwise rare case
> and works nicely with auto-scaling PGs that we want to do anyway.
> Disadvantage: it's significantly less compatible during cluster
> upgrades and is a big change to how pool creation works. It's not
> clear to me if this is something admins will notice in a meaningful

None of this can kick in (we can't store pg_num_target in pg_pool_t) until 
the upgrade to mimic has completed.  Until that happens, we just do the 
old-style pool creates and the race exists between splits and pool 
creates.  In my view warning users away from creates+splits in the release 
notes is sufficient; I don't think this is something any real user would 
do in the course of a normal workload.

> way. It sounds like you want to use the PG auto-split/merge John has
> been doing, which I believe lives in the manager. That's a big new
> dependency for it I'm not sure we should be comfortable with.

You're worried about the mgr needing to be up in order to create a pool?  
(Or rather, to get a created pool to the initial pg count?)

> On first read I gravitate toward a variant of option C. As I said
> before, we can't update the "create A" message to a new epoch unless
> we've never sent out an older create. But we should be able to do
> something with the "create B" message, perhaps by, uh, not sending it.
> I spent some time looking at how this works now and I'm just not very
> familiar with the new OSDMapMapping and pg create stuff so you can
> probably answer those questions more quickly than I can. But if it's
> not too complicated, it has some notable advantages: more compatible
> with older monitors, probably a fix we can backport, doesn't radically
> change how pool creates work for admins at the end of a cycle (this is
> one I'm a bit worried about), and lives in the monitor without outside
> dependencies.

We already never send a create for B; the suggested change would be to 
*start* sending it for B, and to include history and PastIntervals for A 
and B.  It requires a big change to the CreatingPGs structure on the mon 
to include the new info, and a bunch of processing on the pending pg 
creates (*and* those pgs already in the queue), which will get complicated.

I'm still thinking D is easiest to implement, easiest to understand, and 
also neatly addresses the ramping problem we'll have with pg merging.
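
For concreteness, here's a rough sketch of the ramping step option D 
implies.  The names are hypothetical (pg_num_target is the proposed field; 
the helper below is not real mon/mgr code): the mon/mgr only nudges the 
actual pg_num toward the target once the current pgs have reported 
created, doubling each round, so reaching n pgs takes O(lg n) rounds of 
osd<->osd splits instead of n mon-driven creates.

  // Hypothetical sketch of option D's ramp logic; not actual mon/mgr code.
  // Given that every currently-mapped pg has reported "created", choose
  // the next pg_num to commit: double (a power-of-two step) until we
  // reach the admin's pg_num_target.
  #include <algorithm>
  #include <cstdint>

  uint32_t next_pg_num(uint32_t pg_num,          // actual pg_num in pg_pool_t
                       uint32_t pg_num_target,   // admin's requested value
                       bool all_pgs_created) {   // fed back by osds/mgr
    if (!all_pgs_created || pg_num >= pg_num_target)
      return pg_num;                             // nothing to do yet
    // step by a power of two, but never past the target
    return std::min<uint32_t>(pg_num * 2, pg_num_target);
  }

So a pool created with pg_num=1 and a target of 65536 would ramp 
1->2->4->...->65536 over 16 rounds, each gated on the previous round's pgs 
actually existing.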

sage
