I'm stuck on an annoying issue with wip-peering-fast-dispatch. There is a race between handling of pg_create messages from the mon (pool creation) and pg splits. It goes like this:

- pool is created in epoch E1
- pg A is created somewhere
  - it creates on an osd, peers, is happy, but does not report back to the mon that the create succeeded
- pool pg_num increases, and pg A splits into A+B in E2
- layout changes, time passes, etc.; the mon still doesn't realize A is created
- on some other osd, B is created and instantiated in E3
- mon sends a create for A to that same osd

The sharding on the OSD makes it very difficult for the osd to realize that it shouldn't actually instantiate A as of E1. If it does, it will walk forward through maps, split into A+B in E2, and crash because pg B already exists.

I've gone around in circles several times, and the best I've come up with is something that creates A, goes to do the split prep work, realizes B exists, and has to back up by removing A again. It's very delicate, and I'm pretty sure it still leaves open races if subsequent splits happen in the meantime (E4) or if other children besides B try to peer onto the same OSD at the same time. Let's call this option A. Best case: very complex, hard to explain, hard to test, hard to understand.

Greg, this is the 'can't increase pg_num while pool is creating' issue you called out during review. It triggers a bunch of test failures because tests create pools and set pg_num without a step that waits for pool creation to quiesce.

Option B: keep my original workaround and just disallow pg_num changes while pools are creating. The CLI would return EBUSY. Tests would need to add a step that waits for pool create to finish before trying to adjust pg_num. Or, the mon command could block (indefinitely!) waiting for it to happen (not keen on this one, since OSDs might not be up and it could take forever).

Option C: make the mon pay attention to splits and send the pg_create for A *and* B as of E2. This means the mon will provide the PastIntervals and pg_history_t. It shifts complexity to the mon, because the mon now has to pay close attention to splits and manage additional associated metadata. The pg_create message is already changing as part of this branch (MOSDPGCreate2), so there are no compat issues; a luminous->mimic mixed cluster would still be susceptible to the weird races between pool create and split, though (not a concern, I think, since this is not something that really happens outside of stress testing).

Option D: reconsider the way pools are created. Instead of creating pool foo with 65536 PGs, create it with 1 pg. Have the mon register a pg_num_target value, and have 'set pg_num ...' modify that. Once the mon (or the mgr) sees that the pg has successfully created itself, *then* increase pg_num in a controlled way, probably by jumping by powers of 2. This is probably a more efficient way to create a large pool on a large cluster, since we replace a linear process of the mon sending out pg create messages with an exponential osd<->osd operation that should be more like O(lg n). It also shifts stress to the pg split path in all cases, which means (hopefully) more thorough testing/coverage and fewer code paths to worry about.

I'm leaning toward option D because I think it also fits neatly into what the mgr will do with pg_num reductions/merges anyway. It also neatly separates the intended/target pg_num from the actual pg_num the OSDs are working to realize.
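For concreteness, here's a minimal sketch of the option D ramp logic, assuming the mon/mgr tracks a pg_num_target field and an all-creates-reported flag; the names (pool_state, all_created, next_pg_num) are made up for illustration and aren't code from the branch:

  // Hypothetical sketch of option D's controlled pg_num ramp; not
  // actual branch code.
  #include <algorithm>
  #include <cstdint>

  struct pool_state {
    uint32_t pg_num;         // pg_num the OSDs are currently realizing
    uint32_t pg_num_target;  // operator intent, set via 'set pg_num ...'
    bool all_created;        // every current pg has reported created
  };

  // pg_num to publish in the next osdmap epoch: hold steady until the
  // current round of creates/splits quiesces, then at most double, and
  // never overshoot the target, so each step is a clean power-of-2 split.
  uint32_t next_pg_num(const pool_state& p) {
    if (!p.all_created || p.pg_num >= p.pg_num_target) {
      return p.pg_num;
    }
    return std::min(p.pg_num * 2u, p.pg_num_target);
  }

With that shape, a pool with a target of 65536 ramps 1 -> 2 -> 4 -> ... -> 65536 in 16 rounds, one doubling per quiesced round of creates, which is where the O(lg n) behavior above comes from.

Thoughts?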
sage