I'm stuck on an annoying issue with wip-peering-fast-dispatch. There is a race between handling of pg_create messages from the mon (pool creation) and pg splits. It goes like this:

- pool is created in epoch E1
- pg A is created somewhere
  - it creates on an osd, peers, is happy, but does not report back to the mon that the create succeeded
- pool pg_num increases, and pg A splits into A+B in E2
- layout changes, time passes, etc.; the mon still doesn't realize A is created
- on some other osd, B is created and instantiated in E3
- mon sends a create for A to that same osd

The sharding on the OSD makes it very difficult for the osd to realize that it shouldn't actually instantiate A as of E1. If it does, it will walk forward through maps, split into A+B in E2, and crash because pg B already exists.

I've gone around in circles several times, and the best I've come up with is something that creates A, goes to do the split prep work, realizes B exists, and has to back up by removing A again. It's very delicate, and I'm pretty sure it still leaves open races if subsequent splits happen in the meantime (E4) or if other children besides B try to peer onto the same OSD at the same time. Let's call this option A. Best case: very complex, hard to explain, hard to test, hard to understand.

Greg, this is the 'can't increase pg_num while pool is creating' issue you called out during review. It triggers a bunch of test failures because tests create pools and set pg_num without a step that waits for pool creation to quiesce.

Option B: keep my original workaround and just disallow pg_num changes while pools are creating. The CLI would return EBUSY. Tests would need to add a step that waits for pool create to finish before trying to adjust pg_num. Or, the mon command could block (indefinitely!) waiting for it to happen (not keen on this one, since OSDs might not be up and it could take forever).

Option C: make the mon pay attention to splits and send the pg_create for A *and* B as of E2. This means the mon will provide the PastIntervals and pg_history_t. It shifts complexity to the mon, because the mon now has to pay close attention to splits and manage additional associated metadata. The pg_create message is already changing as part of this branch (MOSDPGCreate2), so there are no compat issues; a luminous->mimic mixed cluster would still be susceptible to the weird races between pool create and split, though (not a concern, I think, since this is not something that really happens outside of stress testing).

Option D: reconsider the way pools are created. Instead of creating pool foo with 65536 PGs, create it with 1 pg. Have the mon register a pg_num_target value, and have 'set pg_num ...' modify that. Once the mon (or the mgr) sees that the pg has successfully created itself, *then* increase pg_num in a controlled way, probably by jumping by powers of 2. This is probably a more efficient way to create a large pool on a large cluster, since we replace a linear process of the mon sending out pg create messages with an exponential osd<->osd operation that should be more like O(lg n). It also shifts stress to the pg split path in all cases, which means (hopefully) more thorough testing/coverage and fewer code paths to worry about.

I'm leaning toward option D because I think it also fits neatly into what the mgr will do with pg_num reductions/merges anyway. It also neatly separates the intended/target pg_num from the actual pg_num the OSDs are working to realize.
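For concreteness, here's a minimal sketch of the option D ramp logic, assuming the mon/mgr tracks a pg_num_target field and an all-creates-reported flag; the names (pool_state, all_created, next_pg_num) are made up for illustration and aren't code from the branch:

  // Hypothetical sketch of option D's controlled pg_num ramp; not
  // actual branch code.
  #include <algorithm>
  #include <cstdint>

  struct pool_state {
    uint32_t pg_num;         // pg_num the OSDs are currently realizing
    uint32_t pg_num_target;  // operator intent, set via 'set pg_num ...'
    bool all_created;        // every current pg has reported created
  };

  // pg_num to publish in the next osdmap epoch: hold steady until the
  // current round of creates/splits quiesces, then at most double, and
  // never overshoot the target, so each step is a clean power-of-2 split.
  uint32_t next_pg_num(const pool_state& p) {
    if (!p.all_created || p.pg_num >= p.pg_num_target) {
      return p.pg_num;
    }
    return std::min(p.pg_num * 2u, p.pg_num_target);
  }

With that shape, a pool with a target of 65536 ramps 1 -> 2 -> 4 -> ... -> 65536 in 16 rounds, one doubling per quiesced round of creates, which is where the O(lg n) behavior above comes from.

Thoughts?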
sage