Re: creating pools/pgs vs split

On Thu, Apr 5, 2018 at 6:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> I'm stuck on an annoying issue with wip-peering-fast-dispatch.  There is a
> race between handling of pg_create messages from the mon (pool creation)
> and pg splits.  It goes like this:
>
> - pool is created in epoch E1
> - pg A is created somewhere
> - it creates on an osd, peers, is happy, but does not report back to mon
> that create succeeded.
> - pool pg_num increases, pg A splits into A+B in E2
> - layout changes, time passes, etc.  mon still doesn't realize A is
> created.
> - on some other osd, B is created and instantiated in E3
> - mon sends create for A to same osd
>
> The sharding on the OSD makes it very difficult for the osd to realize
> that it shouldn't actually instantiate A as of E1.  If it does, it will
> walk forward through maps and split into A+B in E2 and crash because pg B
> already exists.  I've gone around in circles several times and the best
> I've come up with is something that creates A, goes to do the split prep
> work, realizes B exists, and has to back up by removing A again.  It's
> very delicate and I'm pretty sure it still leaves open races if subsequent
> splits happen in the meantime (E4) or other children besides B try to peer
> onto the same OSD at the same time.  Let's call this option A.  Best case:
> very complex, hard to explain, hard to test, hard to understand.

Okay, no, the answer here is not for the OSD to somehow realize it
shouldn't create PG A as of epoch E1. That can't ever be an answer because
that's a global knowledge problem. I don't remember when we started
the monitor sending out "create PG B" messages after split happened
during create, but that's the problem — it is, as you've noticed,
fundamentally racy unless the "create B" message includes enough
information for the target OSD to know it has to do some kind of
peering process with the host of A. :(
(The mon is allowed to do the dual creates if it never sent out a create
message for the PG in question, but never afterwards! It sounds like
maybe this refactor and increased parallelism is exposing some issues
we've had for a while without noticing.)
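The interleaving above is easy to lose track of, so here is a toy model
of it (plain Python; `OSD`, `instantiate`, and the rest are invented
stand-ins for illustration, not Ceph code):

```python
# Toy model of the pg_create vs. split race. All names are
# illustrative stand-ins; none of this is actual Ceph code.

class OSD:
    def __init__(self):
        self.pgs = {}  # pg name -> epoch it was instantiated in

    def instantiate(self, pg, epoch):
        if pg in self.pgs:
            # The collision: replaying the E2 split of A into A+B
            # finds that B was already instantiated locally.
            raise RuntimeError(
                f"pg {pg} already exists (from e{self.pgs[pg]})")
        self.pgs[pg] = epoch

osd = OSD()
osd.instantiate("B", 3)   # E3: split child B is created on this OSD
osd.instantiate("A", 1)   # mon, unaware A succeeded, sends create-A@E1

# The OSD walks forward through maps and hits the E2 split A -> A+B:
try:
    osd.instantiate("B", 2)
except RuntimeError as e:
    print("race reproduced:", e)
```

The point being: by the time the OSD learns it should not have accepted
the create-A-as-of-E1 message, it has already instantiated A and has to
back out, which is exactly the delicate part of option A.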


> Greg, this is the 'can't increase pg_num while pool is creating' issue you
> called out during review.  It triggers a bunch of test failures because
> tests create pools and set pg_num without a step waiting for pool
> creation to quiesce.
>
> Option B: keep my original workaround to just disallow pg_num changes
> while pools create.  CLI would return EBUSY.  Tests would need to add a
> step that waits for pool create to finish before trying to adjust pg_num.
> Or, mon command could block (indefinitely!) waiting for it to happen (not
> keen on this one since OSDs might not be up and it could take forever).
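For concreteness, option B amounts to roughly this guard on the pg_num
path (a hypothetical sketch; the function shape and the `creating_pgs`
field are assumptions, not the real OSDMonitor code):

```python
import errno

def set_pg_num(pool, new_pg_num):
    """Option B sketch: reject pg_num changes while any PG of the
    pool is still creating. `pool` is a stand-in dict, not the real
    mon bookkeeping."""
    if pool["creating_pgs"]:
        return -errno.EBUSY, "pool is still creating PGs; retry later"
    pool["pg_num"] = new_pg_num
    return 0, ""

pool = {"pg_num": 8, "creating_pgs": {"1.0"}}
print(set_pg_num(pool, 16))   # busy: creation not quiesced yet
pool["creating_pgs"].clear()  # creation finished
print(set_pg_num(pool, 16))   # now succeeds
```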
>
> Option C: make the mon pay attention to splits and send the pg_create for
> A *and* B as of E2. This means the mon will provide the PastIntervals and
> pg_history_t.  Shifts complexity to the mon because it now has to pay
> close attention to splits and manage additional associated metadata.  The
> pg_create message is already changing as part of this branch
> (MOSDPGCreate2) so no compat issues; a luminous->mimic mixed cluster would
> be susceptible to the weird race conditions with racing pool create +
> split, though (not a concern, I think, since this is not something that
> really happens outside of stress testing).
>
> Option D: reconsider the way pools are created.  Instead of creating pool
> foo with 65536 PGs, create it with 1 pg.  Have the mon register a
> pg_num_target value and have 'set pg_num ...' modify this.  Once it (or
> the mgr) sees that the pg has successfully created itself, *then* increase
> pg_num in a controlled way, probably by jumping by powers of 2.  This is
> actually a more efficient way to create a large pool on a large
> cluster, probably, since we replace a linear process of the mon sending
> out pg create messages with an exponential osd<->osd operation that
> should be more like O(lg n).  It also shifts stress to the pg split
> path for all cases, which means (hopefully) more thorough testing/coverage
> and fewer code paths to worry about.
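The O(lg n) claim is just the doubling count; a quick sketch of the
schedule option D implies (illustrative arithmetic only, names invented):

```python
def doubling_rounds(target_pg_num, start=1):
    """Count pg_num doublings from `start` up to `target_pg_num`,
    assuming we jump by powers of 2 as the proposal suggests."""
    rounds = 0
    pg_num = start
    while pg_num < target_pg_num:
        pg_num *= 2
        rounds += 1
    return rounds

# 65536 PGs: the current path has the mon send ~65536 creates,
# while the split path needs log2(65536) doubling rounds, each of
# which fans out across the OSDs in parallel.
print(doubling_rounds(65536))  # 16
```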
>
> I'm leaning toward option D because I think it also fits neatly into what
> the mgr will do with pg_num reductions/merge anyway.  It also neatly
> separates out the intended/target pg_num from the actual pg_num the OSDs
> are working to realize.

Advantage of option D: as noted, it stresses an otherwise rare case
and works nicely with auto-scaling PGs that we want to do anyway.
Disadvantage: it's significantly less compatible during cluster
upgrades and is a big change to how pool creation works. It's not
clear to me if this is something admins will notice in a meaningful
way. It sounds like you want to use the PG auto-split/merge John has
been doing, which I believe lives in the manager. That's a big new
dependency, and I'm not sure we should be comfortable with it.

On first read I gravitate toward a variant of option C. As I said
before, we can't update the "create A" message to a new epoch unless
we've never sent out an older create. But we should be able to do
something with the "create B" message, perhaps by, uh, not sending it.
I spent some time looking at how this works now and I'm just not very
familiar with the new OSDMapMapping and pg create stuff so you can
probably answer those questions more quickly than I can. But if it's
not too complicated, it has some notable advantages: more compatible
with older monitors, probably a fix we can backport, doesn't radically
change how pool creates work for admins at the end of a cycle (this is
one I'm a bit worried about), and lives in the monitor without outside
dependencies.
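If it helps to pin down the "not sending it" variant: the mon-side
filter could be as small as this (a hypothetical sketch; the real
bookkeeping lives in the OSDMonitor/OSDMapMapping code I just said I
don't know well, so every name here is an assumption):

```python
def creates_to_send(pending_creates, acked):
    """Suppress 'create B' for split children whose parent's create
    hasn't been acknowledged yet; the child gets instantiated by the
    parent's split instead. All structures are stand-ins."""
    out = []
    for pg in pending_creates:
        parent = pg.get("parent")
        if parent is not None and parent not in acked:
            continue  # parent create still in flight; let the split make B
        out.append(pg["name"])
    return out

pending = [
    {"name": "1.0", "parent": None},
    {"name": "1.1", "parent": "1.0"},  # split child of 1.0
]
print(creates_to_send(pending, acked=set()))    # ['1.0']
print(creates_to_send(pending, acked={"1.0"}))  # ['1.0', '1.1']
```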
-Greg


