Re: creating pools/pgs vs split

On Thu, 5 Apr 2018, Sage Weil wrote:
> On Thu, 5 Apr 2018, Sage Weil wrote:
> > On Thu, 5 Apr 2018, Gregory Farnum wrote:
> > > On Thu, Apr 5, 2018 at 6:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > > I'm stuck on an annoying issue with wip-peering-fast-dispatch.  There is a
> > > > race between handling of pg_create messages from the mon (pool creation)
> > > > and pg splits.  It goes like this:
> > > >
> > > > - pool is created in epoch E1
> > > > - pg A is created somewhere
> > > > - it creates on an osd, peers, is happy, but does not report back to mon
> > > > that create succeeded.
> > > > - pool pg_num increases, pg A splits into A+B in E2
> > > > - layout changes, time passes, etc.  mon still doesn't realize A is
> > > > created.
> > > > - on some other osd, B is created and instantiated in E3
> > > > - mon sends create for A to same osd
> 
> A missing piece of the explanation is what is different between 
> wip-peering-fast-dispatch and master.  It turns out master is also broken, 
> but in a different way:
> 
> http://tracker.ceph.com/issues/22165
> 
> There, we create A with E3 and a pre-generated history.  This leads 
> to a different failure, where child B is never created, via a 
> different sequence of events:
> 
> - pg A has not been created yet
> - osd went down
> - split
> - osd comes up
> - osd gets pg_create on A
> - does not process a split, does not get pg_create for B
> 
> so the pool create basically never finishes.  The user has to delete the 
> pool and try again.
> 
> There is also an unrelated bug in master where the mon is told the pg is 
> created when the primary osd queues the create, which means it could fail 
> and the PG create doesn't get retried (and is never finished).  The 
> wip-peering-fast-dispatch branch fixes this by only acking the create 
> after the pg has activated.  (This is probably partly why we see the original 
> scenario: we are more careful to tell the mon the pg is created, which 
> means that resent pg_create messages are more common.)
> 
> wip-peering-fast-dispatch has a different PG creation approach: we 
> instantiate the pg (A) in the original pool creation epoch, and let it 
> roll forward through maps and peering and everything else.  Peering is 
> already very robust so this captures all of the splits and prior osds and 
> so on... it just doesn't expect a split child to already exist.
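
As a rough illustration of that last point, here is a toy model (not Ceph 
code) of instantiating A at the pool creation epoch and replaying later maps.  
Everything in it is an assumption made for the sketch: the function names, 
the dict of epoch -> pg_num, the set standing in for an OSD's PGs, and the 
restriction to power-of-two pg_num values so a child's parent is just 
child % old_pg_num.

```python
def split_children(pg, old_pg_num, new_pg_num):
    """Children of `pg` when pg_num grows from old to new.

    Simplification: only valid when both values are powers of two.
    """
    return [c for c in range(old_pg_num, new_pg_num) if c % old_pg_num == pg]


def create_and_roll_forward(hosted, pg, pg_num_by_epoch, created_epoch):
    """Instantiate `pg` at `created_epoch`, then replay every later map.

    `hosted` is the set of PG ids already on this toy OSD; it is updated
    in place. Returns (epoch, child) pairs for split children of `pg`
    that were already present when the replay reached their split.
    """
    collisions = []
    hosted.add(pg)
    epochs = sorted(e for e in pg_num_by_epoch if e >= created_epoch)
    for prev, cur in zip(epochs, epochs[1:]):
        old, new = pg_num_by_epoch[prev], pg_num_by_epoch[cur]
        for child in split_children(pg, old, new):
            if child in hosted:
                collisions.append((cur, child))
            hosted.add(child)
    return collisions


# E1: pool created with pg_num=1; E2: pg_num raised to 2; E3: layout change.
maps = {1: 1, 2: 2, 3: 2}
osd_pgs = {1}  # B (pg 1) was already instantiated on this OSD in E3
print(create_and_roll_forward(osd_pgs, 0, maps, created_epoch=1))  # [(2, 1)]
```

The non-empty result is the case described above: replaying the E2 split for 
A finds that B is already there.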

So,

Option E (? I'm losing count): make pg create behave as it does on master, where 
we generate a history and instantiate the PG with the latest map.  
Children due to intervening splits aren't created, so the user might have 
to recreate the pool.  The downside is that this (loading maps and 
generating history) happens holding the shard lock.
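
A toy sketch of what Option E gives up (again, not Ceph code): the PG is 
instantiated directly at the newest epoch and no split is processed, so the 
children that intervening splits would have produced are simply not created 
by this path.  The names, the epoch -> pg_num dict, and the power-of-two 
pg_num assumption are all made up for the illustration, and the history 
generation step is left out entirely.

```python
def split_children(pg, old_pg_num, new_pg_num):
    """Children of `pg` when pg_num grows (power-of-two values only)."""
    return [c for c in range(old_pg_num, new_pg_num) if c % old_pg_num == pg]


def create_at_latest_map(pg, pg_num_by_epoch, created_epoch):
    """Instantiate `pg` at the newest epoch without processing any split.

    Returns (latest_epoch, instantiated_pgs, skipped_children), where
    skipped_children are the PGs a replay from `created_epoch` would
    have split off but that this path never creates.
    """
    latest = max(pg_num_by_epoch)
    instantiated = {pg}
    skipped = split_children(
        pg, pg_num_by_epoch[created_epoch], pg_num_by_epoch[latest]
    )
    return latest, instantiated, skipped


# E1: pool created with pg_num=1; E2: pg_num raised to 2; E3: layout change.
maps = {1: 1, 2: 2, 3: 2}
print(create_at_latest_map(0, maps, created_epoch=1))  # (3, {0}, [1])
```

In this model nothing else creates the skipped child, which corresponds to 
the case where the user has to recreate the pool.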

Option F: make the mon continue to send the legacy create messages for the 
time being, which still do what master does.

That lets us kick the can down the road a bit further to the pg merging 
branch, which probably needs something like D anyway.

sage