Re: creating pools/pgs vs split

On Thu, 5 Apr 2018, Sage Weil wrote:
> On Thu, 5 Apr 2018, Gregory Farnum wrote:
> > On Thu, Apr 5, 2018 at 6:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > I'm stuck on an annoying issue with wip-peering-fast-dispatch.  There is a
> > > race between handling of pg_create messages from the mon (pool creation)
> > > and pg splits.  It goes like this:
> > >
> > > - pool is created in epoch E1
> > > - pg A is created somewhere
> > > - it creates on an osd, peers, is happy, but does not report back to mon
> > > that create succeeded.
> > > - pool pg_num increases, pg A splits into A+B in E2
> > > - layout changes, time passes, etc.  mon still doesn't realize A is
> > > created.
> > > - on some other osd, B is created and instantiated in E3
> > > - mon sends create for A to same osd

A missing piece of the explanation is what is different between 
wip-peering-fast-dispatch and master.  It turns out master is also broken, 
but in a different way:

http://tracker.ceph.com/issues/22165

There, we create A in E3 with a pre-generated history.  This leads to a 
different failure, in which child B is never created, through a 
different sequence of events:

- pg A has not been created yet
- osd went down
- split
- osd comes up
- osd gets pg_create on A
- does not process a split, does not get pg_create for B

As a result, the pool create never finishes.  The user has to delete the 
pool and try again.
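
To make that sequence concrete, here is a toy model of it.  This is not 
Ceph code: the function name, the splits table, and the epoch numbers are 
made up for illustration, and real split handling is far more involved.  
It only shows that a pg instantiated at the current epoch never walks 
past the map in which the split happened.

```python
def pgs_after_create(instantiate_epoch, current_epoch, splits):
    """Return the pgs an osd ends up with after instantiating pg "A".

    splits maps an epoch to a (parent, child) pair (hypothetical layout).
    The pg only processes maps *after* the epoch it is instantiated in.
    """
    pgs = {"A"}
    for epoch in range(instantiate_epoch + 1, current_epoch + 1):
        if epoch in splits:
            parent, child = splits[epoch]
            if parent in pgs:
                pgs.add(child)
    return pgs


# Assumed timeline: A splits into A+B in epoch 2, the osd is at epoch 3.
SPLITS = {2: ("A", "B")}

# Instantiating A at epoch 3 skips the split map entirely.
master_style = pgs_after_create(3, 3, SPLITS)

# Instantiating A at the pool creation epoch sees the split.
from_creation_epoch = pgs_after_create(1, 3, SPLITS)

print(sorted(master_style))
print(sorted(from_creation_epoch))
```

In the first call nothing ever creates B, which is the stuck pool create 
described above.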

There is also an unrelated bug in master where the mon is told the pg is 
created as soon as the primary osd queues the create.  If the create then 
fails, it is never retried, so the pg create never finishes.  The 
wip-peering-fast-dispatch branch fixes this by only acking the create 
after the pg has activated.  (This is probably partly why we see the 
original scenario: we are more careful about when we tell the mon the pg 
is created, which means that resent pg_create messages are more common.)
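
The difference in ack timing can be sketched roughly like this.  Again 
this is a toy, not the real mon/osd code: the function name, the 
"queue"/"activate" labels, and the pending set are assumptions made for 
the example.

```python
def mon_still_retries(ack_on, create_succeeds):
    """Return True if the mon still has pg "A" on its pending-create list.

    ack_on is "queue" (ack when the create is queued) or "activate"
    (ack only once the pg has activated).  Both labels are hypothetical.
    """
    pending = {"A"}

    # Stage 1: the primary osd queues the create.
    if ack_on == "queue":
        pending.discard("A")  # mon is told "created" right away

    # Stage 2: the pg activates, but only if the create did not fail.
    if create_succeeds and ack_on == "activate":
        pending.discard("A")

    return "A" in pending


# Early ack: the create failed, yet the mon has stopped retrying.
early_ack_failed = mon_still_retries("queue", create_succeeds=False)

# Late ack: the create failed, so the mon keeps resending pg_create.
late_ack_failed = mon_still_retries("activate", create_succeeds=False)

print(early_ack_failed, late_ack_failed)
```

With the late ack the pg stays on the pending list for longer, which is 
why resent pg_create messages show up more often.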

wip-peering-fast-dispatch has a different PG creation approach: we 
instantiate the pg (A) in the original pool creation epoch, and let it 
roll forward through maps and peering and everything else.  Peering is 
already very robust so this captures all of the splits and prior osds and 
so on... it just doesn't expect a split child to already exist.
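
A rough sketch of where this goes wrong, under the same caveats as any 
toy model: the function name, the splits table, and the conflict list 
are invented here, and nothing below reflects the actual OSD data 
structures.  The pg is instantiated in the creation epoch and rolled 
forward, and the split finds that its child is already on the osd.

```python
def roll_forward(existing_pgs, create_epoch, current_epoch, splits):
    """Instantiate pg "A" at create_epoch and roll it forward through maps.

    Returns (pgs, conflicts).  conflicts lists (epoch, child) pairs where
    a split wanted to create a child that the osd already had.
    """
    pgs = set(existing_pgs)
    pgs.add("A")
    conflicts = []
    for epoch in range(create_epoch + 1, current_epoch + 1):
        if epoch in splits:
            parent, child = splits[epoch]
            if parent == "A":
                if child in pgs:
                    # The case the roll-forward does not expect.
                    conflicts.append((epoch, child))
                else:
                    pgs.add(child)
    return pgs, conflicts


# Assumed timeline: pool created in epoch 1, A splits into A+B in epoch 2.
SPLITS = {2: ("A", "B")}

# Clean osd: the split creates B as part of rolling forward.
clean_pgs, clean_conflicts = roll_forward(set(), 1, 3, SPLITS)

# Osd that already instantiated B in epoch 3, then gets pg_create for A.
race_pgs, race_conflicts = roll_forward({"B"}, 1, 3, SPLITS)

print(clean_conflicts)
print(race_conflicts)
```

How that collision should be resolved is the open question; the sketch 
only flags it.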

sage