Re: single-threaded seastar-osd

On Tue, 8 Jan 2019, kefu chai wrote:
> On Tue, Jan 8, 2019 at 9:16 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > On Tue, 8 Jan 2019, kefu chai wrote:
> > > On Sun, Jan 6, 2019 at 2:27 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Kefu,
> > > >
> > > > On Sat, Jan 5, 2019 at 6:42 AM kefu chai <tchaikov@xxxxxxxxx> wrote:
> > > > >
> > > > > as you might know, seastar encourages a share-nothing programming
> > > > > paradigm. as we found in previous discussions, there is always some
> > > > > cross-core communication in the sharded seastar-osd, because a
> > > > > couple of pieces of infrastructure are shared by a sharded OSD,
> > > > > namely:
> > > > >
> > > > > - osdmap cache
> > > > > - connection to peer OSDs, and heartbeats with them
> > > > > - connection to monitor and mgr, and beacon/reports to them
> > > > > - i/o to the underlying objectstore
> > > > >
> > > > > recently, when we were working on the cross-core messenger[0], we
> > > > > found that, in order to share a connection between cores, we need
> > > > > types like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>",
> > > > > because
> > > > > - the connections to peer OSDs are shared across cores, and
> > > > > - each connection is shared by multiple continuations on the local
> > > > > core -- whether the connection is owned locally or remotely.
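> > > > >
> > > > > a minimal sketch of why the nesting looks like that (Connection
> > > > > here is a stand-in for our connection type; the seastar types and
> > > > > calls are the real ones):
> > > > >
> > > > > #include <seastar/core/shared_ptr.hh>
> > > > > #include <seastar/core/sharded.hh>
> > > > > #include <utility>
> > > > >
> > > > > struct Connection {};
> > > > > using ConnectionRef = seastar::lw_shared_ptr<Connection>;
> > > > >
> > > > > // foreign_ptr: the connection stays owned by the core that created
> > > > > // it, and destruction is forwarded back to that core.
> > > > > // the outer lw_shared_ptr: multiple continuations on the *local*
> > > > > // core can hold the same (possibly remote) connection.
> > > > > using XcoreConnRef =
> > > > >     seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>;
> > > > >
> > > > > XcoreConnRef share(ConnectionRef conn) {
> > > > >   return seastar::make_lw_shared(
> > > > >       seastar::make_foreign(std::move(conn)));
> > > > > }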
> > > >
> > > > I'm not up-to-speed on all the proposals here but it seems to me the
> > > > simple solution for the peer OSD connection is to have multiple
> > > > connections between peer OSDs (one for each PG?). Is that not feasible
> > > > for some reason?
> > >
> > > i guess you meant one for each peer OSD. yeah, that's the 1-to-1
> > > mapping proposal where the TCP connections are not shared.
> >
> > I think Patrick's point is a bit different.
> >
> > The idea with the inter-OSD connections in general is that, in the limit
> > as your cluster size grows, you still have a bounded number of peers
> > (acting.size() * num pgs).  In that case, it would make sense for the PGs
> > themselves to own the OSD connections, instead of the current code where
> > messages are sent and dispatched via somewhat complicated code in OSD.cc.
> > The user-visible change would be that even for small clusters you'd have
> > lots of connections.
> >
> > To accomplish this, all sorts of things would need to change, though:
> >
> > - connections would be established and then ownership passed to the PGs
> > based on some message from the originator.  the connection race complexity
> > would get even weirder (probably resolved at the PG layer instead of in
> > the messenger itself?)
> >
> > - the dispatch architecture would/could totally change (again), since we'd
> > know what PG a message is for based on its connection, without even looking
> > at the message.
> >
> > - all of the cases where we aggregate messages across PGs would go away.
> > there aren't many of these left, though, so that's probably fine.
> >
> > - osdmap sharing would get a bit weirder than it already is.
> >
> > I don't think seastar would change the above too significantly as long as
> > it allows you to accept a connection on one core and then pass ownership
> > off to another.  But from Kefu's other reply on this thread that
> > sounds problematic.  Perhaps the OSDMap could have endpoints for each
> > of the N cores the OSD is using so that connections come in on the right
> > one as a workaround?
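> >
> > To sketch the handoff I have in mind (Session and the shard choice are
> > made up for illustration; whether a live seastar socket can be moved
> > between reactors at all is exactly the problematic part):
> >
> > #include <seastar/core/future.hh>
> > #include <seastar/core/sharded.hh>
> > #include <seastar/core/smp.hh>
> > #include <memory>
> > #include <utility>
> >
> > struct Session {};  // stand-in for per-connection state
> >
> > // accepted on this core; pass ownership of the session state to the
> > // core hosting the PG.  foreign_ptr forwards destruction back to the
> > // accepting core.
> > seastar::future<> hand_off(unsigned pg_shard,
> >                            std::unique_ptr<Session> s) {
> >   return seastar::smp::submit_to(pg_shard,
> >       [fs = seastar::make_foreign(std::move(s))] () mutable {
> >         // now running on pg_shard: attach the session to the PG here
> >         return seastar::make_ready_future<>();
> >       });
> > }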
> >
> > In order to get to this model, though, there is a ton of other work needed
> > in RADOS itself (the protocol, not the implementation), so I wouldn't want
> > to go down this road unless we're pretty confident it's going to help.
> >
> 
> oh, i see. it reminds me of what Haomai suggested when we were discussing
> cross-core communication: to avoid it, one alternative is to attach
> different PGs to different ports, and listen on those instead of on a
> single public port. yeah, that's a very big change.
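>
> something along these lines, i.e. every shard listens on its own port
> (the port scheme is made up for illustration; the seastar calls are
> real):
>
> #include <seastar/core/future.hh>
> #include <seastar/core/reactor.hh>
> #include <seastar/core/smp.hh>
> #include <seastar/net/api.hh>
> #include <cstdint>
>
> // each shard listens on base_port + its shard id, so a peer OSD can
> // connect directly to the core hosting the PG it wants to talk to,
> // and nothing needs to be forwarded across cores on the accept path.
> seastar::future<> listen_on_all_shards(uint16_t base_port) {
>   return seastar::smp::invoke_on_all([base_port] {
>     uint16_t port = base_port + seastar::this_shard_id();
>     seastar::listen_options lo;
>     lo.reuse_address = true;
>     auto ss = seastar::listen(
>         seastar::socket_address(seastar::ipv4_addr{port}), lo);
>     // ... run this shard's accept loop on ss here ...
>     return seastar::make_ready_future<>();
>   });
> }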

It doesn't have any obvious problems (aside from being quite a bit of 
work).  Perhaps this is part of the escape route if we go down the 
confine-osd-to-single-core path and find that we need multiple cores for 
some storage devices.

I'm worried that the bigger issue, though, will be making the ObjectStore 
implementation handle incoming transactions from multiple cores.  None of 
the SeaStore handwaving so far on this point has been super comforting.  
And even an initial memstore-like implementation runs into this issue 
immediately.
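
To make that concrete, here is roughly what every submitting core ends up 
doing if a single core owns the store's state (Transaction and 
apply_locally are stand-ins, not the real ObjectStore interface):

#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <utility>

struct Transaction {};  // stand-in for an ObjectStore transaction

// stand-in for applying a transaction on the owning core
seastar::future<> apply_locally(Transaction txn) {
  return seastar::make_ready_future<>();
}

// every other core bounces its transactions to store_shard, which is
// exactly the cross-core traffic we were trying to avoid
seastar::future<> queue_transaction(unsigned store_shard,
                                    Transaction txn) {
  return seastar::smp::submit_to(store_shard,
      [txn = std::move(txn)] () mutable {
        return apply_locally(std::move(txn));
      });
}

So either the backend itself is sharded (which needs a data partitioning 
story) or one core ends up serializing the whole apply path.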

sage


