Re: single-threaded seastar-osd

On Tue, 8 Jan 2019, kefu chai wrote:
> On Sun, Jan 6, 2019 at 2:27 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > Hi Kefu,
> >
> > On Sat, Jan 5, 2019 at 6:42 AM kefu chai <tchaikov@xxxxxxxxx> wrote:
> > >
> > > as you might know, seastar encourages a share-nothing programming
> > > paradigm. as in previous discussions, we found that there is always
> > > some cross-core communication in the sharded seastar-osd, because
> > > a couple of pieces of infrastructure are shared by a sharded OSD,
> > > namely:
> > >
> > > - osdmap cache
> > > - connections to peer OSDs, and heartbeats with them
> > > - connections to the monitor and mgr, and beacon/reports to them
> > > - i/o to the underlying objectstore
> > >
> > > recently, while working on the cross-core messenger[0], we found
> > > that, in order to share a connection between cores, we need types
> > > like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>",
> > > because
> > > - the connections to peer OSDs are shared across cores,
> > > - the connections are shared by multiple continuations on the local
> > > core -- either locally or remotely.
> >
> > I'm not up-to-speed on all the proposals here but it seems to me the
> > simple solution for the peer OSD connection is to have multiple
> > connections between peer OSDs (one for each PG?). Is that not feasible
> > for some reason?
> 
> i guess you meant one for each peer OSD. yeah, that's the 1-to-1
> mapping proposal where the TCP connections are not shared.
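
(For concreteness, the handle type Kefu mentions above would look roughly
like the following; Connection/ConnectionRef are stand-ins here, not the
actual crimson classes:)

  // hedged sketch only -- illustrates why both wrappers are needed for a
  // connection owned by a remote core but shared by several continuations
  // on the local core.
  #include <seastar/core/shared_ptr.hh>
  #include <seastar/core/sharded.hh>  // foreign_ptr, make_foreign
                                      // (header layout varies by seastar version)

  struct Connection;                                     // stand-in type
  using ConnectionRef = seastar::shared_ptr<Connection>;

  // foreign_ptr: the wrapped ConnectionRef is still owned (and destroyed)
  // on its home shard.
  // lw_shared_ptr: lets multiple continuations on *this* shard share the
  // same foreign handle without copying it back and forth.
  using CrossCoreConnRef =
      seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>;

  // constructed on the connection's home shard, e.g.:
  //   auto ref = seastar::make_lw_shared(
  //       seastar::make_foreign(ConnectionRef(...)));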

I think Patrick's point is a bit different.

The idea with the inter-OSD connections in general is that, in the limit as 
your cluster size grows, you still have a bounded number of peers 
(acting.size() * num pgs).  In that case, it would make sense for the PGs 
themselves to own the OSD connections, instead of the current code where 
messages are sent and dispatched via somewhat complicated code in OSD.cc.  
The user-visible change would be that even for small clusters you'd have 
lots of connections.
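
A rough sketch of what PG-owned connections could look like (hypothetical
types, not existing Ceph code):

  #include <map>
  #include <memory>
  #include <vector>

  struct Connection;                        // stand-in for a messenger connection
  struct Message;
  using ConnectionRef = std::shared_ptr<Connection>;
  using MessageRef    = std::shared_ptr<Message>;

  struct PG {
    std::vector<int> acting;                  // acting-set osd ids
    std::map<int, ConnectionRef> peer_conns;  // one connection per peer osd,
                                              // owned by this PG

    void send_to_peer(int osd, MessageRef m); // uses peer_conns[osd] directly,
                                              // bypassing OSD.cc dispatch
  };

  // total connections per OSD is then bounded by roughly
  // num_pgs * (acting.size() - 1), independent of cluster size.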

To accomplish this, all sorts of things would need to change, though:

- connections would be established and then ownership passed to the PGs 
based on some message from the originator.  The connection race complexity 
would get even weirder (probably resolved at the PG layer instead of in 
the messenger itself?)

- the dispatch architecture would/could totally change (again) since we'd 
know what PG a message is for based on the connection alone, without even 
looking at the message (see the sketch after this list).

- all of the cases where we aggregate messages across PGs would go away.  
There aren't many of these left, though, so that's probably fine.

- osdmap sharing would get a bit weirder than it already is.
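
To make the dispatch point above concrete, here is a hedged sketch (stand-in
types again, and the PG-side hook is hypothetical) of mapping a connection
straight to its owning PG:

  #include <map>
  #include <memory>

  struct Connection;
  struct Message;
  struct PG;
  using ConnectionRef = std::shared_ptr<Connection>;
  using MessageRef    = std::shared_ptr<Message>;

  struct Dispatcher {
    // filled in when a PG takes ownership of a connection
    std::map<ConnectionRef, PG*> conn_to_pg;

    void dispatch(ConnectionRef con, MessageRef m) {
      // the target PG is known from the connection alone; no need to
      // decode the message to find it
      if (auto it = conn_to_pg.find(con); it != conn_to_pg.end()) {
        // it->second->enqueue(std::move(m));   // hypothetical PG-side hook
      }
    }
  };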

I don't think seastar would change the above too significantly as long as 
it allows you to accept a connection on one core and then pass ownership 
off to another.  But from Kefu's other reply on this thread that 
sounds problematic.  Perhaps, as a workaround, the OSDMap could have 
endpoints for each of the N cores the OSD is using, so that connections 
come in on the right one?
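
For reference, roughly what accept-on-one-core / hand-off-to-another could
look like in seastar (again with stand-in types, and the PG-side hook name
is hypothetical):

  #include <utility>
  #include <seastar/core/future.hh>
  #include <seastar/core/shared_ptr.hh>
  #include <seastar/core/sharded.hh>   // foreign_ptr, make_foreign
  #include <seastar/core/smp.hh>       // smp::submit_to
                                       // (header layout varies by seastar version)

  struct Connection;
  using ConnectionRef = seastar::shared_ptr<Connection>;

  // runs on the accepting shard: wrap the new connection and move the
  // handle to the shard that owns the target PG
  seastar::future<> hand_off(ConnectionRef conn, unsigned target_shard) {
    return seastar::smp::submit_to(target_shard,
        [fconn = seastar::make_foreign(std::move(conn))] () mutable {
          // now running on target_shard; the wrapped ConnectionRef will
          // still be destroyed back on the accepting shard, which is the
          // cross-core cost being discussed
          // pg_take_ownership(std::move(fconn));  // hypothetical hook
          return seastar::make_ready_future<>();
        });
  }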

In order to get to this model, though, there is a ton of other work needed 
in RADOS itself (the protocol, not the implementation), so I wouldn't want 
to go down this road unless we're pretty confident it's going to help.

sage


