On Tue, Jan 8, 2019 at 9:16 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Tue, 8 Jan 2019, kefu chai wrote:
> > On Sun, Jan 6, 2019 at 2:27 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > Hi Kefu,
> > >
> > > On Sat, Jan 5, 2019 at 6:42 AM kefu chai <tchaikov@xxxxxxxxx> wrote:
> > > >
> > > > as you might know, seastar encourages a share-nothing programming
> > > > paradigm. as we found in previous discussions, there is always some
> > > > cross-core communication in a sharded seastar-osd, because a couple
> > > > of pieces of infrastructure are shared by the whole OSD, namely:
> > > >
> > > > - the osdmap cache
> > > > - connections to peer OSDs, and heartbeats with them
> > > > - connections to the monitor and mgr, and beacons/reports to them
> > > > - i/o to the underlying objectstore
> > > >
> > > > recently, while working on the cross-core messenger[0], we found
> > > > that, in order to share a connection between cores, we need types
> > > > like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>",
> > > > because
> > > > - the connections to peer OSDs are shared across cores, and
> > > > - the connections are shared by multiple continuations on the local
> > > > core -- either locally or remotely.
> > >
> > > I'm not up-to-speed on all the proposals here but it seems to me the
> > > simple solution for the peer OSD connection is to have multiple
> > > connections between peer OSDs (one for each PG?). Is that not feasible
> > > for some reason?
> >
> > i guess you meant one for each peer OSD. yeah, that's the 1-to-1
> > mapping proposal where the TCP connections are not shared.
>
> I think Patrick's point is a bit different.
>
> The idea with the inter-OSD connections in general is that, in the limit
> as your cluster size grows, you still have a bounded number of peers
> (acting.size() * num pgs). In that case, it would make sense for the PGs
> themselves to own the OSD connections, instead of the current code where
> messages are sent and dispatched via somewhat complicated code in OSD.cc.
> The user-visible change would be that even for small clusters you'd have
> lots of connections.
>
> To accomplish this, all sorts of things would need to change, though:
>
> - connections would be established and then ownership passed to the PGs
> based on some message from the originator. the connection race complexity
> would get even weirder (probably resolved at the PG layer instead of in
> the messenger itself?)
>
> - the dispatch architecture would/could totally change (again), since
> we'd know which PG a message is for based on the connection, without
> even looking at the message.
>
> - all of the cases where we aggregate messages across PGs would go away.
> there aren't many of these left, though, so that's probably fine.
>
> - osdmap sharing would get a bit weirder than it already is.
>
> I don't think seastar would change the above too significantly as long
> as it allows you to accept a connection on one core and then pass
> ownership off to another. But from Kefu's other reply on this thread,
> that sounds problematic. Perhaps, as a workaround, the OSDMap could have
> endpoints for each of the N cores the OSD is using, so that connections
> come in on the right one?
>
> In order to get to this model, though, there is a ton of other work
> needed in RADOS itself (the protocol, not the implementation), so I
> wouldn't want to go down this road unless we're pretty confident it's
> going to help.

oh, i see. it reminds me of what Haomai suggested when we were discussing
the cross-core communication: to avoid it, one alternative is to attach
different PGs to different ports and listen on all of them, instead of
listening on a single public port -- roughly along the lines of the first
sketch below. yeah, that's a very big change.
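to make that concrete: a minimal sketch of the per-port idea, not
crimson code. listen_per_shard is a hypothetical name, and the calls
(smp::invoke_on_all, this_shard_id, listen, keep_doing) assume a recent
seastar API:

```cpp
#include <seastar/core/do_with.hh>
#include <seastar/core/loop.hh>     // seastar::keep_doing
#include <seastar/core/seastar.hh>  // seastar::listen
#include <seastar/core/smp.hh>
#include <seastar/net/api.hh>

// every shard opens its own listening socket on base_port + shard id,
// so a peer that wants to reach the PGs owned by shard N dials that
// port and the connection never has to cross cores.
seastar::future<> listen_per_shard(uint16_t base_port) {
  return seastar::smp::invoke_on_all([base_port] {
    const uint16_t port = base_port + seastar::this_shard_id();
    seastar::listen_options lo;
    lo.reuse_address = true;
    return seastar::do_with(
        seastar::listen(seastar::make_ipv4_address({port}), lo),
        [] (seastar::server_socket& listener) {
          return seastar::keep_doing([&listener] {
            return listener.accept().then(
                [] (seastar::accept_result ar) {
                  // the accepted socket lives and dies on this shard;
                  // dispatch to the shard-local PGs would go here
                });
          });
        });
  });
}
```

the cost is exactly what's noted above: the OSDMap (or some other
directory) has to advertise N endpoints per OSD instead of one.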
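and for contrast, staying with a single shared port is what forces
handles like the lw_shared_ptr<foreign_ptr<...>> type quoted above.
a minimal sketch of that idiom -- Conn, ConnRef, and send_on_owner are
hypothetical stand-ins, not the crimson types:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>     // seastar::foreign_ptr
#include <seastar/core/shared_ptr.hh>  // seastar::lw_shared_ptr
#include <seastar/core/smp.hh>

// hypothetical stand-in for what ConnectionRef points at
struct Conn {
  seastar::future<> send(int msg) {
    // real code would serialize msg and write it to the socket
    return seastar::make_ready_future<>();
  }
};

// foreign_ptr pins the connection (and its destruction) to the core
// that created it; the outer lw_shared_ptr lets multiple continuations
// on the *local* core share the one handle.
using ConnRef = seastar::lw_shared_ptr<
    seastar::foreign_ptr<seastar::lw_shared_ptr<Conn>>>;

seastar::future<> send_on_owner(ConnRef conn, int msg) {
  // only a raw pointer crosses cores: lw_shared_ptr's refcount is not
  // atomic, so copies of it must never be made or dropped off-core
  Conn* raw = &**conn;
  return seastar::smp::submit_to(conn->get_owner_shard(),
                                 [raw, msg] { return raw->send(msg); })
      .finally([conn] {
        // hold our local reference until the remote send completes;
        // this copy is created and destroyed on the home core only
      });
}
```

note that every send still pays a submit_to hop to the owner core,
which is the cross-core traffic this whole thread is trying to avoid.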
> sage

-- 
Regards
Kefu Chai