On Tue, Jan 8, 2019 at 9:16 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Tue, 8 Jan 2019, kefu chai wrote:
> > On Sun, Jan 6, 2019 at 2:27 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > Hi Kefu,
> > >
> > > On Sat, Jan 5, 2019 at 6:42 AM kefu chai <tchaikov@xxxxxxxxx> wrote:
> > > >
> > > > as you might know, seastar encourages a share-nothing programming
> > > > paradigm. as we found in previous discussions, there is always some
> > > > cross-core communication in a sharded seastar-osd, because a couple
> > > > of pieces of infrastructure are shared by the whole OSD, namely:
> > > >
> > > > - the osdmap cache
> > > > - connections to peer OSDs, and heartbeats with them
> > > > - connections to the monitor and mgr, and beacons/reports to them
> > > > - i/o to the underlying objectstore
> > > >
> > > > recently, while working on the cross-core messenger[0], we found
> > > > that, in order to share a connection between cores, we need types
> > > > like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>",
> > > > because
> > > > - the connections to peer OSDs are shared across cores, and
> > > > - the connections are shared by multiple continuations on the local
> > > > core -- either locally or remotely.
> > >
> > > I'm not up-to-speed on all the proposals here but it seems to me the
> > > simple solution for the peer OSD connection is to have multiple
> > > connections between peer OSDs (one for each PG?). Is that not feasible
> > > for some reason?
> >
> > i guess you meant one for each peer OSD. yeah, that's the 1-to-1
> > mapping proposal where the TCP connections are not shared.
>
> I think Patrick's point is a bit different.
>
> The idea with the inter-OSD connections in general is that, in the limit
> as your cluster size grows, you still have a bounded number of peers
> (acting.size() * num pgs). In that case, it would make sense for the PGs
> themselves to own the OSD connections, instead of the current code where
> messages are sent and dispatched via somewhat complicated code in OSD.cc.
> The user-visible change would be that even for small clusters you'd have
> lots of connections.
>
> To accomplish this, all sorts of things would need to change, though:
>
> - connections would be established and then ownership passed to the PGs
> based on some message from the originator. the connection race complexity
> would get even weirder (probably resolved at the PG layer instead of in
> the messenger itself?)
>
> - the dispatch architecture would/could totally change (again), since
> we'd know which PG a message is for based on the connection, without
> even looking at the message.
>
> - all of the cases where we aggregate messages across PGs would go away.
> there aren't many of these left, though, so that's probably fine.
>
> - osdmap sharing would get a bit weirder than it already is.
>
> I don't think seastar would change the above too significantly as long
> as it allows you to accept a connection on one core and then pass
> ownership off to another. But from Kefu's other reply on this thread,
> that sounds problematic. Perhaps, as a workaround, the OSDMap could have
> endpoints for each of the N cores the OSD is using, so that connections
> come in on the right one?
>
> In order to get to this model, though, there is a ton of other work
> needed in RADOS itself (the protocol, not the implementation), so I
> wouldn't want to go down this road unless we're pretty confident it's
> going to help.

oh, i see. it reminds me of what Haomai suggested when we were discussing
the cross-core communication: to avoid it, one alternative is to attach
different PGs to different ports and listen on all of them, instead of
listening on a single public port -- roughly along the lines of the first
sketch below. yeah, that's a very big change.
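to make that concrete: a minimal sketch of the per-port idea, not
crimson code. listen_per_shard is a hypothetical name, and the calls
(smp::invoke_on_all, this_shard_id, listen, keep_doing) assume a recent
seastar API:

```cpp
#include <seastar/core/do_with.hh>
#include <seastar/core/loop.hh>     // seastar::keep_doing
#include <seastar/core/seastar.hh>  // seastar::listen
#include <seastar/core/smp.hh>
#include <seastar/net/api.hh>

// every shard opens its own listening socket on base_port + shard id,
// so a peer that wants to reach the PGs owned by shard N dials that
// port and the connection never has to cross cores.
seastar::future<> listen_per_shard(uint16_t base_port) {
  return seastar::smp::invoke_on_all([base_port] {
    const uint16_t port = base_port + seastar::this_shard_id();
    seastar::listen_options lo;
    lo.reuse_address = true;
    return seastar::do_with(
        seastar::listen(seastar::make_ipv4_address({port}), lo),
        [] (seastar::server_socket& listener) {
          return seastar::keep_doing([&listener] {
            return listener.accept().then(
                [] (seastar::accept_result ar) {
                  // the accepted socket lives and dies on this shard;
                  // dispatch to the shard-local PGs would go here
                });
          });
        });
  });
}
```

the cost is exactly what's noted above: the OSDMap (or some other
directory) has to advertise N endpoints per OSD instead of one.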
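and for contrast, staying with a single shared port is what forces
handles like the lw_shared_ptr<foreign_ptr<...>> type quoted above.
a minimal sketch of that idiom -- Conn, ConnRef, and send_on_owner are
hypothetical stand-ins, not the crimson types:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>     // seastar::foreign_ptr
#include <seastar/core/shared_ptr.hh>  // seastar::lw_shared_ptr
#include <seastar/core/smp.hh>

// hypothetical stand-in for what ConnectionRef points at
struct Conn {
  seastar::future<> send(int msg) {
    // real code would serialize msg and write it to the socket
    return seastar::make_ready_future<>();
  }
};

// foreign_ptr pins the connection (and its destruction) to the core
// that created it; the outer lw_shared_ptr lets multiple continuations
// on the *local* core share the one handle.
using ConnRef = seastar::lw_shared_ptr<
    seastar::foreign_ptr<seastar::lw_shared_ptr<Conn>>>;

seastar::future<> send_on_owner(ConnRef conn, int msg) {
  // only a raw pointer crosses cores: lw_shared_ptr's refcount is not
  // atomic, so copies of it must never be made or dropped off-core
  Conn* raw = &**conn;
  return seastar::smp::submit_to(conn->get_owner_shard(),
                                 [raw, msg] { return raw->send(msg); })
      .finally([conn] {
        // hold our local reference until the remote send completes;
        // this copy is created and destroyed on the home core only
      });
}
```

note that every send still pays a submit_to hop to the owner core,
which is the cross-core traffic this whole thread is trying to avoid.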
> sage

-- 
Regards
Kefu Chai