On Mon, Aug 27, 2018 at 1:05 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> [snip]
> What this really sounds like is an independent TCP connection to each
> core. Once you get to that point, you may as well just make an OSD per
> core and avoid any of the weird client-visible sharding...
>
> It seems to me like the right model won't be clear until it is actually
> working and we have a better handle on how much CPU we will be spending.
> If we can drive an NVMe with one core, that would be great, but that seems
> like a big leap from where we are now. And it will be pretty dependent
> on the workload.
>
> So either we (1) assume that we can always divvy up the storage in units
> of 1 core and there is no crossbar in the OSD, or (2) we assume there will
> be cases where multiple cores are applied to a single device and we have
> one. 1 doesn't seem realistic/likely to me...

Not sure if I get the points (or am just repeating them). While the OSD is
still a "per-device" concept, it seems to me that we always have to
accommodate multi-core OSDs, so making an OSD per core is not possible. I
think the real difference is to make a PG-shard per core.

With the seastar framework it is easier to implement user-space coroutines
within each core (i.e. async and lockless) and core-local memory management
(i.e. shared-nothing). And there can be at least two design choices:

a) Continue to use the per-client-conn msgr design and then submit each
   request to its PG-sharded core inside seastar-osd (a rough sketch
   follows below);
b) Assign or shard TCP connections to the PG-sharded cores; this requires
   implementing a new pg-sharded-conn msgr design (a big leap here).
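To make (a) a bit more concrete, here is a minimal sketch of the cross-core
hand-off with seastar. This is not actual crimson code; osd_op_t, PGShard,
pg_to_core() and the modulo placement are made up for illustration:

#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>

struct osd_op_t {
  unsigned pg_id;
  // ... op payload ...
};

// One PGShard instance per reactor; each owns a disjoint set of PGs.
struct PGShard {
  seastar::future<> enqueue(osd_op_t op) {
    // run-to-completion handling of the op on this core: pglog ordering,
    // ObjectStore access, etc., with no locking needed.
    return seastar::make_ready_future<>();
  }
};

seastar::sharded<PGShard> pg_shards;  // started once per core at boot

// Map a PG to its owning core (the placement policy here is arbitrary).
unsigned pg_to_core(unsigned pg_id) {
  return pg_id % seastar::smp::count;
}

// Runs on whichever core accepted the client connection (choice a).
seastar::future<> handle_op(osd_op_t op) {
  const unsigned target = pg_to_core(op.pg_id);
  // submit_to() posts the lambda to the target reactor's message queue and
  // returns a future that resolves on the calling core once it completes.
  return seastar::smp::submit_to(target, [op] {
    return pg_shards.local().enqueue(op);  // executed on the PG's core
  });
}

The connection-owning core never touches PG state directly; it only posts a
message to the PG's core and gets a future back, so PG state stays
core-local and lock-free, which is the whole point of choice (a).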
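On the starvation point that comes up further down the thread (the
per-connection `Throttle` and SocketConnection::maybe_throttle()), a
shared-nothing way to express it is a per-connection budget. This is only a
sketch with a plain seastar::semaphore, not the actual implementation:

#include <cstddef>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class ConnectionThrottle {
  seastar::semaphore _in_flight_bytes;  // per-connection byte budget
public:
  explicit ConnectionThrottle(size_t max_bytes)
    : _in_flight_bytes(max_bytes) {}

  // Wait until this connection's budget has room for msg_bytes, run the
  // handler, and give the budget back once the handler's future resolves.
  template <typename Func>
  auto with_budget(size_t msg_bytes, Func&& handle) {
    return seastar::with_semaphore(_in_flight_bytes, msg_bytes,
                                   std::forward<Func>(handle));
  }
};

A reading loop would wrap each dispatched message in with_budget(msg_len,
...): once a high-traffic client exhausts its own budget it waits on its
own semaphore, while the other connections on the same core keep being
served.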
> sage
>
>
> > > I think 2 or 3 shards per osd is enough, so we may burst to 2-3 times
> > > as many connections as before. Even better, we can use the NIC RX RSS
> > > feature to let the kernel avoid cross-core switches.
> >
> > Does "2 or 3 shards" mean 2~3 worker-threads per seastar-osd? I
> > remember the default setting of "osd_op_num_shards_ssd" is 8 and
> > "osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker-threads in
> > the all-flash solution.
> >
> > > kefu chai <tchaikov@xxxxxxxxx> wrote on Fri, Aug 24, 2018 at 6:07 PM:
> > > >
> > > > this is a summary of the discussion on osd-seastar we had in a
> > > > meeting this week.
> > > >
> > > > seastar uses a share-nothing design to take advantage of multi-core
> > > > hardware, but there are some inherent problems in the OSD. in
> > > > seastar-osd, we will have a sharded osd service listening on a given
> > > > port on all configured cores in parallel using SO_REUSEPORT, so the
> > > > connections are evenly distributed [0] across all seastar reactors.
> > > >
> > > > also, in seastar-osd, sharding PGs across different cores looks like
> > > > an intuitive design. for instance, we can
> > > > - ensure the ordering of osd ops to maintain the pglog
> > > > - have better control of the io queue-depth of the storage device
> > > > - maintain a consistent state without extra "locking" of the
> > > >   underlying ObjectStore and PG instances.
> > > >
> > > > but we cannot force a client to send requests to a single PG, or to
> > > > PGs which happen to be hosted by the core that accepts the connection
> > > > from this client. so i think we can only have a run-to-completion
> > > > session for a request chain targeting a certain PG, and forward the
> > > > client's requests to whichever core hosts the PG it wants to talk to.
> > > > this cross-core communication is inevitable, i think.
> > > >
> > > > to avoid a low-traffic connection being starved by a high-traffic
> > > > client on the same core, we use the `Throttle` attached to each
> > > > connection. see SocketConnection::maybe_throttle().
> > > >
> > > > ---
> > > >
> > > > [0] https://lwn.net/Articles/542629/
> > > > --
> > > > Regards
> > > > Kefu Chai
> >
> >
> > --
> > Regards,
> > Yingxin
>
>

--
Regards,
Yingxin