On Sat, Aug 25, 2018 at 11:23 AM Haomai Wang <haomai@xxxxxxxx> wrote:
>
> I have another idea which could avoid the must-cross-core path. When the
> osd boots or starts up, it has, say, three shards to map PGs onto, like
> shard(x) = hash(PG) % shards. Each shard binds a different port; when a
> client tries to connect to an osd it knows from the osdmap, we could add
> extra logic to the messenger handshake to allow a redirect to the
> expected port. This means we still allow a message to be redispatched to
> another core, but the client can be made aware of it: it can learn the
> mapping from the osd and send messages to the expected port in the
> future.

I agree this would implement a true run-to-completion seastar-osd, and
thus promises the best performance. But to reach this goal there are some
difficult parts, IMHO:

* We may still need to implement both the per-client and the per-shard
  seastar-msgr for backward compatibility.
* Do we need to consider the situation where legacy clients and new
  per-shard clients communicate with the same seastar-osd?
* Does each new per-shard seastar-osd need to maintain
  num_of_shards * num_of_clients connections?

> I think 2 or 3 shards per osd is enough, so we may only have 2-3 times
> as many connections as before. What's more, we can use the NIC RX RSS
> feature to let the kernel avoid cross-core switches.

Does "2 or 3 shards" mean 2-3 worker threads per seastar-osd? I remember
the default setting of "osd_op_num_shards_ssd" is 8 and
"osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker threads in the
all-flash configuration.

> kefu chai <tchaikov@xxxxxxxxx> wrote on Fri, Aug 24, 2018 at 6:07 PM:
> >
> > this is a summary of the discussion on osd-seastar we had in a meeting
> > this week.
> >
> > seastar uses a share-nothing design to take advantage of multi-core
> > hardware, but there are some inherent problems in the OSD. in
> > seastar-osd, we will have a sharded osd service listening on a given
> > port on all configured cores in parallel using SO_REUSEPORT, so the
> > connections are evenly distributed [0] across all seastar reactors.
> >
> > also, in seastar-osd, sharding PGs across cores looks like an
> > intuitive design. for instance, we can
> > - ensure the ordering of osd ops so as to maintain a pglog
> > - have better control of the io queue-depth of the storage device
> > - maintain consistent state without extra "locking" of the underlying
> >   ObjectStore and PG instances
> >
> > but we cannot force a client to send requests to a single PG, or only
> > to the PGs which happen to be hosted by the core which accepts the
> > connection from this client. so i think we can only have a
> > run-to-completion session for a request chain targeting a certain PG,
> > and forward each request to whichever core hosts the PG it is
> > addressed to. this cross-core communication is inevitable, i think.
> >
> > to avoid a high-traffic client starving low-traffic connections on a
> > certain core, we use the `Throttle` attached to each connection. see
> > SocketConnection::maybe_throttle().
> >
> > ---
> >
> > [0] https://lwn.net/Articles/542629/
> > --
> > Regards
> > Kefu Chai

--
Regards,
Yingxin
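
P.S. To make the mapping Haomai proposes a bit more concrete, here is a
minimal sketch in plain C++. The names (pg_t fields, shard_of_pg,
port_of_pg, base_port) and the base-port scheme are hypothetical and only
illustrate "shard(x) = hash(PG) % shards with one port per shard"; this is
not the actual crimson/seastar-osd code:

// Hypothetical illustration only -- not the actual crimson code.
#include <cstddef>
#include <cstdint>
#include <functional>

struct pg_t {
  uint64_t pool;   // pool id
  uint32_t seed;   // placement seed of the PG
};

// shard(x) = hash(PG) % shards, as proposed above; any stable hash works
// as long as the osd and the client agree on it.
inline uint32_t shard_of_pg(const pg_t& pg, uint32_t num_shards) {
  std::size_t h = std::hash<uint64_t>{}(pg.pool) ^
                  (std::hash<uint32_t>{}(pg.seed) << 1);
  return static_cast<uint32_t>(h % num_shards);
}

// each shard binds base_port + shard_id; a client that has learned the
// mapping can connect to the expected port directly, otherwise the
// messenger handshake redirects it there.
inline uint16_t port_of_pg(const pg_t& pg, uint16_t base_port,
                           uint32_t num_shards) {
  return static_cast<uint16_t>(base_port + shard_of_pg(pg, num_shards));
}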
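
The sharded listening that kefu describes relies on the kernel's
SO_REUSEPORT balancing (see [0]). As an illustration of that kernel
mechanism only, not of seastar's actual networking stack, each
reactor/shard could open its own listening socket on the same port like
this:

// Illustration of SO_REUSEPORT only; one listening socket per shard,
// and the kernel spreads incoming connections across all sockets bound
// to the same port.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>

int listen_on_shard(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) throw std::runtime_error("socket() failed");
  int one = 1;
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      ::listen(fd, 128) < 0) {
    ::close(fd);
    throw std::runtime_error("bind/listen failed");
  }
  return fd;
}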
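
And for the per-connection `Throttle`: the real logic lives in crimson's
SocketConnection::maybe_throttle(), but the gist, as I understand it, is a
per-connection budget that a busy connection has to wait on, so it cannot
starve quieter connections sharing the same core. A rough seastar-flavoured
sketch (class and method names are mine, not crimson's):

#include <cstddef>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// hypothetical per-connection throttle; not the actual crimson class
class connection_throttle {
  seastar::semaphore budget_;   // bytes this connection may have in flight
public:
  explicit connection_throttle(size_t max_in_flight_bytes)
    : budget_(max_in_flight_bytes) {}

  // call before dispatching a message of `bytes` bytes; the returned
  // future resolves once the connection is back within its budget, and
  // the semaphore units it carries return the budget (RAII) when they
  // are dropped after the op completes.
  auto reserve(size_t bytes) {
    return seastar::get_units(budget_, bytes);
  }
};

On the read path the same idea simply means not issuing the next socket
read until the units are available, which is what keeps one hot connection
from monopolising its reactor.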