On Sat, Aug 25, 2018 at 11:23 AM Haomai Wang <haomai@xxxxxxxx> wrote:
>
> I have another idea which could avoid the must-cross-core path. When the
> osd boots or starts up, it has, say, three shards to map PGs onto, like
> shard(x) = hash(PG) % shards. Each shard binds a different port; when a
> client tries to connect to an osd it knows from the osdmap, we could add
> extra logic to the messenger handshake to allow a redirect to the
> expected port. This means we still allow a message to be redispatched to
> another core, but the client can be made aware of it: it can learn the
> mapping from the osd and send messages to the expected port in the
> future.

I agree this would implement a true run-to-completion seastar-osd, and
thus promises the best performance. But to reach this goal there are some
difficult parts, IMHO:

* We may still need to implement both the per-client and the per-shard
  seastar-msgr for backward compatibility.
* Do we need to consider the situation where legacy clients and new
  per-shard clients communicate with the same seastar-osd?
* Does each new per-shard seastar-osd need to maintain
  num_of_shards * num_of_clients connections?

> I think 2 or 3 shards per osd is enough, so we may only have 2-3 times
> as many connections as before. What's more, we can use the NIC RX RSS
> feature to let the kernel avoid cross-core switches.

Does "2 or 3 shards" mean 2-3 worker threads per seastar-osd? I remember
the default setting of "osd_op_num_shards_ssd" is 8 and
"osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker threads in the
all-flash configuration.

> kefu chai <tchaikov@xxxxxxxxx> wrote on Fri, Aug 24, 2018 at 6:07 PM:
> >
> > this is a summary of the discussion on osd-seastar we had in a meeting
> > this week.
> >
> > seastar uses a share-nothing design to take advantage of multi-core
> > hardware, but there are some inherent problems in the OSD. in
> > seastar-osd, we will have a sharded osd service listening on a given
> > port on all configured cores in parallel using SO_REUSEPORT, so the
> > connections are evenly distributed [0] across all seastar reactors.
> >
> > also, in seastar-osd, sharding PGs across cores looks like an
> > intuitive design. for instance, we can
> > - ensure the ordering of osd ops so as to maintain a pglog
> > - have better control of the io queue-depth of the storage device
> > - maintain consistent state without extra "locking" of the underlying
> >   ObjectStore and PG instances
> >
> > but we cannot force a client to send requests to a single PG, or only
> > to the PGs which happen to be hosted by the core which accepts the
> > connection from this client. so i think we can only have a
> > run-to-completion session for a request chain targeting a certain PG,
> > and forward each request to whichever core hosts the PG it is
> > addressed to. this cross-core communication is inevitable, i think.
> >
> > to avoid a high-traffic client starving low-traffic connections on a
> > certain core, we use the `Throttle` attached to each connection. see
> > SocketConnection::maybe_throttle().
> >
> > ---
> >
> > [0] https://lwn.net/Articles/542629/
> > --
> > Regards
> > Kefu Chai

--
Regards,
Yingxin
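
P.S. To make the mapping Haomai proposes a bit more concrete, here is a
minimal sketch in plain C++. The names (pg_t fields, shard_of_pg,
port_of_pg, base_port) and the base-port scheme are hypothetical and only
illustrate "shard(x) = hash(PG) % shards with one port per shard"; this is
not the actual crimson/seastar-osd code:

// Hypothetical illustration only -- not the actual crimson code.
#include <cstddef>
#include <cstdint>
#include <functional>

struct pg_t {
  uint64_t pool;   // pool id
  uint32_t seed;   // placement seed of the PG
};

// shard(x) = hash(PG) % shards, as proposed above; any stable hash works
// as long as the osd and the client agree on it.
inline uint32_t shard_of_pg(const pg_t& pg, uint32_t num_shards) {
  std::size_t h = std::hash<uint64_t>{}(pg.pool) ^
                  (std::hash<uint32_t>{}(pg.seed) << 1);
  return static_cast<uint32_t>(h % num_shards);
}

// each shard binds base_port + shard_id; a client that has learned the
// mapping can connect to the expected port directly, otherwise the
// messenger handshake redirects it there.
inline uint16_t port_of_pg(const pg_t& pg, uint16_t base_port,
                           uint32_t num_shards) {
  return static_cast<uint16_t>(base_port + shard_of_pg(pg, num_shards));
}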
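
The sharded listening that kefu describes relies on the kernel's
SO_REUSEPORT balancing (see [0]). As an illustration of that kernel
mechanism only, not of seastar's actual networking stack, each
reactor/shard could open its own listening socket on the same port like
this:

// Illustration of SO_REUSEPORT only; one listening socket per shard,
// and the kernel spreads incoming connections across all sockets bound
// to the same port.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>

int listen_on_shard(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) throw std::runtime_error("socket() failed");
  int one = 1;
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      ::listen(fd, 128) < 0) {
    ::close(fd);
    throw std::runtime_error("bind/listen failed");
  }
  return fd;
}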
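
And for the per-connection `Throttle`: the real logic lives in crimson's
SocketConnection::maybe_throttle(), but the gist, as I understand it, is a
per-connection budget that a busy connection has to wait on, so it cannot
starve quieter connections sharing the same core. A rough seastar-flavoured
sketch (class and method names are mine, not crimson's):

#include <cstddef>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// hypothetical per-connection throttle; not the actual crimson class
class connection_throttle {
  seastar::semaphore budget_;   // bytes this connection may have in flight
public:
  explicit connection_throttle(size_t max_in_flight_bytes)
    : budget_(max_in_flight_bytes) {}

  // call before dispatching a message of `bytes` bytes; the returned
  // future resolves once the connection is back within its budget, and
  // the semaphore units it carries return the budget (RAII) when they
  // are dropped after the op completes.
  auto reserve(size_t bytes) {
    return seastar::get_units(budget_, bytes);
  }
};

On the read path the same idea simply means not issuing the next socket
read until the units are available, which is what keeps one hot connection
from monopolising its reactor.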