On Mon, 27 Aug 2018, Yingxin Cheng wrote:
> On Sat, Aug 25, 2018 at 11:23 AM Haomai Wang <haomai@xxxxxxxx> wrote:
> >
> > I have another idea which could avoid the must-cross-core approach.
> > When the osd boots, it has three shards to map PGs, like
> > shard(x)=hash(PG) % shards. Different shards bind different ports; when
> > a client tries to connect to an osd known from the osdmap, we could add
> > extra logic in the messenger handshake to allow a redirect to the
> > expected port. It means we allow message redispatch to another core,
> > but if the client is aware of this, it can learn from the osd and send
> > messages to the expected port in the future.
>
> I agree this will implement a true run-to-completion seastar-osd, and
> thus implies the best performance.
> But to reach this goal, there are some difficult parts IMHO:
> * Maybe we still need to implement both per-client and per-shard
>   seastar-msgr for backward compatibility.
> * Do we need to consider the situation when there are legacy clients
>   and new per-shard clients communicating with the same seastar-osd?
> * Does each new per-shard seastar-osd need to maintain num_of_shards *
>   num_of_clients connections?

What this really sounds like is an independent TCP connection to each
core. Once you get to that point, you may as well just make an OSD per
core and avoid any of the weird client-visible sharding...

It seems to me like the right model won't be clear until it is actually
working and we have a better handle on how much CPU we will be spending.
If we can drive an NVMe with one core, that would be great, but that
seems like a big leap from where we are now. And it will be pretty
dependent on the workload.

So either we (1) assume that we can always divvy up the storage in units
of 1 core and there is no crossbar in the OSD, or (2) we assume there
will be cases where multiple cores are applied to a single device and we
have one. (1) doesn't seem realistic/likely to me...

sage

>
> > I think 2 or 3 shards per osd is enough, so we may have 2 or 3 times
> > as many connections as before. Moreover, we can use the NIC's RX RSS
> > feature to let the kernel avoid cross-core switches.
>
> Does "2 or 3 shards" mean 2~3 worker-threads per seastar-osd? I
> remember the default setting of "osd_op_num_shards_ssd" is 8 and
> "osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker-threads in
> the all-flash solution.
>
> > On Fri, Aug 24, 2018 at 6:07 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> > >
> > > this is a summary of the discussion on osd-seastar we had in a
> > > meeting this week.
> > >
> > > seastar uses a share-nothing design to take advantage of multi-core
> > > hardware. but there are some inherent problems in OSD. in seastar-osd,
> > > we will have a sharded osd service listening on a given port on all
> > > configured cores in parallel using SO_REUSEPORT, so the connections
> > > are evenly distributed [0] across all seastar reactors.
> > >
> > > also, in seastar-osd, sharding PGs across different cores looks like
> > > an intuitive design. for instance, we can
> > > - ensure the order of osd ops to maintain a pglog
> > > - have better control of the io queue-depth of the storage device
> > > - maintain a consistent state without extra "locking" of the
> > >   underlying ObjectStore and PG instances.
> > >
> > > but we cannot force a client to send requests to a single PG, or to
> > > the PGs which happen to be hosted by the core which accepts the
> > > connection from this client. so i think we can only have a
> > > run-to-completion session for a request chain which is targeting a
> > > certain PG, and forward the client to whichever PG it wants to talk
> > > to. this cross-core communication is inevitable, i think.
> > >
> > > to avoid starving a low-traffic connection by a high-traffic client
> > > on a certain core, we use the `Throttle` attached to each connection.
> > > see SocketConnection::maybe_throttle().
> > >
> > > ---
> > >
> > > [0] https://lwn.net/Articles/542629/
> > > --
> > > Regards
> > > Kefu Chai
> >
> --
> Regards,
> Yingxin
>
>
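
For reference, the per-shard port idea quoted above (shard(x)=hash(PG) % shards, plus a redirect in the messenger handshake) could be sketched roughly as below. This is only an illustration under stated assumptions, not how seastar-osd or the Ceph messenger actually works: CRC32 stands in for whatever PG hash would be used, and BASE_PORT, the function names, and the "accept"/"redirect" replies are all hypothetical. The last helper computes the num_of_shards * num_of_clients figure raised in the thread, read as a worst case where every client talks to every shard.

```python
import zlib

NUM_SHARDS = 3    # "2 or 3 shards per osd", as suggested in the thread
BASE_PORT = 6800  # hypothetical: shard i listens on BASE_PORT + i


def shard_for_pg(pgid: str, num_shards: int = NUM_SHARDS) -> int:
    """shard(x) = hash(PG) % shards; CRC32 is a stand-in hash (assumption)."""
    return zlib.crc32(pgid.encode("utf-8")) % num_shards


def port_for_shard(shard: int, base_port: int = BASE_PORT) -> int:
    """Hypothetical layout: one listening port per shard."""
    return base_port + shard


def handshake(connected_port: int, pgid: str) -> tuple:
    """Accept if the client reached the shard owning the PG, else redirect.

    Returns ("accept", port) or ("redirect", port); a client could cache
    the returned port and connect there directly next time.
    """
    expected = port_for_shard(shard_for_pg(pgid))
    if connected_port == expected:
        return ("accept", expected)
    return ("redirect", expected)


def worst_case_connections(num_shards: int, num_clients: int) -> int:
    """Upper bound on connections if every client connects to every shard."""
    return num_shards * num_clients
```

A client that lands on the wrong port would get ("redirect", port) once and could then reconnect to that port for later requests to the same PG; legacy clients that ignore the redirect would still need the cross-core forwarding path discussed above.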