On Mon, Aug 27, 2018 at 1:05 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> [snip]
> What this really sounds like is an independent TCP connection to each
> core. Once you get to that point, you may as well just make an OSD per
> core and avoid any of the weird client-visible sharding...
>
> It seems to me like the right model won't be clear until it is actually
> working and we have a better handle on how much CPU we will be spending.
> If we can drive an NVMe with one core, that would be great, but that seems
> like a big leap from where we are now. And it will be pretty dependent
> on the workload.
>
> So either we (1) assume that we can always divvy up the storage in units
> of 1 core and there is no crossbar in the OSD, or (2) we assume there will
> be cases where multiple cores are applied to a single device and we have
> one. 1 doesn't seem realistic/likely to me...

Not sure if I get the points (or am just repeating them). While the OSD is
still a "per-device" concept, it seems to me that we always have to
accommodate multi-core OSDs, so making an OSD per core is not possible. I
think the real difference is to make a PG-shard per core.

With the seastar framework it is easier to implement user-space coroutines
within each core (i.e. async and lockless) and core-local memory management
(i.e. shared-nothing). And there can be at least two design choices:

a) Continue to use the per-client-conn msgr design and then submit each
   request to its PG-sharded core inside seastar-osd (a rough sketch
   follows below);
b) Assign or shard TCP connections to the PG-sharded cores; this requires
   implementing a new pg-sharded-conn msgr design (a big leap here).
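To make (a) a bit more concrete, here is a minimal sketch of the cross-core
hand-off with seastar. This is not actual crimson code; osd_op_t, PGShard,
pg_to_core() and the modulo placement are made up for illustration:

#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>

struct osd_op_t {
  unsigned pg_id;
  // ... op payload ...
};

// One PGShard instance per reactor; each owns a disjoint set of PGs.
struct PGShard {
  seastar::future<> enqueue(osd_op_t op) {
    // run-to-completion handling of the op on this core: pglog ordering,
    // ObjectStore access, etc., with no locking needed.
    return seastar::make_ready_future<>();
  }
};

seastar::sharded<PGShard> pg_shards;  // started once per core at boot

// Map a PG to its owning core (the placement policy here is arbitrary).
unsigned pg_to_core(unsigned pg_id) {
  return pg_id % seastar::smp::count;
}

// Runs on whichever core accepted the client connection (choice a).
seastar::future<> handle_op(osd_op_t op) {
  const unsigned target = pg_to_core(op.pg_id);
  // submit_to() posts the lambda to the target reactor's message queue and
  // returns a future that resolves on the calling core once it completes.
  return seastar::smp::submit_to(target, [op] {
    return pg_shards.local().enqueue(op);  // executed on the PG's core
  });
}

The connection-owning core never touches PG state directly; it only posts a
message to the PG's core and gets a future back, so PG state stays
core-local and lock-free, which is the whole point of choice (a).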
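On the starvation point that comes up further down the thread (the
per-connection `Throttle` and SocketConnection::maybe_throttle()), a
shared-nothing way to express it is a per-connection budget. This is only a
sketch with a plain seastar::semaphore, not the actual implementation:

#include <cstddef>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class ConnectionThrottle {
  seastar::semaphore _in_flight_bytes;  // per-connection byte budget
public:
  explicit ConnectionThrottle(size_t max_bytes)
    : _in_flight_bytes(max_bytes) {}

  // Wait until this connection's budget has room for msg_bytes, run the
  // handler, and give the budget back once the handler's future resolves.
  template <typename Func>
  auto with_budget(size_t msg_bytes, Func&& handle) {
    return seastar::with_semaphore(_in_flight_bytes, msg_bytes,
                                   std::forward<Func>(handle));
  }
};

A reading loop would wrap each dispatched message in with_budget(msg_len,
...): once a high-traffic client exhausts its own budget it waits on its
own semaphore, while the other connections on the same core keep being
served.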
> sage
>
>
> > > I think 2 or 3 shards per osd is enough, so we may burst to 2-3 times
> > > as many connections as before. Even better, we can use the NIC RX RSS
> > > feature to let the kernel avoid cross-core switches.
> >
> > Does "2 or 3 shards" mean 2~3 worker-threads per seastar-osd? I
> > remember the default setting of "osd_op_num_shards_ssd" is 8 and
> > "osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker-threads in
> > the all-flash solution.
> >
> > > kefu chai <tchaikov@xxxxxxxxx> wrote on Fri, Aug 24, 2018 at 6:07 PM:
> > > >
> > > > this is a summary of the discussion on osd-seastar we had in a
> > > > meeting this week.
> > > >
> > > > seastar uses a share-nothing design to take advantage of multi-core
> > > > hardware, but there are some inherent problems in the OSD. in
> > > > seastar-osd, we will have a sharded osd service listening on a given
> > > > port on all configured cores in parallel using SO_REUSEPORT, so the
> > > > connections are evenly distributed [0] across all seastar reactors.
> > > >
> > > > also, in seastar-osd, sharding PGs across different cores looks like
> > > > an intuitive design. for instance, we can
> > > > - ensure the ordering of osd ops to maintain the pglog
> > > > - have better control of the io queue-depth of the storage device
> > > > - maintain a consistent state without extra "locking" of the
> > > >   underlying ObjectStore and PG instances.
> > > >
> > > > but we cannot force a client to send requests to a single PG, or to
> > > > PGs which happen to be hosted by the core that accepts the connection
> > > > from this client. so i think we can only have a run-to-completion
> > > > session for a request chain targeting a certain PG, and forward the
> > > > client's requests to whichever core hosts the PG it wants to talk to.
> > > > this cross-core communication is inevitable, i think.
> > > >
> > > > to avoid a low-traffic connection being starved by a high-traffic
> > > > client on the same core, we use the `Throttle` attached to each
> > > > connection. see SocketConnection::maybe_throttle().
> > > >
> > > > ---
> > > >
> > > > [0] https://lwn.net/Articles/542629/
> > > > --
> > > > Regards
> > > > Kefu Chai
> >
> >
> > --
> > Regards,
> > Yingxin
>
>

--
Regards,
Yingxin