As you might know, Seastar encourages a share-nothing programming paradigm. But as we found in previous discussions, there is always some cross-core communication in the sharded seastar-osd, because a couple of infrastructures could be shared by a sharded OSD, namely:

- the osdmap cache
- the connections to peer OSDs, and the heartbeats with them
- the connections to the monitor and mgr, and the beacons/reports to them
- the i/o to the underlying objectstore

Recently, while working on the cross-core messenger[0], we found that in order to share a connection between cores we need types like "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>", because

- the connections to peer OSDs are shared across cores, and
- the connections are shared by multiple continuations on the local core -- either locally or remotely,

and we need to perform the i/o on the core where the connection was established. Personally, I feel this is a bad smell: it is complicated and always involves cross-core communication.

Radoslaw suggested an alternative: a single-threaded OSD, which pushes the share-nothing design to another level. In this design, just like in the existing model, an OSD host still has multiple OSD instances, but each OSD instance runs on, and only on, its own designated core. Nothing is shared across these OSD instances. So we can still benefit from Seastar, and at the same time we don't need to worry about the complexity and performance degradation caused by cross-core communication.

This design resembles the co-located OSD design we were talking about, in the sense that all OSDs reside in the same process. The difference is that it enforces a strict share-nothing model. On the other hand, the single-threaded OSD comes with the following restrictions/assumptions:

- A 1-to-1 mapping from core to OSD. Some of the following questions also apply to the NIC:
  * Mark was concerned about the case where we have more stores (disks) than cores: how can we do the mapping? Probably by grouping disks into an LVM volume? But that would increase the load on the core which gets mapped to that LVM volume, causing a load imbalance, I think.
  * How about more cores than stores?
  * How do we shard a high-throughput storage device? For instance, to take full advantage of an NVMe device, we might need to drive it with 4 or more cores. But how? Can we leverage virtualization techniques like SPDK vhost or SPDK Blobstore? For a device supporting SR-IOV, it would probably be simpler.
- We are unable to share the osdmap cache. Considering a high-density storage deployment, where more than 40 disks are squeezed into a single host, not being able to reuse the osdmap cache is a shame.
- We are unable to share the connections to peer OSDs, mon and mgr. Probably not a big deal compared with the existing non-co-located OSD, but if we compare it with the co-located OSD, you'll see what we would be missing.

We had some discussions on this topic recently at the crimson standup and at the perf meeting, but I feel the only consensus we reached is that it's difficult to tell which way to go -- 1-1 mapping or m-n mapping. What I can think of is to avoid making the decision now, and instead to encapsulate the difference between these two approaches in as small a scope as possible. For instance, we could hide the difference between a shared messenger and a non-shared messenger inside the messenger's implementation itself, and provide a consistent API to the caller/dispatcher.
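To make this last point a bit more concrete, here is a very rough sketch in Seastar-flavoured C++ of what I mean by a consistent API. All the names (Connection, SocketConnection, Message, write(), etc.) are made up for illustration and are not the actual crimson types, and lifetime/back-pressure handling across the hop is glossed over:

#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sharded.hh>   // seastar::foreign_ptr
#include <seastar/core/smp.hh>       // seastar::smp::submit_to
#include <utility>

struct Message { /* payload omitted */ };

// the interface exposed to the caller/dispatcher -- identical for both models
class Connection {
public:
  virtual ~Connection() = default;
  virtual seastar::future<> send(Message msg) = 0;
};

// the object that actually owns the socket; it must be driven on the
// core where the connection was established
struct SocketConnection {
  seastar::future<> write(Message) {
    // the real thing would serialize and write to the socket here
    return seastar::make_ready_future<>();
  }
};
using SocketConnectionRef = seastar::shared_ptr<SocketConnection>;

// single-threaded-OSD flavour: the socket always lives on this core,
// so send() is a plain local call
class LocalConnection final : public Connection {
  SocketConnectionRef conn;
public:
  explicit LocalConnection(SocketConnectionRef c) : conn(std::move(c)) {}
  seastar::future<> send(Message msg) override {
    return conn->write(std::move(msg));
  }
};

// shared-messenger flavour: the socket may live on another core, so
// send() hops to the owner shard behind the very same interface
class CrossCoreConnection final : public Connection {
  seastar::lw_shared_ptr<seastar::foreign_ptr<SocketConnectionRef>> conn;
public:
  explicit CrossCoreConnection(
      seastar::lw_shared_ptr<seastar::foreign_ptr<SocketConnectionRef>> c)
    : conn(std::move(c)) {}
  seastar::future<> send(Message msg) override {
    // keeping the connection alive across the hop is glossed over here;
    // this is exactly where the lw_shared_ptr<foreign_ptr<...>> dance
    // in [0] comes from
    SocketConnection* sock = conn->get();
    return seastar::smp::submit_to(conn->get_owner_shard(),
        [sock, msg = std::move(msg)]() mutable {
          return sock->write(std::move(msg));
        });
  }
};

The dispatcher only ever sees Connection::send(): a single-threaded OSD would construct something like LocalConnection, a shared messenger would construct CrossCoreConnection, and switching between the two should not ripple beyond the messenger.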
So we could switch over to the single-threaded OSD in the future, if necessary, with less pain. But I admit that it does not address the complexity and pain of "seastar::lw_shared_ptr<seastar::foreign_ptr<ConnectionRef>>". =)

Thoughts?

---
[0] https://github.com/ceph/ceph/pull/24945

--
Regards
Kefu Chai