On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > > > > I have met with the team from IONOS about their testing on actual IB
> > > > > hardware here at KVM Forum today and the requirements are starting
> > > > > to make more sense to me. I didn't say much in our previous thread
> > > > > because I misunderstood the requirements, so let me try to explain
> > > > > and see if we're all on the same page. There appears to be a
> > > > > fundamental limitation here with rsocket, for which I don't see how
> > > > > it is possible to overcome.
> > > > > The basic problem is that rsocket is trying to present a stream
> > > > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > > > The whole point of using RDMA in the first place is to avoid using
> > > > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > > > gigabytes) need to be registered with the hardware *in advance*
> > > > > (this is how the original implementation works).
> > > > > The need to fake a socket/bytestream abstraction eventually breaks
> > > > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > > > previous reported in testing.... see that email), it appears that
> > > > > means that rsocket is only going to be able to map a certain limited
> > > > > amount of memory with the hardware until its internal "buffer" runs
> > > > > out before it can then unmap and remap the next batch of memory with
> > > > > the hardware to continue along with the fake bytestream. This is
> > > > > very much sticking a square peg in a round hole.
> > > > > If you were to "relax" the rsocket implementation to register the
> > > > > entire VM memory space (as my original implementation does), then
> > > > > there wouldn't be any need for rsocket in the first place.
> > >
> > > Yes, some test like this can be helpful.
> > >
> > > And thanks for the summary. That's definitely helpful.
> > >
> > > One question from my side (as someone knows nothing on RDMA/rsocket): is
> > > that "a few GBs" limitation a software guard? Would it be possible that
> > > rsocket provide some option to allow user opt-in on setting that value,
> > > so that it might work for VM use case? Would that consume similar
> > > resources v.s. the current QEMU impl but allows it to use rsockets with
> > > no perf regressions?
> >
> > Rsockets is emulated the streaming socket API. The amount of memory
> > dedicated to a single rsocket is controlled through a wmem_default
> > configuration setting. It is also configurable via rsetsockopt()
> > SO_SNDBUF. Both of those are similar to TCP settings. The SW field used
> > to store this value is 32-bits.
> >
> > This internal buffer acts as a bounce buffer to convert the synchronous
> > socket API calls into the asynchronous RDMA transfers. Rsockets uses the
> > CPU for data copies, but the transport is offloaded to the NIC, including
> > kernel bypass.
>
> Understood.
>
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
>
> Yes, it absolutely does. We're dealing with virtual machines here, right? It
> is possible (and likely) to have a virtual machine that is hundreds of GBs
> of RAM in size.
>
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
>
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.

Is it possible to make rsocket friendly to large buffers (>4GB), as in the
VM use case?

I also wonder whether there are other applications outside of QEMU that
may benefit from this.

Thanks,

--
Peter Xu