Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

Michael Galaxy <mgalaxy@xxxxxxxxxx> · Sat, 28 Sep 2024 12:52:08 -0500

On 9/27/24 16:45, Sean Hefty wrote:

I have met with the team from IONOS about their testing on actual IB
hardware here at KVM Forum today and the requirements are starting
to make more sense to me. I didn't say much in our previous thread
because I misunderstood the requirements, so let me try to explain
and see if we're all on the same page. There appears to be a
fundamental limitation here with rsocket, and I don't see how it can be
overcome.
The basic problem is that rsocket is trying to present a stream
abstraction, a concept that is fundamentally incompatible with RDMA.
The whole point of using RDMA in the first place is to avoid using
the CPU, and to do that, all of the memory (potentially hundreds of
gigabytes) needs to be registered with the hardware *in advance* (this is
how the original implementation works).
The need to fake a socket/bytestream abstraction eventually breaks
down: there is a limit (a few GB) in rsocket, which the IONOS team
previously reported in testing (see that email). It appears this
means that rsocket can only map a limited amount of memory with the
hardware until its internal "buffer" runs out; it must then unmap and
remap the next batch of memory with the hardware to continue along
with the fake bytestream. This is
very much sticking a square peg in a round hole. If you were to
"relax" the rsocket implementation to register the entire VM memory
space (as my original implementation does), then there wouldn't be any
need for rsocket in the first place.

Yes, some test like this can be helpful.

And thanks for the summary.  That's definitely helpful.

One question from my side (as someone who knows nothing about RDMA/rsocket): is
that "a few GBs" limitation a software guard?  Would it be possible for rsocket
to provide some option that lets the user opt in to setting that value, so that
it might work for the VM use case?  Would that consume similar resources vs. the
current QEMU impl while allowing it to use rsockets with no perf regressions?
Rsockets emulates the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32 bits.
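
To make the TCP analogy concrete, here is a minimal sketch in Python. It uses the ordinary socket API as a stand-in for rsetsockopt(), not rsocket itself. The helper names fits_in_uint32 and request_sndbuf are made up for this example, and treating the 32-bit field as unsigned is an assumption. It only shows that a buffer size of 4 GiB or more cannot be expressed in such a field.

```python
import socket
import struct

GIB = 1 << 30

def fits_in_uint32(value):
    """Return True if value fits in an unsigned 32-bit field (assumed width)."""
    try:
        struct.pack("=I", value)
        return True
    except struct.error:
        return False

def request_sndbuf(size):
    """Request a send buffer of `size` bytes on a plain TCP socket.

    Returns the value the kernel reports back, which may differ from
    the request because the kernel can clamp or adjust it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

For example, fits_in_uint32(4 * GIB - 1) is true while fits_in_uint32(4 * GIB) is false, so a per-socket buffer sized for a VM with hundreds of GBs of RAM could not be requested through a field of this width.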

This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
Understood.
Does your kernel allocate > 4 GBs of buffer space to an individual socket?
Yes, it absolutely does. We're dealing with virtual machines here, 
right? It is possible (and likely) to have a virtual machine that is 
hundreds of GBs of RAM in size.

A bounce buffer defeats the entire purpose of using RDMA in these cases. 
When using RDMA for very large transfers like this, the goal here is to 
map the entire memory region at once and avoid all CPU interactions 
(except for message management within libibverbs) so that the NIC is 
doing all of the work.
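
As a rough illustration of the difference, here is a toy cost model in Python. It measures nothing, the function names are hypothetical, and the 256 GiB VM and 4 GiB buffer in the usage note are assumed numbers. It just counts how often a fixed-size bounce buffer has to be refilled and how many bytes the sending CPU copies, compared with registering the whole region once.

```python
GIB = 1 << 30

def bounce_buffer_cost(total_bytes, buffer_bytes):
    """Model streaming total_bytes through a fixed-size bounce buffer.

    Returns (refills, cpu_copied_bytes): how many times the buffer is
    filled, and how many bytes the sending CPU copies into it.
    """
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    refills = -(-total_bytes // buffer_bytes)  # ceiling division
    return refills, total_bytes

def preregistered_cost(total_bytes):
    """Model registering the whole region once: one registration, no CPU copies."""
    return 1, 0
```

With these assumptions, bounce_buffer_cost(256 * GIB, 4 * GIB) gives 64 refills and 256 GiB of CPU copies on the sender, while preregistered_cost(256 * GIB) gives one registration and no copies.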

I'm sure rsocket has its place with much smaller transfer sizes, but 
this is very different.

- Michael