Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

Michael Galaxy <mgalaxy@xxxxxxxxxx> · Fri, 27 Sep 2024 15:34:48 -0500

Hi Gonglei,

On 9/22/24 20:04, Gonglei (Arei) wrote:

Hi,

-----Original Message-----
From: Michael Galaxy [mailto:mgalaxy@xxxxxxxxxx]
Sent: Monday, September 23, 2024 3:29 AM
To: Michael S. Tsirkin <mst@xxxxxxxxxx>; Peter Xu <peterx@xxxxxxxxxx>
Cc: Gonglei (Arei) <arei.gonglei@xxxxxxxxxx>; qemu-devel@xxxxxxxxxx;
yu.zhang@xxxxxxxxx; elmar.gerdes@xxxxxxxxx; zhengchuan
<zhengchuan@xxxxxxxxxx>; berrange@xxxxxxxxxx; armbru@xxxxxxxxxx;
lizhijian@xxxxxxxxxxx; pbonzini@xxxxxxxxxx; Xiexiangyou
<xiexiangyou@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; lixiao (H)
<lixiao91@xxxxxxxxxx>; jinpu.wang@xxxxxxxxx; Wangjialin
<wangjialin23@xxxxxxxxxx>
Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

Hi All,

I have met with the team from IONOS about their testing on actual IB
hardware here at KVM Forum today and the requirements are starting to make
more sense to me. I didn't say much in our previous thread because I
misunderstood the requirements, so let me try to explain and see if we're all on
the same page. There appears to be a fundamental limitation here with rsocket,
and I don't see how it is possible to overcome it.

The basic problem is that rsocket is trying to present a stream abstraction, a
concept that is fundamentally incompatible with RDMA. The whole point of
using RDMA in the first place is to avoid using the CPU, and to do that, all of the
memory (potentially hundreds of gigabytes) needs to be registered with the
hardware *in advance* (this is how the original implementation works).

The need to fake a socket/bytestream abstraction eventually breaks down:
there is a limit (a few GB) in rsocket, which the IONOS team previously
reported in testing (see that email). It appears this means that rsocket can
only map a limited amount of memory with the hardware before its internal
"buffer" runs out; it then has to unmap and remap the next batch of memory
with the hardware to continue along with the fake bytestream. This
is very much sticking a square peg in a round hole. If you were to "relax" the
rsocket implementation to register the entire VM memory space (as my
original implementation does), then there wouldn't be any need for rsocket in
the first place.
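
To put rough numbers on the paragraph above, here is a toy model in C. It
makes no RDMA calls and says nothing about rsocket's actual internals; the
function names are hypothetical, and the window size passed in is an
assumption (the thread only says "a few GB"). It simply counts how many
map/unmap rounds a bounded window would need versus a single up-front
registration.

```c
#include <stdint.h>

/*
 * Toy model only: counts registration rounds, performs no RDMA work.
 * Both function names are hypothetical and exist only for illustration.
 */

/* Rounds needed when at most `window` bytes can be mapped at a time. */
uint64_t windowed_registrations(uint64_t total, uint64_t window)
{
    if (window == 0) {
        return 0; /* invalid window: nothing can be mapped */
    }
    /* Ceiling division, written to avoid overflow in total + window. */
    return total / window + (total % window != 0 ? 1 : 0);
}

/* Up-front registration: one round covers the whole guest, if non-empty. */
uint64_t upfront_registrations(uint64_t total)
{
    return total != 0 ? 1 : 0;
}
```

For example, with an assumed 2 GiB window and a 256 GiB guest,
`windowed_registrations(256ULL << 30, 2ULL << 30)` gives 128 rounds, while
`upfront_registrations(256ULL << 30)` gives 1.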

Thank you for your opinion. You're right: we have run into difficulties
transferring large amounts of data with rsocket, and we still haven't figured
them out, although we did solve several rsocket problems along the way.

In our deployment, VM live migration must complete quickly, with a downtime
of 50 ms or less, so RDMA is an essential requirement for us. Going forward,
I think we will build on QEMU's native RDMA live migration solution. During
this work, we came to doubt whether RDMA live migration was really feasible
through the rsocket refactoring, so the refactoring plan was shelved.

Regards,
-Gonglei

OK, this is helpful. Thanks for the response.

So that means we do still have two consumers of the native libibverbs 
RDMA solution.

Comments are still welcome. Is there still a reason to pursue this line 
of work that I might be missing?

- Michael