RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API

Hi,

> -----Original Message-----
> From: Michael Galaxy [mailto:mgalaxy@xxxxxxxxxx]
> Sent: Monday, September 23, 2024 3:29 AM
> To: Michael S. Tsirkin <mst@xxxxxxxxxx>; Peter Xu <peterx@xxxxxxxxxx>
> Cc: Gonglei (Arei) <arei.gonglei@xxxxxxxxxx>; qemu-devel@xxxxxxxxxx;
> yu.zhang@xxxxxxxxx; elmar.gerdes@xxxxxxxxx; zhengchuan
> <zhengchuan@xxxxxxxxxx>; berrange@xxxxxxxxxx; armbru@xxxxxxxxxx;
> lizhijian@xxxxxxxxxxx; pbonzini@xxxxxxxxxx; Xiexiangyou
> <xiexiangyou@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; lixiao (H)
> <lixiao91@xxxxxxxxxx>; jinpu.wang@xxxxxxxxx; Wangjialin
> <wangjialin23@xxxxxxxxxx>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi All,
> 
> I met with the team from IONOS about their testing on actual IB hardware
> here at KVM Forum today, and the requirements are starting to make more
> sense to me. I didn't say much in our previous thread because I
> misunderstood the requirements, so let me try to explain and see if we're
> all on the same page. There appears to be a fundamental limitation here
> with rsocket, one that I don't see any way to overcome.
> 
> The basic problem is that rsocket is trying to present a stream
> abstraction, a concept that is fundamentally incompatible with RDMA. The
> whole point of using RDMA in the first place is to avoid using the CPU,
> and to do that, all of the memory (potentially hundreds of gigabytes)
> needs to be registered with the hardware *in advance* (this is how the
> original implementation works).
> 
> The need to fake a socket/bytestream abstraction eventually breaks down:
> there is a limit (a few GB) in rsocket, which the IONOS team previously
> reported in their testing (see that email). This appears to mean that
> rsocket can only map a limited amount of memory with the hardware before
> its internal "buffer" runs out, at which point it must unmap and remap the
> next batch of memory to continue the fake bytestream. This is very much
> sticking a square peg in a round hole. If you were to "relax" the rsocket
> implementation to register the entire VM memory space (as my original
> implementation does), then there wouldn't be any need for rsocket in the
> first place.
> 

Thank you for your opinion. You're right: rsocket has encountered
difficulties in transferring large amounts of data, and we haven't fully
figured that limitation out yet, although in this exercise we did solve
several other problems with rsocket.

In our use case, we need VM live migration to complete quickly, and the
downtime of live migration must be within 50 ms or less. Therefore we use
RDMA, which is an essential requirement for us. Next, I think we'll build
on QEMU's native RDMA live migration solution. During this work we came to
seriously doubt whether RDMA live migration is feasible through an rsocket
refactoring, so the refactoring plan has been shelved.
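
To illustrate the registration model Michael describes (a minimal sketch,
not code from our series; names are illustrative): with native verbs, the
whole guest RAM block is registered with the HCA once, up front, and every
later RDMA operation reuses that single registration.

    #include <infiniband/verbs.h>
    #include <stdio.h>

    /* Register one large guest-RAM block with the HCA up front, so
     * later RDMA reads/writes need no further registration work.
     * pd comes from ibv_alloc_pd() on an opened device context. */
    static struct ibv_mr *register_guest_ram(struct ibv_pd *pd,
                                             void *ram, size_t len)
    {
        int access = IBV_ACCESS_LOCAL_WRITE |
                     IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE;

        /* One call covers the entire region, however large; the HCA
         * pins and maps it once.  A bytestream emulation instead has
         * to stage data through a bounded internal buffer. */
        struct ibv_mr *mr = ibv_reg_mr(pd, ram, len, access);
        if (!mr) {
            perror("ibv_reg_mr");
        }
        return mr;
    }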


Regards,
-Gonglei

> I think there is just some misunderstanding in the group about the way
> InfiniBand is intended to work. Does that make sense so far? I do
> understand the need for testing, but rsocket is simply not intended for
> the kind of massive bulk data transfers we're proposing to use it for
> here, merely to make our lives easier in testing.
> 
> Regarding testing: in our previous thread earlier this summer, why did we
> not consider writing a better integration test to solve the test-burden
> problem? To explain better: if a new integration test were written for
> QEMU, submitted, and reviewed (a reasonably complex test, in line with a
> traditional live migration integration test that actually spins up QEMU)
> which used softRoCE in a localhost configuration with full libibverbs
> support, and still allowed for compatibility testing with QEMU, would such
> an integration test not be sufficient to handle the testing burden?
> 
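
[Editorial sketch, not an existing QEMU test: such a test would presumably
first probe for a usable RDMA device (a softRoCE/rxe device set up on
localhost would show up here) and skip itself otherwise. A minimal probe
via libibverbs follows; the function name is illustrative.]

    #include <infiniband/verbs.h>
    #include <stdio.h>

    /* Return 1 if at least one RDMA device is visible (e.g. an rxe
     * device created for softRoCE), else 0.  A migration integration
     * test could skip itself when this returns 0. */
    static int have_rdma_device(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list) {
            return 0;
        }
        if (num > 0) {
            printf("found RDMA device: %s\n",
                   ibv_get_device_name(list[0]));
        }
        ibv_free_device_list(list);
        return num > 0;
    }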
> Comments welcome,
> - Michael
> 
> On 8/27/24 15:57, Michael S. Tsirkin wrote:
> >
> > On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
> >> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> >>> From: Jialin Wang <wangjialin23@xxxxxxxxxx>
> >>>
> >>> Hi,
> >>>
> >>> This patch series attempts to refactor RDMA live migration by
> >>> introducing a new QIOChannelRDMA class based on the rsocket API.
> >>>
> >>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> >>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
> >>> the detail of rdma protocol into rsocket and allows us to add
> >>> support for some modern features like multifd more easily.
> >>>
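
[A minimal, hedged illustration of this 1-1 correspondence, not taken from
the series itself (names are illustrative; link with -lrdmacm): a plain TCP
client becomes an rsocket client essentially by renaming the calls.]

    #include <rdma/rsocket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* socket() -> rsocket(), connect() -> rconnect(),
     * send() -> rsend(), close() -> rclose(). */
    static int rs_client_send(const char *ip, uint16_t port,
                              const void *buf, size_t len)
    {
        struct sockaddr_in dst = {
            .sin_family = AF_INET,
            .sin_port   = htons(port),
        };
        int fd;
        ssize_t n;

        if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
            return -1;
        }
        fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }
        if (rconnect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            rclose(fd);
            return -1;
        }
        n = rsend(fd, buf, len, 0);   /* byte stream over RDMA */
        rclose(fd);
        return n < 0 ? -1 : 0;
    }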
> >>> Here is the previous discussion on refactoring RDMA live migration
> >>> using the rsocket API:
> >>>
> >>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@xxxxxxxxxx/
> >>>
> >>> We have encountered some bugs when using rsocket and plan to submit
> >>> them to the rdma-core community.
> >>>
> >>> In addition, using rsocket makes our programming more convenient, but
> >>> it must be noted that this approach introduces multiple memory copies,
> >>> so a certain performance degradation can be expected. We hope that
> >>> friends with RDMA network cards can help verify this, thank you!
> >>>
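
[Editorial aside, not from the series: the copies come from rsocket staging
the stream through its internal send/receive buffers. For reference, rsocket
also exposes buffer-size knobs and an optional direct-placement path
(riomap()/riowrite()) via rsetsockopt(SOL_RDMA, ...); whether these help at
migration scale is exactly what remains to be verified. A hedged sketch of
the knobs, with illustrative values:]

    #include <rdma/rsocket.h>

    /* Hedged sketch: tune an rsocket fd's internal queue sizes and
     * allow remote I/O mappings.  Values are illustrative; defaults
     * and limits come from rdma-core.  Queue sizes generally must be
     * set before rconnect()/rlisten(). */
    static void tune_rsocket(int fd)
    {
        int sq = 512, rq = 512, iomaps = 4;

        rsetsockopt(fd, SOL_RDMA, RDMA_SQSIZE, &sq, sizeof(sq));
        rsetsockopt(fd, SOL_RDMA, RDMA_RQSIZE, &rq, sizeof(rq));
        /* Number of remote mappings the peer may establish with
         * riomap(); enables riowrite() direct placement. */
        rsetsockopt(fd, SOL_RDMA, RDMA_IOMAPSIZE,
                    &iomaps, sizeof(iomaps));
    }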
> >>> Jialin Wang (6):
> >>>    migration: remove RDMA live migration temporarily
> >>>    io: add QIOChannelRDMA class
> >>>    io/channel-rdma: support working in coroutine
> >>>    tests/unit: add test-io-channel-rdma.c
> >>>    migration: introduce new RDMA live migration
> >>>    migration/rdma: support multifd for RDMA migration
> >> This series has been idle for a while; we still need to know how to
> >> move forward.
> >
> > What exactly is the question? This got a bunch of comments; the first
> > thing to do would be to address them.
> >
> >
> >>   I guess I lost track of the latest status..
> >>
> >> Any update (from anyone..) on what stage we are in?
> >>
> >> Thanks,
> >> --
> >> Peter Xu




