Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

Hello Michael and Peter,

Thank you for your quick and kind replies about our plan to take over
maintenance of the RDMA migration code. This message presents that plan
so that we can work on it together.
If we are given the maintainer's role, we plan to:

1. Create the necessary unit-test cases and get them integrated into
the current QEMU GitLab-CI pipeline
2. Review and test code changes from other developers to ensure that
nothing is broken before they are merged by the community
3. Based on our current practice and application scenario, look for
possible improvements when necessary
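
For item 1, a rough sketch of the kind of setup a soft-RoCE based test
might script is shown here. It assumes the RXE approach that Zhijian
describes in the quoted thread below, and two hosts rather than loopback,
since loopback reportedly failed in his earlier attempt. The helper names,
the interface eth0 and the address 192.168.0.2 are placeholders of ours,
not part of QEMU or its test suite, and the code only prints the commands
instead of running them.

```python
import shlex


def rxe_setup_commands(netdev, rxe_name=None):
    """Build the commands that create a soft-RoCE (RXE) link on `netdev`.

    Hypothetical helper: it mirrors the `rdma link add` command quoted in
    the thread and adds `rdma link show` to list the resulting link.
    """
    rxe_name = rxe_name or f"rxe_{netdev}"
    return [
        ["rdma", "link", "add", rxe_name, "type", "rxe", "netdev", netdev],
        ["rdma", "link", "show"],
    ]


def rdma_migration_uri(host, port):
    """Format a migration URI in the rdma:host:port form used in the thread."""
    return f"rdma:{host}:{port}"


def print_plan(netdev, dest_host, port):
    """Print (not run) the steps for a two-host soft-RoCE migration test."""
    # Step 1: create the RXE link; this needs root on both hosts.
    for cmd in rxe_setup_commands(netdev):
        print("sudo " + shlex.join(cmd))
    uri = rdma_migration_uri(dest_host, port)
    # Step 2: start the destination QEMU listening on the RDMA URI.
    print(f"# destination: qemu-system-x86_64 ... -incoming {uri}")
    # Step 3: trigger the migration from the source's HMP monitor.
    print(f"# source monitor: migrate -d {uri}")


if __name__ == "__main__":
    # Placeholder interface and address; adjust to the actual test hosts.
    print_plan("eth0", "192.168.0.2", 3333)
```

Whether such a setup can run inside the GitLab-CI runners is still open;
the sketch only shows that the host-side preparation is small.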

In addition, a patch is attached that announces this change to the community.

With your generous support, we hope that the development community
will make a positive decision for us.

Kind regards,
Yu Zhang @ IONOS Cloud

On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > Hi All (and Peter),
>
> Hi, Michael,
>
> >
> > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > (highly irregular for a male) and yes, that's my real last name:
> > https://www.linkedin.com/in/mrgalaxy/
> >
> > I'm the original author of the RDMA implementation. I've been discussing
> > with Yu Zhang for a little bit about potentially handing over maintainership
> > of the codebase to his team.
> >
> > I simply have zero access to RoCE or Infiniband hardware at all,
> > unfortunately, so I've never been able to run tests or use what I wrote at
> > work, and as all of you know, if you don't have a way to test something,
> > then you can't maintain it.
> >
> > Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > they feel comfortable training his team to maintain the codebase (and run
> > tests) while they learn about it.
>
> The "while learning" part is fine, at least to me.  IMHO "ownership" of
> the code, or say, taking over the responsibility, does not necessarily
> require 100% mastery of the code base first.  There should still be some
> fundamental confidence to work on the code as a starting point, though;
> then it's about a serious use case to back this up, and careful testing
> while getting more familiar with it.
>
> >
> > If you don't mind, I'd like to let him send over his (very detailed)
> > proposal,
>
> Yes please, it's exactly the time to share the plan.  The hope is we try to
> reach a consensus before or around the middle of this release (9.1).
> Normally QEMU has a 3~4 month window for each release and the 9.1 schedule
> is not yet out, but I think it means we make a decision before or around
> the middle of June.
>
> Thanks,
>
> >
> > - Michael
> >
> > On 4/11/24 11:36, Yu Zhang wrote:
> > > > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > > > >      periodic tests for each QEMU release will be needed.
> > > > We use a batch of regression test cases for the stack, which also cover
> > > > QEMU. I ran these tests for most of the QEMU releases planned as
> > > > candidates for rollout.
> > > >
> > > > The migration test needs a pair of (either physical or virtual) servers with
> > > > an InfiniBand network, which makes it difficult to do on a single server.
> > > > Nested VMs could be a possible approach, for which we may need a virtual
> > > > InfiniBand network. Is SoftRoCE [1] an option? I will try it and let you know.
> > >
> > > > [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
> > >
> > > Thanks and best regards!
> > >
> > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > > > > >
> > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > >
> > > > > > > > > Is there a document/link about the unit tests/CI for migration? Why
> > > > > > > > > are those tests missing?
> > > > > > > > > Is it hard or very special to set up an environment for that? Maybe we
> > > > > > > > > can help in this regard.
> > > > > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > > > > there and that's covered in CI.
> > > > > > >
> > > > > > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > > > > > Can an rdma migration test be carried out without real hardware?
> > > > > > > Yeah, RXE (aka Soft-RoCE) is able to emulate RDMA, for example:
> > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > >
> > > > > > > Also, the loopback (lo) device is able to emulate the RDMA interface
> > > > > > > "rxe_lo"; however, when I tried (years ago) to do RDMA migration over
> > > > > > > this interface (rdma:127.0.0.1:3333), something went wrong.
> > > > > > > So I gave up enabling the RDMA migration qtest at that time.
> > > > > Thanks, Zhijian.
> > > > >
> > > > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > > > > Maybe someone more familiar with how CI works can chime in.
> > > > > Some people got dropped from the cc list for an unknown reason, I'm adding
> > > > > them back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped
> > > > > by accident.
> > > >
> > > > I'll try to summarize what is still missing, and I think these will be
> > > > greatly helpful if we don't want to deprecate rdma migration:
> > > >
> > > > >    1) Either a CI test covering at least the major RDMA paths, or at least
> > > > >       periodic tests for each QEMU release will be needed.
> > > >
> > > > >    2) Some performance tests between modern RDMA and NIC devices are
> > > > >       welcome.  The current knowledge is that a modern NIC can perform
> > > > >       similarly to RDMA, so it's debatable why we still maintain so much
> > > > >       rdma specific code.
> > > >
> > > > >    3) No need for solid patchsets for this one, but some plan to improve
> > > > >       the RDMA migration code so that it is not almost isolated from the
> > > > >       rest of the protocols.
> > > >
> > > >    4) Someone to look after this code for real.
> > > >
> > > > For 2) and 3) more info is here:
> > > >
> > > > > https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
> > > >
> > > > > Here 4) can be the most important as Markus pointed out.  We just haven't
> > > > > gotten there yet in the discussions, but maybe Markus is right that we
> > > > > should talk about that first.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > > >
> >
>
> --
> Peter Xu
>
From 40dea392f9ca606c2a0c53999d662670eb08b2d8 Mon Sep 17 00:00:00 2001
From: Yu Zhang <yu.zhang@xxxxxxxxx>
Date: Mon, 29 Apr 2024 15:31:53 +0200
Subject: [PATCH] MAINTAINERS: Update the maintainers and reviewers for RDMA
 migration

As the links [1][2] below state, the QEMU development community is
currently having difficulty maintaining the RDMA migration subsystem due
to a lack of resources (maintainers, test cases, test environment etc.)
and is considering deprecating it.

In our experience as users over the past two years, RDMA provides
higher migration speed and lower performance impact on a running VM,
which can significantly improve the end-user's experience during VM live
migration. We believe that RDMA still plays a key role for QoS and can't
yet be replaced by TCP/IP for VM migration.

With the consent and support of Michael Galaxy, who developed this
feature for QEMU, we would like to take over the maintainer's role and
create the necessary resources to maintain it further for the community.

Jinpu Wang is the upstream maintainer of RNBD/RTRS and is experienced in
RDMA programming. Yu Zhang maintains the downstream QEMU used in
production at IONOS Cloud.

[1] https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg00001.html
[2] https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg00228.html

Signed-off-by: Yu Zhang <yu.zhang@xxxxxxxxx>
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxx>
---
 MAINTAINERS | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 302b6fd00c..54d32dff94 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3413,7 +3413,10 @@ F: util/userfaultfd.c
 X: migration/rdma*
 
 RDMA Migration
+M: Yu Zhang <yu.zhang@xxxxxxxxx>
+M: Jack Wang <jack.wang@xxxxxxxxx>
 R: Li Zhijian <lizhijian@xxxxxxxxxxx>
+R: Michael Galaxy <mgalaxy@xxxxxxxxxx>
 R: Peter Xu <peterx@xxxxxxxxxx>
 S: Odd Fixes
 F: migration/rdma*
-- 
2.25.1
