Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

Hello Gonglei,

Jinpu and I have tested your patchset with our migration test cases
on physical RDMA cards. The result: 10 of the 59 migration test
cases failed. They succeed with the original RDMA migration code, but
always fail with the patchset. The syslog on the source server shows
the error below:

Jun  6 13:35:20 ps402a-43 WARN: Migration failed
uuid="44449999-3333-48dc-9082-1b6950e74ee1"
target=2a02:247f:401:2:2:0:a:2c error=Failed(Unable to write to
rsocket: Connection reset by peer)

We also tried to compare the migration speed with and without the
patchset. Without the patchset, a big VM (16 cores, 64 GB memory)
stressed with a heavy memory workload can be migrated successfully.
With the patchset, only a small idle VM (1-2 cores, 2-4 GB memory) can
be migrated successfully. Each failed migration reports the above
error on the source server.

Therefore, I assume that this version is not yet capable of handling
heavy load. I'm also looking into the code to see if anything can be
improved. We really appreciate your excellent work!

Best regards,
Yu Zhang @ IONOS cloud

On Wed, Jun 5, 2024 at 12:00 PM Gonglei (Arei) <arei.gonglei@xxxxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@xxxxxxxxxx>
> > Cc: qemu-devel@xxxxxxxxxx; peterx@xxxxxxxxxx; yu.zhang@xxxxxxxxx;
> > mgalaxy@xxxxxxxxxx; elmar.gerdes@xxxxxxxxx; zhengchuan
> > <zhengchuan@xxxxxxxxxx>; berrange@xxxxxxxxxx; armbru@xxxxxxxxxx;
> > lizhijian@xxxxxxxxxxx; pbonzini@xxxxxxxxxx; Xiexiangyou
> > <xiexiangyou@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; lixiao (H)
> > <lixiao91@xxxxxxxxxx>; jinpu.wang@xxxxxxxxx; Wangjialin
> > <wangjialin23@xxxxxxxxxx>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> >
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@xxxxxxxxxx>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > > > thank you!
> >
> > So you didn't test it with an RDMA card?
>
> Yep, we tested it by Soft-ROCE.
>
> > You really should test with an RDMA card though, for correctness as much as
> > performance.
> >
> We will, we just don't have RDMA cards environment on hand at the moment.
>
> Regards,
> -Gonglei
>
> >
> > > Jialin Wang (6):
> > >   migration: remove RDMA live migration temporarily
> > >   io: add QIOChannelRDMA class
> > >   io/channel-rdma: support working in coroutine
> > >   tests/unit: add test-io-channel-rdma.c
> > >   migration: introduce new RDMA live migration
> > >   migration/rdma: support multifd for RDMA migration
> > >
> > >  docs/rdma.txt                     |  420 ---
> > >  include/io/channel-rdma.h         |  165 ++
> > >  io/channel-rdma.c                 |  798 ++++++
> > >  io/meson.build                    |    1 +
> > >  io/trace-events                   |   14 +
> > >  meson.build                       |    6 -
> > >  migration/meson.build             |    3 +-
> > >  migration/migration-stats.c       |    5 +-
> > >  migration/migration-stats.h       |    4 -
> > >  migration/migration.c             |   13 +-
> > >  migration/migration.h             |    9 -
> > >  migration/multifd.c               |   10 +
> > >  migration/options.c               |   16 -
> > >  migration/options.h               |    2 -
> > >  migration/qemu-file.c             |    1 -
> > >  migration/ram.c                   |   90 +-
> > >  migration/rdma.c                  | 4205 +----------------------------
> > >  migration/rdma.h                  |   67 +-
> > >  migration/savevm.c                |    2 +-
> > >  migration/trace-events            |   68 +-
> > >  qapi/migration.json               |   13 +-
> > >  scripts/analyze-migration.py      |    3 -
> > >  tests/unit/meson.build            |    1 +
> > >  tests/unit/test-io-channel-rdma.c |  276 ++
> > > >  24 files changed, 1360 insertions(+), 4832 deletions(-)
> > > >  delete mode 100644 docs/rdma.txt
> > > >  create mode 100644 include/io/channel-rdma.h
> > > >  create mode 100644 io/channel-rdma.c
> > > >  create mode 100644 tests/unit/test-io-channel-rdma.c
> > >
> > > --
> > > 2.43.0
>




