Hi Peter,

On Tue, Apr 9, 2024 at 9:47 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> > Hi Peter,
> >
> > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > > Hi Peter,
> > >
> > > Jinpu,
> > >
> > > Thanks for joining the discussion.
> > >
> > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > > Hello Peter and Zhijian,
> > > > > >
> > > > > > Thank you so much for letting me know about this. I'm also a bit surprised at the plan for deprecating the RDMA migration subsystem.
> > > > >
> > > > > It's not too late. Since it looks like we do have users who were not yet notified of this, we'll redo the deprecation procedure even if it'll be the final plan, and it'll be 2 releases after this.
> > > > >
> > > > > > > IMHO it's more important to know whether there are still users and whether they would still like to see it around.
> > > > > >
> > > > > > I admit RDMA migration lacked testing (unit/CI tests), which led to a few obvious bugs being noticed too late.
> > > > > >
> > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage for this part. As soon as 8.2 was released, I saw that many of the migration test cases failed and came to realize that there might be a bug between 8.1 and 8.2, but was unable to confirm and report it quickly to you.
> > > > > >
> > > > > > The maintenance of this part could be too costly or difficult from your point of view.
> > > > >
> > > > > It may or may not be too costly; it's just that we need real users of RDMA taking some care of it. Having it broken easily for >1 releases definitely is a sign of lack of users. It is an indication to the community that we should consider dropping some features so that we can get the best use of the community resources for the things that may have a broader audience.
> > > > >
> > > > > One thing majorly missing is an RDMA tester to guard all the merges so they don't break the RDMA paths, hopefully in CI. That should not rely on RDMA hardware but just sanity-check that the migration+rdma code runs fine. RDMA taught us the lesson, so we're requesting CI coverage for all other new features to be merged, at least for the migration subsystem; we plan not to merge anything that is not covered by CI unless extremely necessary in the future.
> > > > >
> > > > > For sure CI is not the only missing part, but I'd say we should start with it; then someone should also take care of the code, even if only in maintenance mode (no new features to add on top).
> > > > >
> > > > > > My concern is that this plan will force a few QEMU users (not sure how many) like us either to stick to RDMA migration by using an increasingly older version of QEMU, or to abandon the currently used RDMA migration.
> > > > >
> > > > > RDMA doesn't get new features anyway. If there's a specific use case for RDMA migrations, would it work if such a scenario uses the old binary?
> > > > > Is it possible to switch to the TCP protocol with some good NICs?
> > > >
> > > > We have used rdma migration with HCAs from Nvidia for years; our experience is that RDMA migration works better than tcp (over ipoib).
> > >
> > > Please bear with me, as I know little on rdma stuff.
> > >
> > > I'm actually pretty confused (and since a long time ago..) on why we need to operate with rdma contexts when ipoib seems to provide all the tcp layers. I meant, can it work with the current "tcp:" protocol with ipoib even if there's rdma/ib hardware underneath? Is it because of performance improvements so that we must use a separate path compared to the generic "tcp:" protocol here?
> >
> > Using the rdma protocol with ib verbs, we can leverage the full benefit of RDMA by talking directly to the NIC, which bypasses the kernel overhead: less cpu utilization and better performance.
> >
> > IPoIB, on the other hand, is more for compatibility with applications using tcp, but it can't get the full benefit of RDMA. When you have mixed generations of IB devices, there are performance issues on IPoIB; we've seen a 40G HCA reach only 2 Gb/s on IPoIB, while raw RDMA reaches full line speed.
> >
> > I just ran a simple iperf3 test via ipoib and ib_send_bw on the same hosts:
> >
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > Time: Tue, 09 Apr 2024 06:55:02 GMT
> > Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> > Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
> > TCP MSS: 0 (default)
> > [  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to 2a02:247f:401:4:2:0:b:3 port 41136
> > Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
> > [  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
> > [  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
> > [  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
> > [  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
> > [  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
> > [  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > Test Complete. Summary Results:
> > [ ID] Interval           Transfer     Bitrate
> > [  5] (sender statistics not available)
> > [  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec  receiver
> > rcv_tcp_congestion cubic
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> >
> > jwang@xxxxxxxxxxxx:~$ sudo ib_send_bw -F -a
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> >                     Send BW Test
> >  Dual-port       : OFF          Device         : mlx5_0
> >  Number of qps   : 1            Transport type : IB
> >  Connection type : RC           Using SRQ      : OFF
> >  PCIe relax order: ON
> >  ibv_wr* API     : ON
> >  RX depth        : 512
> >  CQ Moderation   : 100
> >  Mtu             : 4096[B]
> >  Link type       : IB
> >  Max inline data : 0[B]
> >  rdma_cm QPs     : OFF
> >  Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> >  local address: LID 0x24 QPN 0x0174 PSN 0x300138
> >  remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
> > ---------------------------------------------------------------------------------------
> >  #bytes   #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
> >  2        1000         0.00             6.46                3.385977
> >  4        1000         0.00             10.38               2.721894
> >  8        1000         0.00             25.69               3.367830
> >  16       1000         0.00             41.46               2.716859
> >  32       1000         0.00             102.98              3.374577
> >  64       1000         0.00             206.12              3.377053
> >  128      1000         0.00             405.03              3.318007
> >  256      1000         0.00             821.52              3.364939
> >  512      1000         0.00             2150.78             4.404803
> >  1024     1000         0.00             4288.13             4.391044
> >  2048     1000         0.00             8518.25             4.361346
> >  4096     1000         0.00             11440.77            2.928836
> >  8192     1000         0.00             11526.45            1.475385
> >  16384    1000         0.00             11526.06            0.737668
> >  32768    1000         0.00             11524.86            0.368795
> >  65536    1000         0.00             11331.84            0.181309
> >  131072   1000         0.00             11524.75            0.092198
> >  262144   1000         0.00             11525.82            0.046103
> >  524288   1000         0.00             11524.70            0.023049
> >  1048576  1000         0.00             11510.84            0.011511
> >  2097152  1000         0.00             11524.58            0.005762
> >  4194304  1000         0.00             11514.26            0.002879
> >  8388608  1000         0.00             11511.01            0.001439
> > ---------------------------------------------------------------------------------------
> >
> > You can see that with ipoib it reaches 16 Gb/s (TCP, 1 stream, 131072 byte blocks), while with RDMA it reaches 100 Gb/s at 4k+ message sizes.
>
> I get it now, thank you!
>
> > > > Switching back to TCP will lead us to the old problems which were solved by RDMA migration.
> > >
> > > Can you elaborate the problems, and why tcp won't work in this case? They may not be directly relevant to the issue we're discussing, but I'm happy to learn more.
> > >
> > > What are the NICs you were testing before? Was the test carried out with modern ones (50Gbps-200Gbps NICs), or was it done when such hardware was not common?
> >
> > We use Mellanox/NVidia IB HCAs from 40 Gb/s to 200 Gb/s, mixed generations, across the globe.
> >
> > > Per my recent knowledge on the new Intel hardwares, at least the ones that support QPL, it's easy to achieve single core 50Gbps+.
> >
> > In good cases, I've also seen 50 Gbps+ on Mellanox HCAs.
>
> I see. Have you compared the HCAs v.s. the modern NICs? Now NICs can achieve similar performance from their spec as I said; I am not sure how they perform in real life, but maybe worth trying. I only tried a 100G nic and I remember I can hit 70+Gbps with multifd migrations at peak bandwidth. Have you tried that before?

Yes, I recently tried a 100 G Eth NIC, but only with iperf, not yet with qemu migration. Yes, iperf can reach 90 Gbps with multiple streams.

> Note that here I didn't want to compare the performance between the two and find a winner. The issue we're facing now is that RDMA migration mostly has its own path all over the place, while the rest of the protocols (socket, fd, file, etc.) all share the rest.
>
> Then, _if_ modern NICs can work similarly v.s. rdma, I don't yet see a good reason to keep it. It could be that technology just improved so we can use less code to do as good. It's good news to help QEMU evolve by dropping unused code.
>
> For some details there on the rdma complications for migration:
>
> (1) RDMA is the only protocol that doesn't yet support QIOChannel, while migration uses QIOChannels mostly everywhere now.. e.g. in multifd, it means it won't easily support any new things using QIOChannels.
>
> (2) RDMA is the only protocol that is mostly hard-coded everywhere in the RAM migrations, polluting the core logic with much more code internally to support this protocol.
>
> For (1), see migrate_fd_connect() from rdma_start_outgoing_migration(), while the rest of the protocols all go via migration_channel_connect().
>
> For (2), see all the "rdma_*" functions in migration/ram.c, where I don't think it's common to a protocol - most of the rest of the protocols don't need that hard-coded stuff. migration/rdma.c has 4000+ LOC for this stuff, while, to make a not-so-fair comparison, migration/fd.c only has <100 LOC.
>
> Then, we found we don't even know who's using it.
>
> I hope I explained why people started this idea, and also why I think that makes sense at least to me.

Yes, I can understand that rdma migration has become more of a burden for upstream maintainers.

> > > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > Quote from Yuan:
> > >
> > > Yes, I use iperf3 to check the bandwidth for one core, the bandwidth is 60Gbps.
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is almost 100%.
> > >
> > > It boils down to what old problems were there with tcp first, though.
> >
> > Yeah, this is the key reason we use RDMA (low cpu utilization and better performance).
> >
> > > > > Per our best knowledge, RDMA users are rare, and please let anyone know if you are aware of such users. IIUC the major reason why RDMA stopped being the trend is because the network is not like ten years ago; I don't think I have good knowledge in RDMA at all nor network, but my understanding is it's pretty easy to fetch modern NICs to outperform RDMA, so it may make little sense to maintain multiple protocols, considering the RDMA migration code is so special that it has the most custom code compared to other protocols.
> > > >
> > > > +cc some guys from Huawei.
> > > >
> > > > I'm surprised RDMA users are rare; I guess maybe many are just working with a different code base.
> > >
> > > Yes, please cc whoever might be interested (or surprised.. :) to know this, and let's be open to all possibilities.
> > >
> > > I don't think it makes sense to deprecate a feature without a good reason if it still has a lot of users. However there's always the resource limitation issue we're facing, so it could still have the possibility that this gets deprecated if nobody is working on our upstream branch. Say, if people use private branches anyway to support rdma without collaborating upstream, keeping such a feature upstream then may not make much sense either, unless there's some way to collaborate. We'll see.
> > Is there a document/link about the unittest/CI for migration tests? Why are those tests missing? Is it hard or very special to set up an environment for that? Maybe we can help in this regard.
>
> See tests/qtest/migration-test.c. We put most of our migration tests there and that's covered in CI.

Yu is looking into that, to see if we can run the CI on our side.

> I think one major issue is CI systems don't normally have rdma devices. Can an rdma migration test be carried out without real hardware?

As Zhijian mentioned, we can use SoftRoCE (rxe); see the small probe sketch at the end of this mail.

> > > It seems there can still be people joining this discussion. I'll hold off a bit on merging this patch to provide enough window for anyone to chime in.
> >
> > Thx for discussion and understanding.
>
> Thanks for all these inputs so far. These can help us make a wiser and clearer step no matter which way we choose.
>
> --
> Peter Xu

Thx!
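P.S. On the CI question above: below is a rough, untested sketch (plain C, outside the QEMU tree; the file name and device names are just placeholders) of the kind of probe a test harness could run first, so the rdma migration tests are skipped when no verbs device is visible. With SoftRoCE, a device can be created with something like "modprobe rdma_rxe; rdma link add rxe0 type rxe netdev eth0" and it should then show up in the list printed here.

    /* probe_rdma.c - rough sketch, untested.
     * Lists the verbs-capable devices libibverbs can see; returns non-zero
     * when there is none, so a test runner can skip the rdma cases.
     * Build with: cc probe_rdma.c -o probe_rdma -libverbs
     */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list || num == 0) {
            fprintf(stderr, "no RDMA device found, skip rdma migration tests\n");
            if (list) {
                ibv_free_device_list(list);
            }
            return 1;
        }

        for (int i = 0; i < num; i++) {
            /* a SoftRoCE device typically shows up as "rxe0", "rxe1", ... */
            printf("found RDMA device: %s\n", ibv_get_device_name(list[i]));
        }

        ibv_free_device_list(list);
        return 0;
    }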