Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +0000, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgalaxy@xxxxxxxxxx) wrote:
> > One thing to keep in mind here (despite me not having any hardware
> > to test) was that one of the original goals here in the RDMA
> > implementation was not simply raw throughput nor raw latency, but a
> > lack of CPU utilization in kernel space due to the offload. While
> > it is entirely possible that newer hardware w/ TCP might compete,
> > the significant reductions in CPU usage in the TCP/IP stack were a
> > big win at the time.
> >
> > Just something to consider while you're doing the testing........
>
> I just noticed this thread; some random notes from a somewhat
> fragmented memory of this:
>
> a) Long long ago, I also tried rsocket;
> https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> as I remember the library was quite flaky at the time.

Hmm, interesting.  It also looks like there's a thread doing rpoll().

Btw, not sure whether you noticed, but there's a series posted for the
latest rsocket conversion here:

https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gonglei@xxxxxxxxxx

I hope Lei and his team have tested >4G mem; otherwise it's definitely
worth checking.  Lei also mentioned some rsocket bugs they found in the
cover letter, but I'm not sure what those are about.

> b) A lot of the complexity in the rdma migration code comes from
> emulating a stream to carry the migration control data and
> interleaving that with the actual RAM copy.  I believe the original
> design used a separate TCP socket for the control data, and just
> used the RDMA for the data - that should be a lot simpler (but alas
> was rejected in review early on)
>
> c) I can't remember the last benchmarks I did; but I think I did
> manage to beat RDMA with multifd; but yes, multifd does eat host CPU
> whereas RDMA barely uses a whisper.

I think my first impression on this matter came from you on this
one. :)

> d) The 'zero-copy-send' option in migrate may well get some of that
> CPU time back; but if I remember we were still bottlenecked on the
> receive side. (I can't remember if zero-copy-send worked with
> multifd?)

Yes, and zero-copy currently requires multifd.  I think that's because
we didn't want to complicate the header processing in the migration
stream, where the data may not be page aligned.

> e) Someone made a good suggestion (sorry can't remember who) - that
> the RDMA migration structure was the wrong way around - it should be
> the destination which initiates an RDMA read, rather than the source
> doing a write; then things might become a LOT simpler; you just need
> to send page ranges to the destination and it can pull it.
> That might work nicely for postcopy.

I'm not sure whether that would still be a problem if the rdma recv
side were based on zero-copy.  It becomes a question of whether
atomicity can be guaranteed, so that the guest vCPUs never see a
partially copied page while DMAs are in flight.

UFFDIO_COPY (or a friend) is currently the only solution for that.

Thanks,

-- 
Peter Xu