On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> > e) Someone made a good suggestion (sorry can't remember who) - that the
> >    RDMA migration structure was the wrong way around - it should be the
> >    destination which initiates an RDMA read, rather than the source
> >    doing a write; then things might become a LOT simpler; you just need
> >    to send page ranges to the destination and it can pull it.
> >    That might work nicely for postcopy.
>
> I'm not sure whether it'll still be a problem if rdma recv side is based
> on zero-copy.  It would be a matter of whether atomicity can be
> guaranteed so that we don't want the guest vcpus to see a partially
> copied page during on-flight DMAs.  UFFDIO_COPY (or friend) is currently
> the only solution for that.

And thinking about this more (given UFFDIO_COPY's nature of not being able
to do zero-copy...), the only way this can be zero-copy is to use file
memories (shmem/hugetlbfs), because their page cache can be prepopulated.
Then when we do DMA the data lands in the page cache, which can be mapped
at another virtual address besides the one the vcpus are using.  After
that we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do the atomic
update of the vcpu pgtables, avoiding the copy.  (A rough sketch of that
flow is appended at the end of this mail.)

QEMU doesn't have that yet, but it looks like there's one more reason we
may want to make better use of shmem rather than anonymous memory.  And
actually, when working on 4k faults on 1G hugetlb, I added CONTINUE
support:

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5

Maybe it's worthwhile on its own now, because it also means we can use it
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue here of copying data directly
into guest pages).  It should also work with things like RDMA, I think, in
similar ways.  It's just that it won't work on anonymous memory.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here.

Thanks,

-- 
Peter Xu
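Appended sketch, for reference: this is a minimal, untested illustration of
the shmem + minor-fault + UFFDIO_CONTINUE flow described above, not the
actual QEMU code from the doublemap branch.  The names (guest/recv/len) and
the memset standing in for the RDMA/multifd receive are made up; only the
memfd/userfaultfd kernel API calls are real, and error handling is omitted.

/*
 * Untested sketch: receive data into shmem page cache via an alias
 * mapping, then install it atomically into the vcpu-visible mapping
 * with UFFDIO_CONTINUE (no data copy, unlike UFFDIO_COPY).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;       /* some guest memory chunk */
    size_t psize = getpagesize();

    /* File-backed memory: the page cache can be populated out of band */
    int memfd = memfd_create("guest-mem", 0);
    ftruncate(memfd, len);

    /* Mapping the vcpus would use; minor faults are trapped here */
    void *guest = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, memfd, 0);
    /* Alias mapping of the same offsets, used by the receiver / DMA */
    void *recv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, memfd, 0);

    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_MINOR_SHMEM,
    };
    ioctl(uffd, UFFDIO_API, &api);

    /* Register the vcpu-visible range for minor (page cache present,
     * pgtable entry missing) faults */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)guest, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MINOR,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* "Receive" one page straight into the page cache through the alias
     * mapping; in the real thing this would be the RDMA/multifd recv
     * landing here, with no intermediate buffer. */
    memset(recv, 0xaa, psize);

    /* Atomically install the already-populated page cache page into the
     * vcpu pgtables; no copy happens at this point. */
    struct uffdio_continue cont = {
        .range = { .start = (unsigned long)guest, .len = psize },
        .mode = 0,
    };
    ioctl(uffd, UFFDIO_CONTINUE, &cont);

    return 0;
}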