Re: [Qemu-devel] host side todo list for virtio rdma

"Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> · Fri, 28 Jul 2017 19:02:55 +0100

* Michael S. Tsirkin (mst@xxxxxxxxxx) wrote:
> On Wed, Jul 19, 2017 at 11:55:50AM +0100, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (mst@xxxxxxxxxx) wrote:
> > > Here are some thoughts on bits that are still missing to get a working
> > > virtio-rdma, with some suggestions. These are very preliminary but I
> > > feel I kept these in my head (and discussed offline) for too long. All
> > > of the below is just my personal humble opinion.
> > > 
> > > Feature Requirements:
> > > 
> > > The basic requirement is to be able to do RDMA to/from
> > > VM memory, with support for VM migration and/or memory
> > > overcommit and/or autonuma and/or THP.
> > > Why are migration/overcommit/autonuma required?
> > > Without these, you can do RDMA with device passthrough,
> > > with likely better performance.
> > 
> > Is this solution usable on a system without host-RDMA hardware?
> > i.e. just to run RDMA between two VMs on the same host
> > without using something like SoftROCE on the host?
> 
> Hacks could be implemented to enable this. But IMHO this
> is yet another thing that should be a follow-up.
> Just like e.g. KVM, let's focus on capable hardware
> as the 1st step.

OK, it's an interesting question whether, if you have got the hardware,
it's better to use the hardware or the CPU to do inter-VM RDMA on
the same host.

> 
> > > Feature Non-requirements:
> > > 
> > > It's not a requirement to support RDMA without VM exits,
> > > e.g. like with device passthrough. While avoiding exits improves
> > > performance, it would be handy for more than RDMA,
> > > so there seems no reason to require it from RDMA when we
> > > do not have it for e.g. network.
> > > 
> > > Assumptions:
> > > 
> > > It's OK to assume specific hardware capabilities at least initially.
> > > 
> > > High level architecture:
> > > 
> > > Follows the same lines as most other virtio devices:
> > > 
> > > +-----------------------------------
> > > + 
> > > + guest kernel
> > > +             ^
> > > +-------------|----------------------
> > > +             v
> > > + host kernel (kvm, vhost)
> > > + 
> > > +             ^
> > > +-------------|----------------------
> > > +             v
> > > + 
> > > + host userspace (QEMU, vhost-user)
> > > + 
> > > +-----------------------------------
> > > 
> > > Each request is forwarded by host kernel to QEMU,
> > > that executes it using the ibverbs library.
> > 
> > Should that be 'forwarded by guest kernel' ?
> 
> No I really mean the host: we get requests from guest, they land in host
> kernel same as any exit.

OK, does it seem silly for a message to go all the way to the host
kernel to then have to go back down to QEMU to be turned back into verbs
to go back up to the host kernel?

> > Is there a guest-userspace here as well - most of the
> > RDMA NICs seem to have a userspace component.
> 
> Good point, I think you are right, there is. Bypassing
> guest kernel for data path requests seems like a reasonable
> requirement to add.
> 
> 
> > > Most of this should be implementable host-side using existing
> > > software. However, several issues remain and would need
> > > infrastructure changes, as outlined below.
> > > 
> > > Host-side todo list for virtio-rdma support:
> > > 
> > > - Memory registration for guest userspace.
> > > 
> > >   Register memory region verb accepts a single virtual address,
> > >   which supplies both the on-wire key for access and the
> > >   range of memory to access. Guest kernel turns this into a
> > >   list of pages (e.g. by get_user_pages); when forwarded to host this
> > >   turns into a s/g list of virtual addresses in QEMU address space.
> > > 
> > >   Suggestion: add a new verb, along the lines of ibv_register_physical,
> > >   which splits these two parameters, accepting the on-wire VA key
> > >   and separately a list of userspace virtual address/size pairs.
> > > 
> > > - Memory registration for guest kernels.
> > > 
> > >   Another ability used by some in-kernel users is registering all of memory.
> > >   Ranges not actually present are never accessed - this is OK as
> > >   kernel users are trusted. Memory hotplug changes which ranges
> > >   are present.
> > > 
> > >   Suggestion: add some throw-away memory and map all
> > >   non-present ranges there. Add ibv_reregister_physical_mr or similar
> > >   API to update mappings on guest memory hotplug/unplug.
> > > 
> > > - Memory overcommit/autonuma/THP.
> > > 
> > >   This includes techniques such as swap, KSM, COW and page migration.
> > >   All these rely on ability to move pages around without
> > >   breaking hardware access.
> > > 
> > >   Suggestion: for hardware that supports it,
> > >   enabling on-demand paging for all registered memory seems
> > >   to address the issue more or less transparently to guests.
> > >   This isn't supported by all hardware but might be
> > >   at least a reasonable first step.
> > > 
> > > - Migration: memory tracking.
> > > 
> > >   Migration requires detecting hardware access to pages
> > >   either on write (pre-copy) or any access (post-copy).
> > >   Post copy just requires ODP support to work with
> > >   userfaultfd properly.
> > 
> > Can you explain what ODP support is?
> 
> On demand paging. grep for odp and ODP in libibverbs sources.

OK, that sounds like another chunk of work needed for postcopy
but OK.

Dave

> > >   Pre-copy would require a write-tracking API along
> > >   the lines of one exposed by KVM or vhost.
> > >   Each tracked page would be write-protected (causing faults on
> > >   hardware access); on a hardware write, a fault is generated
> > >   and recorded, and the page is made writeable.
> > 
> > Can you write-protect like that from the RDMA hardware?
> > I'd be surprised if the hardware was happy with that.
> 
> With ODP capable hardware I think you should be able to.
> 
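To make the pre-copy write-tracking idea above concrete, here is a
minimal sketch in C of the protect / fault / record / unprotect cycle.
It is a pure software model under stated assumptions: struct wr_tracker
and the tracker_*() helpers are hypothetical names, not an existing KVM,
vhost or ibverbs API, and the "fault" is just a function call standing
in for what ODP-capable hardware would have to report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     64   /* hypothetical region size: one bit per page */

/* Hypothetical tracker: a write-protect bit and a dirty bit per page. */
struct wr_tracker {
    uint64_t protected_bits;
    uint64_t dirty_bits;
};

/* Start tracking: write-protect every page, clear the dirty log. */
static void tracker_start(struct wr_tracker *t)
{
    t->protected_bits = UINT64_MAX;
    t->dirty_bits = 0;
}

/*
 * Model a hardware write at 'offset' into the region.
 * Returns true if a write fault was taken (page was protected).
 */
static bool tracker_hw_write(struct wr_tracker *t, uint64_t offset)
{
    uint64_t page = offset >> PAGE_SHIFT;
    uint64_t bit;

    assert(page < NPAGES);
    bit = UINT64_C(1) << page;

    if (!(t->protected_bits & bit))
        return false;              /* already writeable: no fault */

    t->dirty_bits |= bit;          /* record the fault */
    t->protected_bits &= ~bit;     /* make the page writeable */
    return true;
}

/*
 * Migration pass: fetch and clear the dirty log, and re-protect the
 * pages that were dirtied so later writes are caught again.
 */
static uint64_t tracker_sync(struct wr_tracker *t)
{
    uint64_t dirty = t->dirty_bits;

    t->protected_bits |= dirty;
    t->dirty_bits = 0;
    return dirty;
}
```

Only the first write to a page after each sync takes a fault, which is
the same trade-off the KVM and vhost dirty logs make: the cost is paid
once per page per migration pass rather than once per write.
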
> > > - Migration: moving QP numbers.
> > > 
> > >   QP numbers are exposed on the wire and so must move together
> > >   with the VM.
> > > 
> > >   Suggestion: allow specifying QP number when creating a QP.
> > >   To avoid conflicts between multiple users, initial version can limit
> > >   library to a single user per device. Multiple VMs can simply
> > >   attach to distinct VFs.
> > > 
> > > - Migration: moving QP state.
> > > 
> > >   When migrating the VM, a QP has to be torn down
> > >   on source and created on destination.
> > >   We have to migrate e.g. the current PSN - but what
> > >   should happen when a new packet arrives on source
> > >   after QP has been torn down?
> > > 
> > >   Suggestion 1: move QP to a special state "suspended" and ignore
> > >   packets, or cause source to retransmit with e.g. an out of
> > >   resources error. Retransmit counter might need to be
> > >   adjusted compared to what guest requested to account
> > >   for the extra need to retransmit.
> > >   Is there a good existing QP state that does this?
> > > 
> > >   Suggestion 2: forward packets to destination somehow.
> > >   Might overload the fabric as we are crossing e.g.
> > >   pci bus multiple times.
> > > 
> > > - Migration: network update
> > > 
> > >   RoCE v1 and InfiniBand seem to tie connections to
> > >   hardware specific GIDs which can not be moved by software.
> > > 
> > >   Suggestion: limit migration to RoCE v2 initially.
> > > 
> > > - Migration: packet loss recovery.
> > > 
> > >   As a RoCE address moves across the network, network has
> > >   to be updated which takes time, meanwhile packet loss seems
> > >   to be hard to avoid.
> > > 
> > >   Suggestion: limit initial support to hardware that is
> > >   able to recover from occasional packet drops, with
> > >   some slowdown.
> > > 
> > > - Migration: suspend/resume API?
> > >   It might be easier to pack up state of all resources
> > >   such as all QP numbers and state of all QPs etc
> > >   in a single memory buffer, migrate then unpack on destination.
> > > 
> > >   Removes need for 2 separate APIs for suspended state and
> > >   for specifying QPN on creation.
> > > 
> > >   This creates a format for serialization that will have to
> > >   be maintained in a compatible way - it is not clear that
> > >   the maintenance overhead is worth the potential
> > >   simplification, if any.
> > > 
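On the suspend/resume item above, the compatibility concern can be
illustrated with a small sketch of a versioned serialization format.
Everything here is an assumption made for illustration: the struct
qp_state fields, the blob layout and the qp_state_pack()/unpack() names
are hypothetical, and a real QP carries far more state than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QP_BLOB_VERSION 1u
#define QP_BLOB_SIZE    16u  /* version + 3 fields, 4 bytes each */

/* Hypothetical minimal per-QP state to carry across migration. */
struct qp_state {
    uint32_t qpn;     /* QP number, exposed on the wire */
    uint32_t sq_psn;  /* next send PSN */
    uint32_t rq_psn;  /* next expected receive PSN */
};

/* Fixed little-endian encoding so source and destination agree. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Pack on the source. Returns bytes written, or 0 if buf is too small. */
static size_t qp_state_pack(const struct qp_state *s, uint8_t *buf, size_t len)
{
    if (len < QP_BLOB_SIZE)
        return 0;
    put_le32(buf, QP_BLOB_VERSION);
    put_le32(buf + 4, s->qpn);
    put_le32(buf + 8, s->sq_psn);
    put_le32(buf + 12, s->rq_psn);
    return QP_BLOB_SIZE;
}

/* Unpack on the destination. Returns 0 on success, -1 on a short
 * buffer or an unknown format version. */
static int qp_state_unpack(struct qp_state *s, const uint8_t *buf, size_t len)
{
    if (len < QP_BLOB_SIZE || get_le32(buf) != QP_BLOB_VERSION)
        return -1;
    s->qpn = get_le32(buf + 4);
    s->sq_psn = get_le32(buf + 8);
    s->rq_psn = get_le32(buf + 12);
    return 0;
}
```

The version check is where the maintenance cost shows up: every change
to the blob layout needs a new version and, for cross-version migration,
an unpack path for each older one.
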
> > > 
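Going back to the first todo item (memory registration for guest
userspace), the split between the on-wire VA and the list of userspace
address/size pairs can be sketched as a lookup. This is only a model of
the translation such a verb would imply, assuming segments are laid out
back to back in on-wire address order; struct va_seg, struct split_mr
and split_mr_translate() are hypothetical names, not libibverbs API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One s/g entry: a range in QEMU's (host userspace) address space. */
struct va_seg {
    uint64_t host_va;
    uint64_t len;
};

/* Hypothetical split registration: on-wire base VA plus the s/g list. */
struct split_mr {
    uint64_t wire_va;            /* address the remote side uses */
    const struct va_seg *segs;   /* backing ranges, in wire order */
    size_t nsegs;
};

/*
 * Translate an on-wire address to a host virtual address.
 * Returns 0 if the address falls outside the registered range.
 */
static uint64_t split_mr_translate(const struct split_mr *mr,
                                   uint64_t wire_addr)
{
    uint64_t off;
    size_t i;

    if (wire_addr < mr->wire_va)
        return 0;
    off = wire_addr - mr->wire_va;

    for (i = 0; i < mr->nsegs; i++) {
        if (off < mr->segs[i].len)
            return mr->segs[i].host_va + off;
        off -= mr->segs[i].len;  /* skip past this segment */
    }
    return 0;
}
```

For example, with a wire base of 0x10000 backed by a 0x1000-byte segment
followed by a 0x2000-byte segment, wire address 0x11000 lands at the
start of the second segment, wherever that happens to sit in QEMU's
address space.
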
> > > That's it - I hope this helps, feel free to discuss, preferably copying
> > > virtio-dev (subscription required for now, people are looking into
> > > fixing this, sorry about that).
> > 
> > Dave
> > 
> > > Thanks!
> > > 
> > > -- 
> > > MST
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK