On Wed, Jul 19, 2017 at 11:55:50AM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@xxxxxxxxxx) wrote:
> > Here are some thoughts on bits that are still missing to get a working
> > virtio-rdma, with some suggestions. These are very preliminary but I
> > feel I kept these in my head (and discussed offline) for too long. All
> > of the below is just my personal humble opinion.
> >
> > Feature Requirements:
> >
> > The basic requirement is to be able to do RDMA to/from
> > VM memory, with support for VM migration and/or memory
> > overcommit and/or autonuma and/or THP.
> Why are migration/overcommit/autonuma required?
> > Without these, you can do RDMA with device passthrough,
> > with likely better performance.
>
> Is this solution usable on a system without host-RDMA hardware?
> i.e. just to run RDMA between two VMs on the same host
> without using something like SoftROCE on the host?

Hacks could be implemented to enable this. But IMHO this is yet
another thing that should be a follow-up. Just like e.g. KVM, let's
focus on capable hardware as the 1st step.

> > Feature Non-requirements:
> >
> > It's not a requirement to support RDMA without VM exits,
> > e.g. like with device passthrough. While avoiding exits improves
> > performance, it would be handy for more than RDMA,
> > so there seems no reason to require it from RDMA when we
> > do not have it for e.g. network.
> >
> > Assumptions:
> >
> > It's OK to assume specific hardware capabilities at least initially.
> >
> > High level architecture:
> >
> > Follows the same lines as most other virtio devices:
> >
> > +-----------------------------------
> > +
> > +          guest kernel
> > +             ^
> > +-------------|----------------------
> > +             v
> > +      host kernel (kvm, vhost)
> > +
> > +             ^
> > +-------------|----------------------
> > +             v
> > +
> > +  host userspace (QEMU, vhost-user)
> > +
> > +-----------------------------------
> >
> > Each request is forwarded by the host kernel to QEMU,
> > which executes it using the ibverbs library.
>
> Should that be 'forwarded by guest kernel' ?

No, I really mean the host: we get requests from the guest, and they
land in the host kernel same as any exit.

> Is there a guest-userspace here as well - most of the
> RDMA NICs seem to have a userspace component.

Good point, I think you are right, there is. Bypassing the guest
kernel for data path requests seems like a reasonable requirement to
add.

> > Most of this should be implementable host-side using existing
> > software. However, several issues remain and would need
> > infrastructure changes, as outlined below.
> >
> > Host-side todo list for virtio-rdma support:
> >
> > - Memory registration for guest userspace.
> >
> > The register memory region verb accepts a single virtual address,
> > which supplies both the on-wire key for access and the
> > range of memory to access. The guest kernel turns this into a
> > list of pages (e.g. by get_user_pages); when forwarded to the host this
> > turns into an s/g list of virtual addresses in QEMU address space.
> >
> > Suggestion: add a new verb, along the lines of ibv_register_physical,
> > which splits these two parameters, accepting the on-wire VA key
> > and separately a list of userspace virtual address/size pairs.
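To make the proposed split concrete, here is a rough sketch of what such
a verb could look like. This is purely illustrative: ibv_register_physical
and struct ibv_phys_range do not exist in libibverbs today, and the names
and parameters below are made up for the sake of discussion.

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/*
 * Today's verb,
 *
 *     struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
 *                               size_t length, int access);
 *
 * uses 'addr' both as the on-wire VA and as the range to map.
 * The proposal splits the two (illustrative prototype only):
 */

struct ibv_phys_range {
        void   *addr;    /* virtual address in the caller (QEMU) */
        size_t  length;  /* length of this range in bytes */
};

/*
 * 'wire_addr' is the VA the remote side uses together with the rkey;
 * 'ranges' is the s/g list of QEMU virtual addresses backing it.
 */
struct ibv_mr *ibv_register_physical(struct ibv_pd *pd,
                                     uint64_t wire_addr,
                                     const struct ibv_phys_range *ranges,
                                     int num_ranges,
                                     int access);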
> > - Memory registration for guest kernels.
> >
> > Another ability used by some in-kernel users is registering all of memory.
> > Ranges not actually present are never accessed - this is OK as
> > kernel users are trusted. Memory hotplug changes which ranges
> > are present.
> >
> > Suggestion: add some throw-away memory and map all
> > non-present ranges there. Add ibv_reregister_physical_mr or similar
> > API to update mappings on guest memory hotplug/unplug.
> >
> > - Memory overcommit/autonuma/THP.
> >
> > This includes techniques such as swap, KSM, COW, page migration.
> > All these rely on the ability to move pages around without
> > breaking hardware access.
> >
> > Suggestion: for hardware that supports it,
> > enabling on-demand paging for all registered memory seems
> > to address the issue more or less transparently to guests.
> > This isn't supported by all hardware but might be
> > at least a reasonable first step.
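For reference, a minimal sketch of what such an ODP registration looks
like with today's libibverbs. The reg_odp_mr helper name is made up,
error handling is trimmed, and a real implementation would also have to
check the per-transport odp_caps for the operations it needs.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch: check that the device reports ODP support and
 * register a range with IBV_ACCESS_ON_DEMAND, so the HCA faults
 * pages in on access instead of requiring them to be pinned. */
static struct ibv_mr *reg_odp_mr(struct ibv_context *ctx,
                                 struct ibv_pd *pd,
                                 void *addr, size_t len)
{
        struct ibv_device_attr_ex attr;

        if (ibv_query_device_ex(ctx, NULL, &attr) ||
            !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
                fprintf(stderr, "device does not report ODP support\n");
                return NULL;
        }

        return ibv_reg_mr(pd, addr, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
}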
> > - Migration: memory tracking.
> >
> > Migration requires detecting hardware access to pages,
> > either on write (pre-copy) or on any access (post-copy).
> > Post-copy just requires ODP support to work with
> > userfaultfd properly.
>
> Can you explain what ODP support is?

On-demand paging; grep for odp and ODP in the libibverbs sources.

> > Pre-copy would require a write-tracking API along
> > the lines of one exposed by KVM or vhost.
> > Each tracked page would be write-protected (causing faults on
> > hardware access); on a hardware write a fault is generated
> > and recorded, and the page is made writeable.
>
> Can you write-protect like that from the RDMA hardware?
> I'd be surprised if the hardware was happy with that.

With ODP-capable hardware I think you should be able to.

> > - Migration: moving QP numbers.
> >
> > QP numbers are exposed on the wire and so must move together
> > with the VM.
> >
> > Suggestion: allow specifying the QP number when creating a QP.
> > To avoid conflicts between multiple users, an initial version can limit
> > the library to a single user per device. Multiple VMs can simply
> > attach to distinct VFs.
> >
> > - Migration: moving QP state.
> >
> > When migrating the VM, a QP has to be torn down
> > on the source and created on the destination.
> > We have to migrate e.g. the current PSN - but what
> > should happen when a new packet arrives on the source
> > after the QP has been torn down?
> >
> > Suggestion 1: move the QP to a special state "suspended" and ignore
> > packets, or cause the source to retransmit with e.g. an out of
> > resources error. The retransmit counter might need to be
> > adjusted compared to what the guest requested to account
> > for the extra need to retransmit.
> > Is there a good existing QP state that does this?
> >
> > Suggestion 2: forward packets to the destination somehow.
> > Might overload the fabric as we are crossing e.g. the
> > PCI bus multiple times.
> >
> > - Migration: network update.
> >
> > RoCE v1 and InfiniBand seem to tie connections to
> > hardware-specific GIDs which cannot be moved by software.
> >
> > Suggestion: limit migration to RoCE v2 initially.
> >
> > - Migration: packet loss recovery.
> >
> > As a RoCE address moves across the network, the network has
> > to be updated, which takes time; meanwhile packet loss seems
> > to be hard to avoid.
> >
> > Suggestion: limit initial support to hardware that is
> > able to recover from occasional packet drops, with
> > some slowdown.
> >
> > - Migration: suspend/resume API?
> >
> > It might be easier to pack up the state of all resources,
> > such as all QP numbers and the state of all QPs etc.,
> > in a single memory buffer, migrate it, then unpack on the destination.
> >
> > This removes the need for two separate APIs, one for the suspended state
> > and one for specifying the QPN on creation.
> >
> > It also creates a format for serialization that will have to
> > be maintained in a compatible way - it is not clear that
> > the maintenance overhead is worth the potential
> > simplification, if any.
> >
> >
> > That's it - I hope this helps, feel free to discuss, preferably copying
> > virtio-dev (subscription required for now, people are looking into
> > fixing this, sorry about that).
>
> Dave
>
> > Thanks!
> >
> > --
> > MST
>
> --
> Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK