On Fri, Sep 06, 2013 at 08:40:26PM +1000, Alexey Kardashevskiy wrote: > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > and H_STUFF_TCE requests targeted an IOMMU TCE table without passing > them to user space which saves time on switching to user space and back. > > Both real and virtual modes are supported. The kernel tries to > handle a TCE request in the real mode, if fails it passes the request > to the virtual mode to complete the operation. If it a virtual mode > handler fails, the request is passed to user space. > > The first user of this is VFIO on POWER. Trampolines to the VFIO external > user API functions are required for this patch. > > This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus > number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling > of map/unmap requests. The device supports a single attribute which is > a struct with LIOBN and IOMMU fd. When the attribute is set, the device > establishes the connection between KVM and VFIO. > > Tests show that this patch increases transmission speed from 220MB/s > to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > > Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx> > Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > > --- > > Changes: > v10: > * all IOMMU TCE links are handled by one KVM device now > * KVM device has its own list of TCE descriptors > * the search-by-liobn function was extended to search through > emulated and IOMMU lists > > v9: > * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU" > KVM device > * release_spapr_tce_table() is not shared between different TCE types > * reduced the patch size by moving KVM device bits and VFIO external API > trampolines to separate patches > * moved documentation from Documentation/virtual/kvm/api.txt to > Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > > v8: > * fixed warnings from check_patch.pl > > 2013/07/11: > * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled > for KVM_BOOK3S_64 > * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense > for this here but the next patch for hugepages support will use it more. > > 2013/07/06: > * added realmode arch_spin_lock to protect TCE table from races > in real and virtual modes > * POWERPC IOMMU API is changed to support real mode > * iommu_take_ownership and iommu_release_ownership are protected by > iommu_table's locks > * VFIO external user API use rewritten > * multiple small fixes > > 2013/06/27: > * tce_list page is referenced now in order to protect it from accident > invalidation during H_PUT_TCE_INDIRECT execution > * added use of the external user VFIO API > > 2013/06/05: > * changed capability number > * changed ioctl number > * update the doc article number > > 2013/05/20: > * removed get_user() from real mode handlers > * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there > translated TCEs, tries realmode_get_page() on those and if it fails, it > passes control over the virtual mode handler which tries to finish > the request handling > * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit > on a page > * The only reason to pass the request to user mode now is when the user mode > did not register TCE table in the kernel, in all other cases the virtual mode > handler is expected to do the job > --- > .../virtual/kvm/devices/spapr_tce_iommu.txt | 40 +++ > arch/powerpc/include/asm/kvm_host.h | 8 + > arch/powerpc/include/uapi/asm/kvm.h | 5 - > arch/powerpc/kvm/book3s_64_vio.c | 327 ++++++++++++++++++++- > arch/powerpc/kvm/book3s_64_vio_hv.c | 142 +++++++++ > arch/powerpc/kvm/powerpc.c | 1 + > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 5 + > 8 files changed, 517 insertions(+), 12 deletions(-) > create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > > diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > new file mode 100644 > index 0000000..b911945 > --- /dev/null > +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt > @@ -0,0 +1,40 @@ > +SPAPR TCE IOMMU device > + > +Capability: KVM_CAP_SPAPR_TCE_IOMMU > +Architectures: powerpc > + > +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU > + > +Groups: > + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE > + Attributes: one VFIO IOMMU fd per LIOBN, indexed by LIOBN > + > +This is completely made up device which provides API to link > +logical bus number (LIOBN) and IOMMU group. The user space has > +to create a new SPAPR TCE IOMMU device once per KVM session > +and use "set_attr" to add or remove a logical bus. > + > +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls > +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE). > +IOMMU group is a minimal isolated device set which can be passed to > +the user space via VFIO. > + > +The userspace adds the new LIOBN-IOMMU link by calling KVM_SET_DEVICE_ATTR > +with the attribute initialized as shown below: > +struct kvm_device_attr attr = { > + .flags = 0, > + .group = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE, > + .attr = liobn, > + .addr = (uint64_t)(uintptr_t)&group_fd, > +}; > + > +To remove the link, the userspace calls KVM_SET_DEVICE_ATTR with > +the group_fd equal to zero. > + Zero is a valid fd descriptor. Lest make it -1. > +As the device opens VFIO group fds and holds the file pointer, > +it does not need to keep an fd internally and therefore KVM_GET_DEVICE_ATTR > +is not supported. > + > +When KVM exits, all links are destroyed automatically. > + > +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability. Why KVM_CAP_SPAPR_TCE_IOMMU is needed? Supported devices are discoverable without capabilities. > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > index a23f132..a2a481e 100644 > --- a/arch/powerpc/include/asm/kvm_host.h > +++ b/arch/powerpc/include/asm/kvm_host.h > @@ -181,9 +181,15 @@ struct kvmppc_spapr_tce_table { > struct kvm *kvm; > u64 liobn; > u32 window_size; > + struct iommu_group *grp; /* used for IOMMU groups */ > + struct vfio_group *vfio_grp; /* used for IOMMU groups */ > struct page *pages[0]; > }; > > +struct kvmppc_spapr_tce_iommu_device { > + struct list_head tables; > +}; > + > struct kvmppc_linear_info { > void *base_virt; > unsigned long base_pfn; > @@ -264,6 +270,7 @@ struct kvm_arch { > #endif /* CONFIG_KVM_BOOK3S_64_HV */ > #ifdef CONFIG_PPC_BOOK3S_64 > struct list_head spapr_tce_tables; > + struct kvmppc_spapr_tce_iommu_device *tcedev; > struct list_head rtas_tokens; > #endif > #ifdef CONFIG_KVM_MPIC > @@ -612,6 +619,7 @@ struct kvm_vcpu_arch { > u64 busy_preempt; > > unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT */ > + unsigned long tce_tmp_num; /* Number of handled TCEs in cache */ > enum { > TCERM_NONE, > TCERM_GETPAGE, > diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h > index c1ae1e5..a9d3d73 100644 > --- a/arch/powerpc/include/uapi/asm/kvm.h > +++ b/arch/powerpc/include/uapi/asm/kvm.h > @@ -512,11 +512,6 @@ struct kvm_get_htab_header { > #define KVM_XICS_PENDING (1ULL << 42) > > /* SPAPR TCE IOMMU device specification */ > -struct kvm_create_spapr_tce_iommu_linkage { > - __u64 liobn; > - __u32 fd; > - __u32 flags; > -}; > #define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0 > > #endif /* __LINUX_KVM_POWERPC_H */ > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > index 2880d2b..0978794 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -29,6 +29,8 @@ > #include <linux/anon_inodes.h> > #include <linux/module.h> > #include <linux/vfio.h> > +#include <linux/iommu.h> > +#include <linux/file.h> > > #include <asm/tlbflush.h> > #include <asm/kvm_ppc.h> > @@ -158,10 +160,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > int i; > > /* Check this LIOBN hasn't been previously allocated */ > - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { > - if (stt->liobn == args->liobn) > - return -EBUSY; > - } > + if (kvmppc_find_tce_table(kvm, args->liobn)) > + return -EBUSY; > > npages = kvmppc_stt_npages(args->window_size); > > @@ -201,9 +201,175 @@ fail: > return ret; > } > > -/* Converts guest physical address to host virtual address */ > +static void kvmppc_spapr_tce_iommu_table_destroy( > + struct kvm_device *dev, > + struct kvmppc_spapr_tce_table *tt) > +{ > + struct kvm *kvm = dev->kvm; > + > + mutex_lock(&kvm->lock); > + list_del(&tt->list); > + > + if (tt->vfio_grp) > + kvmppc_vfio_group_put_external_user(tt->vfio_grp); > + iommu_group_put(tt->grp); > + > + kfree(tt); > + mutex_unlock(&kvm->lock); > +} > + > +static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type) > +{ > + struct kvmppc_spapr_tce_iommu_device *tcedev; > + int ret = 0; > + > + tcedev = kzalloc(sizeof(*tcedev), GFP_KERNEL); > + if (!tcedev) > + return -ENOMEM; > + dev->private = tcedev; > + > + INIT_LIST_HEAD(&tcedev->tables); > + > + /* Already there ? */ > + mutex_lock(&dev->kvm->lock); > + if (dev->kvm->arch.tcedev) > + ret = -EEXIST; > + else > + dev->kvm->arch.tcedev = tcedev; > + mutex_unlock(&dev->kvm->lock); > + > + if (ret) Need to free tcedev here. > + return ret; > + > + return 0; > +} > + -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html