On Wed, 14 Dec 2016 14:53:13 +1100 Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > On 10/12/16 02:35, Alex Williamson wrote: > > On Fri, 9 Dec 2016 18:53:43 +1100 > > Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > > > >> On 09/12/16 04:55, Alex Williamson wrote: > >>> On Thu, 8 Dec 2016 19:19:56 +1100 > >>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > >>> > >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > >>>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > >>>> without passing them to user space which saves time on switching > >>>> to user space and back. > >>>> > >>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. > >>>> KVM tries to handle a TCE request in the real mode, if failed > >>>> it passes the request to the virtual mode to complete the operation. > >>>> If it a virtual mode handler fails, the request is passed to > >>>> the user space; this is not expected to happen though. > >>>> > >>>> To avoid dealing with page use counters (which is tricky in real mode), > >>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required > >>>> to pre-register the userspace memory. The very first TCE request will > >>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view > >>>> of the TCE table (iommu_table::it_userspace) is not allocated till > >>>> the very first mapping happens and we cannot call vmalloc in real mode. > >>>> > >>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to > >>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd > >>>> and associates a physical IOMMU table with the SPAPR TCE table (which > >>>> is a guest view of the hardware IOMMU table). The iommu_table object > >>>> is referenced so we do not have to retrieve in real mode when hypercall > >>>> happens. > >>>> > >>>> This does not implement the UNSET counterpart as there is no use for it - > >>>> once the acceleration is enabled, the existing userspace won't > >>>> disable it unless a VFIO container is detroyed so this adds necessary > >>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. > >>>> > >>>> This uses the kvm->lock mutex to protect against a race between > >>>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's > >>>> release() callback. > >>>> > >>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > >>>> space. > >>>> > >>>> This finally makes use of vfio_external_user_iommu_id() which was > >>>> introduced quite some time ago and was considered for removal. > >>>> > >>>> Tests show that this patch increases transmission speed from 220MB/s > >>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > >>>> > >>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > >>>> --- > >>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +- > >>>> arch/powerpc/include/asm/kvm_host.h | 8 + > >>>> arch/powerpc/include/asm/kvm_ppc.h | 5 + > >>>> include/uapi/linux/kvm.h | 8 + > >>>> arch/powerpc/kvm/book3s_64_vio.c | 302 +++++++++++++++++++++++++++++ > >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 178 +++++++++++++++++ > >>>> arch/powerpc/kvm/powerpc.c | 2 + > >>>> virt/kvm/vfio.c | 108 +++++++++++ > >>>> 8 files changed, 630 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt > >>>> index ef51740c67ca..ddb5a6512ab3 100644 > >>>> --- a/Documentation/virtual/kvm/devices/vfio.txt > >>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt > >>>> @@ -16,7 +16,24 @@ Groups: > >>>> > >>>> KVM_DEV_VFIO_GROUP attributes: > >>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking > >>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>> + for the VFIO group. > >>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking > >>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>> + for the VFIO group. > >>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table > >>>> + allocated by sPAPR KVM. > >>>> + kvm_device_attr.addr points to a struct: > >>>> > >>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor > >>>> -for the VFIO group. > >>>> + struct kvm_vfio_spapr_tce { > >>>> + __u32 argsz; > >>>> + __s32 groupfd; > >>>> + __s32 tablefd; > >>>> + __u8 pad[4]; > >>>> + }; > >>>> + > >>>> + where > >>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; > >>>> + @groupfd is a file descriptor for a VFIO group; > >>>> + @tablefd is a file descriptor for a TCE table allocated via > >>>> + KVM_CREATE_SPAPR_TCE. > >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h > >>>> index 28350a294b1e..94774503c70d 100644 > >>>> --- a/arch/powerpc/include/asm/kvm_host.h > >>>> +++ b/arch/powerpc/include/asm/kvm_host.h > >>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo { > >>>> atomic_t refcnt; > >>>> }; > >>>> > >>>> +struct kvmppc_spapr_tce_iommu_table { > >>>> + struct rcu_head rcu; > >>>> + struct list_head next; > >>>> + struct iommu_table *tbl; > >>>> + atomic_t refs; > >>>> +}; > >>>> + > >>>> struct kvmppc_spapr_tce_table { > >>>> struct list_head list; > >>>> struct kvm *kvm; > >>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table { > >>>> u32 page_shift; > >>>> u64 offset; /* in pages */ > >>>> u64 size; /* window size in pages */ > >>>> + struct list_head iommu_tables; > >>>> struct page *pages[0]; > >>>> }; > >>>> > >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h > >>>> index 0a21c8503974..17b947a0060d 100644 > >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h > >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h > >>>> @@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm, > >>>> extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu, > >>>> struct kvm_memory_slot *memslot, unsigned long porder); > >>>> extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); > >>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, > >>>> + int tablefd, > >>>> + struct iommu_group *grp); > >>>> +extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm, > >>>> + struct iommu_group *grp); > >>>> > >>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, > >>>> struct kvm_create_spapr_tce_64 *args); > >>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > >>>> index 810f74317987..9e4025724e28 100644 > >>>> --- a/include/uapi/linux/kvm.h > >>>> +++ b/include/uapi/linux/kvm.h > >>>> @@ -1068,6 +1068,7 @@ struct kvm_device_attr { > >>>> #define KVM_DEV_VFIO_GROUP 1 > >>>> #define KVM_DEV_VFIO_GROUP_ADD 1 > >>>> #define KVM_DEV_VFIO_GROUP_DEL 2 > >>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE 3 > >>>> > >>>> enum kvm_device_type { > >>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, > >>>> @@ -1089,6 +1090,13 @@ enum kvm_device_type { > >>>> KVM_DEV_TYPE_MAX, > >>>> }; > >>>> > >>>> +struct kvm_vfio_spapr_tce { > >>>> + __u32 argsz; > >>>> + __s32 groupfd; > >>>> + __s32 tablefd; > >>>> + __u8 pad[4]; > >>>> +}; > >>> > >>> I assume you're implementing argsz and padding for future expansion, > >>> but it doesn't really work. Presumably argsz would be set to 16, so > >>> the only way that the kernel will ever know something has changed would > >>> be to make it bigger, so the padding bytes are really reserved, and > >>> then it's not clear why we have padding at all. If you replaced the > >>> padding with a __u32 flags, then we could actually have some room to > >>> architect expansion, but as it is we might as well drop both argsz and > >>> padding. > >> > >> Ah, right, sorry, did not pay attention to this bit this time. I'll replace > >> pad with flags and move to argsz. > >> > >> > >>> > >>>> + > >>>> /* > >>>> * ioctls for VM fds > >>>> */ > >>> ... > >>>> --- a/virt/kvm/vfio.c > >>>> +++ b/virt/kvm/vfio.c > >>>> @@ -20,6 +20,10 @@ > >>>> #include <linux/vfio.h> > >>>> #include "vfio.h" > >>>> > >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>> +#include <asm/kvm_ppc.h> > >>>> +#endif > >>>> + > >>>> struct kvm_vfio_group { > >>>> struct list_head node; > >>>> struct vfio_group *vfio_group; > >>>> @@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) > >>>> return ret > 0; > >>>> } > >>>> > >>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) > >>>> +{ > >>>> + int (*fn)(struct vfio_group *); > >>>> + int ret = -1; > >>>> + > >>>> + fn = symbol_get(vfio_external_user_iommu_id); > >>>> + if (!fn) > >>>> + return ret; > >>>> + > >>>> + ret = fn(vfio_group); > >>>> + > >>>> + symbol_put(vfio_external_user_iommu_id); > >>>> + > >>>> + return ret; > >>>> +} > >>>> + > >>>> /* > >>>> * Groups can use the same or different IOMMU domains. If the same then > >>>> * adding a new group may change the coherency of groups we've previously > >>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev) > >>>> mutex_unlock(&kv->lock); > >>>> } > >>>> > >>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, > >>>> + struct vfio_group *vfio_group) > >>>> +{ > >>>> + int group_id; > >>>> + struct iommu_group *grp; > >>>> + > >>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); > >>>> + grp = iommu_group_get_by_id(group_id); > >>>> + if (grp) { > >>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); > >>>> + iommu_group_put(grp); > >>>> + } > >>>> +} > >>>> +#endif > >>> > >>> > >>> I wonder if you could use the new vfio group notifier to avoid tainting > >>> this code with SPAPR_TCE #ifdefs. Thanks, > >> > >> I cannot see how... The new notifier sets kvm to a group, I need the > >> opposite - attach a group to kvm and not just to KVM but to a specific KVM > >> SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id). > >> > >> The only way I see how I can avoid tainting this code is adding another > >> ioctl() to PPC KVM (or its child object - SPAPR TCE object), and I tried > >> that few years ago and I was told to add a KVM device or even reuse VFIO > >> KVM device. > >> > >> What am I missing here? > > > > You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly > > part of encompassing this all in #ifdefs is that we call > > kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would > > sort of make it like an arch callback, vfio core would set these > > attributes and broadcast them to notifier callbacks, on non-spapr-tce > > platforms nobody would be listening for those notifications. > > Ultimately I don't know how much cleaner it makes things, but it maybe > > avoids spapr-tce #ifdefs leaking into every layer of the stack. Thanks, > > I am failing here. > > The normal workflow: > - create SPAPR TCE object in KVM, represents 1 LIOBN aka DMA window; > - attach IOMMU group to SPAPR TCE object, in this step the hardware tables > (1 or 2 iommu_table objects) are put to the SPAPR TCE's list of attached > tables; the tables are referenced. > > When reboot happens, the SPAPR TCE object is destroyed and new guest starts > from the very beginning. > > > The task I am solving: dereference iommu_table (hardware table) in 2 cases: > 1) guest reboot - SPAPR TCE table is destroyed but VFIO KVM device is still > there with all attached groups; > 2) VFIO PCI hot unplug - SPAPR TCE table is there but groups need to be > detached from the VFIO KVM device. > > > Tried fixing 2) with the new callbacks: > > - they do not take iommu_group, they take devices - fixed by duplicating > the vfio_(un)register_notifier API: > * vfio_iommu_group_register_notifier() > * vfio_iommu_group_unregister_notifier() > plus kvm wrappers with symbol_get/symbol_put in > arch/powerpc/kvm/book3s_64_vio.c. > > - the callbacks are registered per a IOMMU group, the only place in this > code which knows about groups is SPAPR TCE driver but attach_group() > callback is called when IOMMU driver is not yet set -> > vfio_register_notifier() fails. KVM does not know about groups until > KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE so I register > callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE; > > - the callback does not pass vfio_group pointer, only kvm; so my notifier > needs to be wrapped into a struct with a group pointer, ok, done; > > - the notifier needs to be unregistered and the wrapper struct from the > previous step needs to be freed. No nice mechanisms for that - I cannot > unregister a notifier from a notifier itself. I fixed it by calling > rcu_sched() from the notifier when KVM==NULL and RCU-scheduled callback > calls vfio_iommu_group_unregister_notifier(). > > Looks quite ugly already but ok. > > > Now I am trying to solve 1). I can dereference iommu_table objects but > registered notifiers remain in memory and they are not freed so each guest > reboot increases the list length. I do not have a way to access vfio_group > structs from KVM, there I only have a list of iommu_table structs, each of > which has a list of iommu_group structs (this is done this way to make real > mode handlers possible) but there is no way to get to vfio_group struct > from iommu_group. > > I can duplicate group list once again, this time is will be vfio_group list > attached to SPAPR TCE object but this all seems to be way to much, does not > it?... Ok, thanks for trying. It does seem like it gets pretty complicated, too bad. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html