On 23/03/16 14:03, David Gibson wrote: > On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote: >> Uff, lost cc: list. Added back. Some comments below. >> >> >> On 03/21/2016 04:19 PM, David Gibson wrote: >>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote: >>>> On March 15, 2016 17:29:26 David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote: >>>> >>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote: >>>>>> On 03/10/2016 04:21 PM, David Gibson wrote: >>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: >>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: >>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: >>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap >>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU >>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and >>>>>>>>>> linked to IOMMU groups by the user space. >>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need >>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping. >>>>>>>>>> >>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter >>>>>>>>>> is added which accepts: >>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware >>>>>>>>>> TCE table; >>>>>>>>>> - a LIOBN to assign to the found table. >>>>>>>>>> >>>>>>>>>> Before notifying KVM about new link, this check the group for being >>>>>>>>>> registered with KVM device in order to release them at unexpected KVM >>>>>>>>>> finish. >>>>>>>>>> >>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user >>>>>>>>>> space. >>>>>>>>>> >>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it >>>>>>>>>> link to a KVM module. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>>>>>>>> --- >>>>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- >>>>>>>>>> arch/powerpc/kvm/Kconfig | 1 + >>>>>>>>>> arch/powerpc/kvm/Makefile | 5 +- >>>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + >>>>>>>>>> include/uapi/linux/kvm.h | 9 +++ >>>>>>>>>> virt/kvm/vfio.c | 106 >>>>>> +++++++++++++++++++++++++++++ >>>>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt >>>>>> b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>> index ef51740..c0d3eb7 100644 >>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>> @@ -16,7 +16,24 @@ Groups: >>>>>>>>>> >>>>>>>>>> KVM_DEV_VFIO_GROUP attributes: >>>>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking >>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>> + for the VFIO group. >>>>>>>>> >>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which >>>>>>>>> case it might be clearer to put them in a separate patch. >>>>>>>>> >>>>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device >>>>>> tracking >>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>> + for the VFIO group. >>>>>>>>>> >>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>> -for the VFIO group. >>>>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group >>>>>>>>>> + kvm_device_attr.addr points to a struct: >>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>> + __u32 argsz; >>>>>>>>>> + __s32 fd; >>>>>>>>>> + __u32 liobn; >>>>>>>>>> + __u8 pad[4]; >>>>>>>>>> + __u64 start_addr; >>>>>>>>>> + }; >>>>>>>>>> + where >>>>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; >>>>>>>>>> + @fd is a file descriptor for a VFIO group; >>>>>>>>>> + @liobn is a logical bus id to be associated with the group; >>>>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus >>>>>>>>> >>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call >>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group? >>>>>>>> >>>>>>>> >>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit >>>>>>>> and 64bit windows, this is why @start_addr is there. >>>>>>>> >>>>>>>> >>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig >>>>>>>>>> index 1059846..dfa3488 100644 >>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig >>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig >>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 >>>>>>>>>> select KVM >>>>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE >>>>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT >>>>>>>>>> + select KVM_VFIO if VFIO >>>>>>>>>> ---help--- >>>>>>>>>> Support running unmodified book3s_64 and book3s_32 guest kernels >>>>>>>>>> in virtual machines on book3s_64 host processors. >>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile >>>>>>>>>> index 7f7b6d8..71f577c 100644 >>>>>>>>>> --- a/arch/powerpc/kvm/Makefile >>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile >>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm >>>>>>>>>> KVM := ../../../virt/kvm >>>>>>>>>> >>>>>>>>>> common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ >>>>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o >>>>>>>>>> + $(KVM)/eventfd.o >>>>>>>>> >>>>>>>>> Please don't disable the VFIO device for the non-book3s case. I added >>>>>>>>> it (even though it didn't do anything until now) so that libvirt >>>>>>>>> wouldn't choke when it finds it's not available. Obviously the new >>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device >>>>>>>>> itself should be available always. >>>>>>>> >>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. >>>>>>>> >>>>>>>> >>>>>>>>>> CFLAGS_e500_mmu.o := -I. >>>>>>>>>> CFLAGS_e500_mmu_host.o := -I. >>>>>>>>>> @@ -87,6 +87,9 @@ endif >>>>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ >>>>>>>>>> book3s_xics.o >>>>>>>>>> >>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ >>>>>>>>>> + $(KVM)/vfio.o \ >>>>>>>>>> + >>>>>>>>>> kvm-book3s_64-module-objs += \ >>>>>>>>>> $(KVM)/kvm_main.o \ >>>>>>>>>> $(KVM)/eventfd.o \ >>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>>>>>>>> index 19aa59b..63f188d 100644 >>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm >>>>>> *kvm, long ext) >>>>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 >>>>>>>>>> case KVM_CAP_SPAPR_TCE: >>>>>>>>>> case KVM_CAP_SPAPR_TCE_64: >>>>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: >>>>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: >>>>>>>>>> case KVM_CAP_PPC_RTAS: >>>>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: >>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h >>>>>>>>>> index 080ffbf..f1abbea 100644 >>>>>>>>>> --- a/include/uapi/linux/kvm.h >>>>>>>>>> +++ b/include/uapi/linux/kvm.h >>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { >>>>>>>>>> #define KVM_DEV_VFIO_GROUP 1 >>>>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 >>>>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 >>>>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 >>>>>>>>>> >>>>>>>>>> enum kvm_device_type { >>>>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, >>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { >>>>>>>>>> KVM_DEV_TYPE_MAX, >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>> + __u32 argsz; >>>>>>>>>> + __s32 fd; >>>>>>>>>> + __u32 liobn; >>>>>>>>>> + __u8 pad[4]; >>>>>>>>>> + __u64 start_addr; >>>>>>>>>> +}; >>>>>>>>>> + >>>>>>>>>> /* >>>>>>>>>> * ioctls for VM fds >>>>>>>>>> */ >>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c >>>>>>>>>> index 1dd087d..87c771e 100644 >>>>>>>>>> --- a/virt/kvm/vfio.c >>>>>>>>>> +++ b/virt/kvm/vfio.c >>>>>>>>>> @@ -20,6 +20,10 @@ >>>>>>>>>> #include <linux/vfio.h> >>>>>>>>>> #include "vfio.h" >>>>>>>>>> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>> +#include <asm/kvm_ppc.h> >>>>>>>>>> +#endif >>>>>>>>>> + >>>>>>>>>> struct kvm_vfio_group { >>>>>>>>>> struct list_head node; >>>>>>>>>> struct vfio_group *vfio_group; >>>>>>>>>> @@ -60,6 +64,22 @@ static void >>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) >>>>>>>>>> symbol_put(vfio_group_put_external_user); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) >>>>>>>>>> +{ >>>>>>>>>> + int (*fn)(struct vfio_group *); >>>>>>>>>> + int ret = -1; >>>>>>>>> >>>>>>>>> Should this be -ESOMETHING? >>>>>>>>> >>>>>>>>>> + fn = symbol_get(vfio_external_user_iommu_id); >>>>>>>>>> + if (!fn) >>>>>>>>>> + return ret; >>>>>>>>>> + >>>>>>>>>> + ret = fn(vfio_group); >>>>>>>>>> + >>>>>>>>>> + symbol_put(vfio_external_user_iommu_id); >>>>>>>>>> + >>>>>>>>>> + return ret; >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) >>>>>>>>>> { >>>>>>>>>> long (*fn)(struct vfio_group *, unsigned long); >>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct >>>>>> kvm_device *dev) >>>>>>>>>> mutex_unlock(&kv->lock); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, >>>>>>>>>> + struct vfio_group *vfio_group) >>>>>>>>> >>>>>>>>> >>>>>>>>> Shouldn't this go in the same patch that introduced the attach >>>>>>>>> function? >>>>>>>> >>>>>>>> Having less patches which touch different maintainers areas is better. I >>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in >>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE >>>>>>>> table". >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> +{ >>>>>>>>>> + int group_id; >>>>>>>>>> + struct iommu_group *grp; >>>>>>>>>> + >>>>>>>>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); >>>>>>>>>> + grp = iommu_group_get_by_id(group_id); >>>>>>>>>> + if (grp) { >>>>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); >>>>>>>>>> + iommu_group_put(grp); >>>>>>>>>> + } >>>>>>>>>> +} >>>>>>>>>> +#endif >>>>>>>>>> + >>>>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg) >>>>>>>>>> { >>>>>>>>>> struct kvm_vfio *kv = dev->private; >>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device >>>>>> *dev, long attr, u64 arg) >>>>>>>>>> continue; >>>>>>>>>> >>>>>>>>>> list_del(&kvg->node); >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>> >>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the >>>>>>>>> callsite. >>>>>>>> >>>>>>>> It is questionable. A x86 reader may decide that >>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get >>>>>>>> confused. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, >>>>>>>>>> + kvg->vfio_group); >>>>>>>>>> +#endif >>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); >>>>>>>>>> kfree(kvg); >>>>>>>>>> ret = 0; >>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device >>>>>> *dev, long attr, u64 arg) >>>>>>>>>> kvm_vfio_update_coherency(dev); >>>>>>>>>> >>>>>>>>>> return ret; >>>>>>>>>> + >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { >>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; >>>>>>>>>> + unsigned long minsz; >>>>>>>>>> + struct kvm_vfio *kv = dev->private; >>>>>>>>>> + struct vfio_group *vfio_group; >>>>>>>>>> + struct kvm_vfio_group *kvg; >>>>>>>>>> + struct fd f; >>>>>>>>>> + >>>>>>>>>> + minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn, >>>>>>>>>> + start_addr); >>>>>>>>>> + >>>>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) >>>>>>>>>> + return -EFAULT; >>>>>>>>>> + >>>>>>>>>> + if (param.argsz < minsz) >>>>>>>>>> + return -EINVAL; >>>>>>>>>> + >>>>>>>>>> + f = fdget(param.fd); >>>>>>>>>> + if (!f.file) >>>>>>>>>> + return -EBADF; >>>>>>>>>> + >>>>>>>>>> + vfio_group = kvm_vfio_group_get_external_user(f.file); >>>>>>>>>> + fdput(f); >>>>>>>>>> + >>>>>>>>>> + if (IS_ERR(vfio_group)) >>>>>>>>>> + return PTR_ERR(vfio_group); >>>>>>>>>> + >>>>>>>>>> + ret = -ENOENT; >>>>>>>>> >>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU? It's >>>>>>>>> possible a kernel could be built for a platform supporting multiple >>>>>>>>> IOMMU types. >>>>>>>> >>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a >>>>>>>> VFIO container property, not a group property and here (KVM) we only have >>>>>>>> groups. >>>>>>> >>>>>>> Which, as mentioned previously, is broken. >>>>>> >>>>>> Which I am failing to follow you on this. >>>>>> >>>>>> What I am trying to achieve here is pretty much referencing a group so it >>>>>> cannot be reused. Plus LIOBNs. >>>>> >>>>> "Plus LIOBNs" is not a trivial change. You are establishing a linkage >>>> >from LIOBNs to groups. But that doesn't make sense; if mapping in one >>>>> (guest) LIOBN affects a group it must affect all groups in the >>>>> container. i.e. LIOBN->container is the natural mapping, *not* LIOBN >>>>> to group. >>>> >>>> I can see your point but i don't see how to proceed now, I'm totally stuck. >>>> Pass container fd and then implement new api to lock containers somehow and >>> >>> I'm not really understanding what the question is about locking containers. >>> >>>> enumerate groups when updating TCE table (including real mode)? >>> >>> Why do you need to enumerate groups? The groups within the container >>> share a TCE table, so can't you just update that once? >> >> Well, they share a TCE table but they do not share TCE Kill (TCE cache >> invalidate) register address, it is still per PE but this does not matter >> here (pnv_pci_link_table_and_group() does that), just mentioned to complete >> the picture. > > True, you'll need to enumerate the groups for invalidates. But you > need that already, right. > >>>> Plus new API when we remove a group from a container as the result of guest >>>> PCI hot unplug? >>> >>> I assume you mean a kernel internal API, since it shouldn't need >>> anything else visible to userspace. Won't this happen naturally? >>> When the group is removed from the container, it will get its own TCE >>> table instead of the previously shared one. >>> >>>>>> Passing a container fd does not make much >>>>>> sense here as the VFIO device would walk through groups, reference them and >>>>>> that is it, there is no locking on VFIO containters and so far there was no >>>>>> need to teach KVM about containers. >>>>>> >>>>>> What do I miss now? >>>>> >>>>> Referencing the groups is essentially just a useful side effect. The >>>>> important functionality is informing VFIO of the guest LIOBNs; and >>>>> LIOBNs map to containers, not groups. >>>> >>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There >>>> can be one or many or none containers per liobn. >>> >>> Ah, true. >> >> So I need to add new kernel API for KVM to get table(s) from VFIO >> container(s). And invent some locking mechanism to prevent table(s) (or >> associated container(s)) from going away, like >> vfio_group_get_external_user/vfio_group_put_external_user but for >> containers. Correct? > > Well, the container is attached to an fd, so if you get a reference on > the file* that should do it. I am still trying to think of how to implement this suggestion. I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right interface as it knows nothing about KVM. There is VFIO-KVM device but it does not have idea about containers. So I have to: Wenever a container is created or removed, notify the VFIO-KVM device by passing there a container fd. ok. Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to what LIOBN so the realmode handlers could do the job. The real mode TCE handlers get LIOBN, find a guest view table and update it. Now I want to update the hardware table which is iommu_table attached to LIOBN. I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd. Now VFIO-KVM device needs to extract iommu_table's from the container. These iommu_table pointers are stored in "struct tce_container" which is local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I cannot export and use that. The other way to go would be adding API to VFIO to enumerate IOMMU groups in a container and use iommu_table pointers stored in iommu_table_group of each group (in fact the very first group will be enough as multiple groups in a container share the table). Adding vfio_container_get_groups() when only first one is needed is quite tricky in terms of maintainers approvals. So what would be the right course of action? Thanks. -- Alexey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html