On 10/06/16 16:50, David Gibson wrote: > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote: >> On 23/03/16 14:03, David Gibson wrote: >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote: >>>> Uff, lost cc: list. Added back. Some comments below. >>>> >>>> >>>> On 03/21/2016 04:19 PM, David Gibson wrote: >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote: >>>>>> On March 15, 2016 17:29:26 David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote: >>>>>> >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote: >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote: >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and >>>>>>>>>>>> linked to IOMMU groups by the user space. >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping. >>>>>>>>>>>> >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter >>>>>>>>>>>> is added which accepts: >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware >>>>>>>>>>>> TCE table; >>>>>>>>>>>> - a LIOBN to assign to the found table. >>>>>>>>>>>> >>>>>>>>>>>> Before notifying KVM about new link, this check the group for being >>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM >>>>>>>>>>>> finish. >>>>>>>>>>>> >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user >>>>>>>>>>>> space. >>>>>>>>>>>> >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it >>>>>>>>>>>> link to a KVM module. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>>>>>>>>>> --- >>>>>>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- >>>>>>>>>>>> arch/powerpc/kvm/Kconfig | 1 + >>>>>>>>>>>> arch/powerpc/kvm/Makefile | 5 +- >>>>>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + >>>>>>>>>>>> include/uapi/linux/kvm.h | 9 +++ >>>>>>>>>>>> virt/kvm/vfio.c | 106 >>>>>>>> +++++++++++++++++++++++++++++ >>>>>>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> index ef51740..c0d3eb7 100644 >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups: >>>>>>>>>>>> >>>>>>>>>>>> KVM_DEV_VFIO_GROUP attributes: >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>>>> + for the VFIO group. >>>>>>>>>>> >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which >>>>>>>>>>> case it might be clearer to put them in a separate patch. >>>>>>>>>>> >>>>>>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device >>>>>>>> tracking >>>>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>>>> + for the VFIO group. >>>>>>>>>>>> >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor >>>>>>>>>>>> -for the VFIO group. >>>>>>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group >>>>>>>>>>>> + kvm_device_attr.addr points to a struct: >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>>>> + __u32 argsz; >>>>>>>>>>>> + __s32 fd; >>>>>>>>>>>> + __u32 liobn; >>>>>>>>>>>> + __u8 pad[4]; >>>>>>>>>>>> + __u64 start_addr; >>>>>>>>>>>> + }; >>>>>>>>>>>> + where >>>>>>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; >>>>>>>>>>>> + @fd is a file descriptor for a VFIO group; >>>>>>>>>>>> + @liobn is a logical bus id to be associated with the group; >>>>>>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus >>>>>>>>>>> >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit >>>>>>>>>> and 64bit windows, this is why @start_addr is there. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig >>>>>>>>>>>> index 1059846..dfa3488 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 >>>>>>>>>>>> select KVM >>>>>>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE >>>>>>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT >>>>>>>>>>>> + select KVM_VFIO if VFIO >>>>>>>>>>>> ---help--- >>>>>>>>>>>> Support running unmodified book3s_64 and book3s_32 guest kernels >>>>>>>>>>>> in virtual machines on book3s_64 host processors. >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile >>>>>>>>>>>> index 7f7b6d8..71f577c 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm >>>>>>>>>>>> KVM := ../../../virt/kvm >>>>>>>>>>>> >>>>>>>>>>>> common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ >>>>>>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o >>>>>>>>>>>> + $(KVM)/eventfd.o >>>>>>>>>>> >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case. I added >>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt >>>>>>>>>>> wouldn't choke when it finds it's not available. Obviously the new >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device >>>>>>>>>>> itself should be available always. >>>>>>>>>> >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> CFLAGS_e500_mmu.o := -I. >>>>>>>>>>>> CFLAGS_e500_mmu_host.o := -I. >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif >>>>>>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ >>>>>>>>>>>> book3s_xics.o >>>>>>>>>>>> >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ >>>>>>>>>>>> + $(KVM)/vfio.o \ >>>>>>>>>>>> + >>>>>>>>>>>> kvm-book3s_64-module-objs += \ >>>>>>>>>>>> $(KVM)/kvm_main.o \ >>>>>>>>>>>> $(KVM)/eventfd.o \ >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>>>>>>>>>> index 19aa59b..63f188d 100644 >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm >>>>>>>> *kvm, long ext) >>>>>>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE: >>>>>>>>>>>> case KVM_CAP_SPAPR_TCE_64: >>>>>>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: >>>>>>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: >>>>>>>>>>>> case KVM_CAP_PPC_RTAS: >>>>>>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h >>>>>>>>>>>> index 080ffbf..f1abbea 100644 >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP 1 >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 >>>>>>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 >>>>>>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 >>>>>>>>>>>> >>>>>>>>>>>> enum kvm_device_type { >>>>>>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { >>>>>>>>>>>> KVM_DEV_TYPE_MAX, >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn { >>>>>>>>>>>> + __u32 argsz; >>>>>>>>>>>> + __s32 fd; >>>>>>>>>>>> + __u32 liobn; >>>>>>>>>>>> + __u8 pad[4]; >>>>>>>>>>>> + __u64 start_addr; >>>>>>>>>>>> +}; >>>>>>>>>>>> + >>>>>>>>>>>> /* >>>>>>>>>>>> * ioctls for VM fds >>>>>>>>>>>> */ >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c >>>>>>>>>>>> index 1dd087d..87c771e 100644 >>>>>>>>>>>> --- a/virt/kvm/vfio.c >>>>>>>>>>>> +++ b/virt/kvm/vfio.c >>>>>>>>>>>> @@ -20,6 +20,10 @@ >>>>>>>>>>>> #include <linux/vfio.h> >>>>>>>>>>>> #include "vfio.h" >>>>>>>>>>>> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> +#include <asm/kvm_ppc.h> >>>>>>>>>>>> +#endif >>>>>>>>>>>> + >>>>>>>>>>>> struct kvm_vfio_group { >>>>>>>>>>>> struct list_head node; >>>>>>>>>>>> struct vfio_group *vfio_group; >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void >>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) >>>>>>>>>>>> symbol_put(vfio_group_put_external_user); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) >>>>>>>>>>>> +{ >>>>>>>>>>>> + int (*fn)(struct vfio_group *); >>>>>>>>>>>> + int ret = -1; >>>>>>>>>>> >>>>>>>>>>> Should this be -ESOMETHING? >>>>>>>>>>> >>>>>>>>>>>> + fn = symbol_get(vfio_external_user_iommu_id); >>>>>>>>>>>> + if (!fn) >>>>>>>>>>>> + return ret; >>>>>>>>>>>> + >>>>>>>>>>>> + ret = fn(vfio_group); >>>>>>>>>>>> + >>>>>>>>>>>> + symbol_put(vfio_external_user_iommu_id); >>>>>>>>>>>> + >>>>>>>>>>>> + return ret; >>>>>>>>>>>> +} >>>>>>>>>>>> + >>>>>>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) >>>>>>>>>>>> { >>>>>>>>>>>> long (*fn)(struct vfio_group *, unsigned long); >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct >>>>>>>> kvm_device *dev) >>>>>>>>>>>> mutex_unlock(&kv->lock); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, >>>>>>>>>>>> + struct vfio_group *vfio_group) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach >>>>>>>>>>> function? >>>>>>>>>> >>>>>>>>>> Having less patches which touch different maintainers areas is better. I >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE >>>>>>>>>> table". >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> +{ >>>>>>>>>>>> + int group_id; >>>>>>>>>>>> + struct iommu_group *grp; >>>>>>>>>>>> + >>>>>>>>>>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); >>>>>>>>>>>> + grp = iommu_group_get_by_id(group_id); >>>>>>>>>>>> + if (grp) { >>>>>>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); >>>>>>>>>>>> + iommu_group_put(grp); >>>>>>>>>>>> + } >>>>>>>>>>>> +} >>>>>>>>>>>> +#endif >>>>>>>>>>>> + >>>>>>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg) >>>>>>>>>>>> { >>>>>>>>>>>> struct kvm_vfio *kv = dev->private; >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device >>>>>>>> *dev, long attr, u64 arg) >>>>>>>>>>>> continue; >>>>>>>>>>>> >>>>>>>>>>>> list_del(&kvg->node); >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>> >>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the >>>>>>>>>>> callsite. >>>>>>>>>> >>>>>>>>>> It is questionable. A x86 reader may decide that >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get >>>>>>>>>> confused. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, >>>>>>>>>>>> + kvg->vfio_group); >>>>>>>>>>>> +#endif >>>>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); >>>>>>>>>>>> kfree(kvg); >>>>>>>>>>>> ret = 0; >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device >>>>>>>> *dev, long attr, u64 arg) >>>>>>>>>>>> kvm_vfio_update_coherency(dev); >>>>>>>>>>>> >>>>>>>>>>>> return ret; >>>>>>>>>>>> + >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU >>>>>>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { >>>>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; >>>>>>>>>>>> + unsigned long minsz; >>>>>>>>>>>> + struct kvm_vfio *kv = dev->private; >>>>>>>>>>>> + struct vfio_group *vfio_group; >>>>>>>>>>>> + struct kvm_vfio_group *kvg; >>>>>>>>>>>> + struct fd f; >>>>>>>>>>>> + >>>>>>>>>>>> + minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn, >>>>>>>>>>>> + start_addr); >>>>>>>>>>>> + >>>>>>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) >>>>>>>>>>>> + return -EFAULT; >>>>>>>>>>>> + >>>>>>>>>>>> + if (param.argsz < minsz) >>>>>>>>>>>> + return -EINVAL; >>>>>>>>>>>> + >>>>>>>>>>>> + f = fdget(param.fd); >>>>>>>>>>>> + if (!f.file) >>>>>>>>>>>> + return -EBADF; >>>>>>>>>>>> + >>>>>>>>>>>> + vfio_group = kvm_vfio_group_get_external_user(f.file); >>>>>>>>>>>> + fdput(f); >>>>>>>>>>>> + >>>>>>>>>>>> + if (IS_ERR(vfio_group)) >>>>>>>>>>>> + return PTR_ERR(vfio_group); >>>>>>>>>>>> + >>>>>>>>>>>> + ret = -ENOENT; >>>>>>>>>>> >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU? It's >>>>>>>>>>> possible a kernel could be built for a platform supporting multiple >>>>>>>>>>> IOMMU types. >>>>>>>>>> >>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a >>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have >>>>>>>>>> groups. >>>>>>>>> >>>>>>>>> Which, as mentioned previously, is broken. >>>>>>>> >>>>>>>> Which I am failing to follow you on this. >>>>>>>> >>>>>>>> What I am trying to achieve here is pretty much referencing a group so it >>>>>>>> cannot be reused. Plus LIOBNs. >>>>>>> >>>>>>> "Plus LIOBNs" is not a trivial change. You are establishing a linkage >>>>>> >from LIOBNs to groups. But that doesn't make sense; if mapping in one >>>>>>> (guest) LIOBN affects a group it must affect all groups in the >>>>>>> container. i.e. LIOBN->container is the natural mapping, *not* LIOBN >>>>>>> to group. >>>>>> >>>>>> I can see your point but i don't see how to proceed now, I'm totally stuck. >>>>>> Pass container fd and then implement new api to lock containers somehow and >>>>> >>>>> I'm not really understanding what the question is about locking containers. >>>>> >>>>>> enumerate groups when updating TCE table (including real mode)? >>>>> >>>>> Why do you need to enumerate groups? The groups within the container >>>>> share a TCE table, so can't you just update that once? >>>> >>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache >>>> invalidate) register address, it is still per PE but this does not matter >>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete >>>> the picture. >>> >>> True, you'll need to enumerate the groups for invalidates. But you >>> need that already, right. >>> >>>>>> Plus new API when we remove a group from a container as the result of guest >>>>>> PCI hot unplug? >>>>> >>>>> I assume you mean a kernel internal API, since it shouldn't need >>>>> anything else visible to userspace. Won't this happen naturally? >>>>> When the group is removed from the container, it will get its own TCE >>>>> table instead of the previously shared one. >>>>> >>>>>>>> Passing a container fd does not make much >>>>>>>> sense here as the VFIO device would walk through groups, reference them and >>>>>>>> that is it, there is no locking on VFIO containters and so far there was no >>>>>>>> need to teach KVM about containers. >>>>>>>> >>>>>>>> What do I miss now? >>>>>>> >>>>>>> Referencing the groups is essentially just a useful side effect. The >>>>>>> important functionality is informing VFIO of the guest LIOBNs; and >>>>>>> LIOBNs map to containers, not groups. >>>>>> >>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There >>>>>> can be one or many or none containers per liobn. >>>>> >>>>> Ah, true. >>>> >>>> So I need to add new kernel API for KVM to get table(s) from VFIO >>>> container(s). And invent some locking mechanism to prevent table(s) (or >>>> associated container(s)) from going away, like >>>> vfio_group_get_external_user/vfio_group_put_external_user but for >>>> containers. Correct? >>> >>> Well, the container is attached to an fd, so if you get a reference on >>> the file* that should do it. >> >> I am still trying to think of how to implement this suggestion. >> >> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right >> interface as it knows nothing about KVM. There is VFIO-KVM device but it >> does not have idea about containers. >> >> So I have to: >> >> Wenever a container is created or removed, notify the VFIO-KVM device by >> passing there a container fd. ok. > > Actually, I don't think the vfio-kvm device is really useful here. It > was designed as a hack for a particular problem on x86. It certainly > could be extended to cover the information we need here, but I don't > think it's a particularly natural way of doing so. > > The problem is that conveying the information from the vfio-kvm device > to the real mode H_PUT_TCE handler, which is what really needs it, > isn't particularly simpler than conveying that information from > anywhere else. > >> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to >> what LIOBN so the realmode handlers could do the job. The real mode TCE >> handlers get LIOBN, find a guest view table and update it. Now I want to >> update the hardware table which is iommu_table attached to LIOBN. >> >> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd. >> >> Now VFIO-KVM device needs to extract iommu_table's from the container. >> These iommu_table pointers are stored in "struct tce_container" which is >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I >> cannot export and use that. >> >> The other way to go would be adding API to VFIO to enumerate IOMMU groups >> in a container and use iommu_table pointers stored in iommu_table_group of >> each group (in fact the very first group will be enough as multiple groups >> in a container share the table). Adding vfio_container_get_groups() when >> only first one is needed is quite tricky in terms of maintainers approvals. >> >> So what would be the right course of action? Thanks. > > So, from the user side, you need to be able to bind a VFIO backend to > a particular guest IOMMU. This suggests a new ioctl() used in > conjunction with KVM_CREATE_SPAPR_TCE. Let's call it > KVM_SPAPR_TCE_BIND_VFIO. You'd use KVM_CREATE_SPAPR_TCE to make the > kernel aware of a LIOBN in the first place, then use > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN. > So it would be a VM ioctl, taking a LIOBN and a container fd. You'd > need a capability to go with it, and some way to unbind as well. This is what I had in the first place some years ago. And after 5-6 reviews I was told that there is a VFIO KVM and I should use it. > To implement that, the ioctl() would need to use a new vfio (kernel > internal) interface - which can be specific to only the spapr TCE > type. That would take the container fd, and return the list of > iommu_tables in some form or other (or various error conditions, > obviously). > > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform > the kernel of the LIOBN. When the VFIO device is attached to the PHB, > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the > LIOBN. The ioctl() implementation uses the new special interface into > the spapr_tce vfio backend to get the list of iommu tables, which it > stores in some private format. Getting just a list of IOMMU groups would do too. Pushing such API is a problem, this is how I ended up with the current design. > The H_PUT_TCE implementation uses that > stored list of iommu tables to translate H_PUT_TCEs from the guest > into actions on the host IOMMU tables. > > And, yes, the special interface to the spapr TCE vfio back end is kind > of a hack. That's what you get when you need to link to separate > kernel subsystems for performance reasons. One can argue if it is a hack, how is this hack better that the existing approach? :) Alex, could you please comment on David's suggestion? Thanks! -- Alexey
Attachment:
signature.asc
Description: OpenPGP digital signature