On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote: > On 23/03/16 14:03, David Gibson wrote: > > On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote: > >> Uff, lost cc: list. Added back. Some comments below. > >> > >> > >> On 03/21/2016 04:19 PM, David Gibson wrote: > >>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote: > >>>> On March 15, 2016 17:29:26 David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote: > >>>> > >>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote: > >>>>>> On 03/10/2016 04:21 PM, David Gibson wrote: > >>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: > >>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote: > >>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: > >>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap > >>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU > >>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and > >>>>>>>>>> linked to IOMMU groups by the user space. > >>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need > >>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping. > >>>>>>>>>> > >>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter > >>>>>>>>>> is added which accepts: > >>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware > >>>>>>>>>> TCE table; > >>>>>>>>>> - a LIOBN to assign to the found table. > >>>>>>>>>> > >>>>>>>>>> Before notifying KVM about new link, this check the group for being > >>>>>>>>>> registered with KVM device in order to release them at unexpected KVM > >>>>>>>>>> finish. > >>>>>>>>>> > >>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > >>>>>>>>>> space. > >>>>>>>>>> > >>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it > >>>>>>>>>> link to a KVM module. > >>>>>>>>>> > >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > >>>>>>>>>> --- > >>>>>>>>>> Documentation/virtual/kvm/devices/vfio.txt | 21 +++++- > >>>>>>>>>> arch/powerpc/kvm/Kconfig | 1 + > >>>>>>>>>> arch/powerpc/kvm/Makefile | 5 +- > >>>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 + > >>>>>>>>>> include/uapi/linux/kvm.h | 9 +++ > >>>>>>>>>> virt/kvm/vfio.c | 106 > >>>>>> +++++++++++++++++++++++++++++ > >>>>>>>>>> 6 files changed, 140 insertions(+), 3 deletions(-) > >>>>>>>>>> > >>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt > >>>>>> b/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>> index ef51740..c0d3eb7 100644 > >>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt > >>>>>>>>>> @@ -16,7 +16,24 @@ Groups: > >>>>>>>>>> > >>>>>>>>>> KVM_DEV_VFIO_GROUP attributes: > >>>>>>>>>> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking > >>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>> + for the VFIO group. > >>>>>>>>> > >>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which > >>>>>>>>> case it might be clearer to put them in a separate patch. > >>>>>>>>> > >>>>>>>>>> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device > >>>>>> tracking > >>>>>>>>>> + kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>> + for the VFIO group. > >>>>>>>>>> > >>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor > >>>>>>>>>> -for the VFIO group. > >>>>>>>>>> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group > >>>>>>>>>> + kvm_device_attr.addr points to a struct: > >>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn { > >>>>>>>>>> + __u32 argsz; > >>>>>>>>>> + __s32 fd; > >>>>>>>>>> + __u32 liobn; > >>>>>>>>>> + __u8 pad[4]; > >>>>>>>>>> + __u64 start_addr; > >>>>>>>>>> + }; > >>>>>>>>>> + where > >>>>>>>>>> + @argsz is the size of kvm_vfio_spapr_tce_liobn; > >>>>>>>>>> + @fd is a file descriptor for a VFIO group; > >>>>>>>>>> + @liobn is a logical bus id to be associated with the group; > >>>>>>>>>> + @start_addr is a DMA window offset on the IO (PCI) bus > >>>>>>>>> > >>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call > >>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group? > >>>>>>>> > >>>>>>>> > >>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit > >>>>>>>> and 64bit windows, this is why @start_addr is there. > >>>>>>>> > >>>>>>>> > >>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig > >>>>>>>>>> index 1059846..dfa3488 100644 > >>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig > >>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig > >>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 > >>>>>>>>>> select KVM > >>>>>>>>>> select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE > >>>>>>>>>> select SPAPR_TCE_IOMMU if IOMMU_SUPPORT > >>>>>>>>>> + select KVM_VFIO if VFIO > >>>>>>>>>> ---help--- > >>>>>>>>>> Support running unmodified book3s_64 and book3s_32 guest kernels > >>>>>>>>>> in virtual machines on book3s_64 host processors. > >>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile > >>>>>>>>>> index 7f7b6d8..71f577c 100644 > >>>>>>>>>> --- a/arch/powerpc/kvm/Makefile > >>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile > >>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm > >>>>>>>>>> KVM := ../../../virt/kvm > >>>>>>>>>> > >>>>>>>>>> common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ > >>>>>>>>>> - $(KVM)/eventfd.o $(KVM)/vfio.o > >>>>>>>>>> + $(KVM)/eventfd.o > >>>>>>>>> > >>>>>>>>> Please don't disable the VFIO device for the non-book3s case. I added > >>>>>>>>> it (even though it didn't do anything until now) so that libvirt > >>>>>>>>> wouldn't choke when it finds it's not available. Obviously the new > >>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device > >>>>>>>>> itself should be available always. > >>>>>>>> > >>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. > >>>>>>>> > >>>>>>>> > >>>>>>>>>> CFLAGS_e500_mmu.o := -I. > >>>>>>>>>> CFLAGS_e500_mmu_host.o := -I. > >>>>>>>>>> @@ -87,6 +87,9 @@ endif > >>>>>>>>>> kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ > >>>>>>>>>> book3s_xics.o > >>>>>>>>>> > >>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ > >>>>>>>>>> + $(KVM)/vfio.o \ > >>>>>>>>>> + > >>>>>>>>>> kvm-book3s_64-module-objs += \ > >>>>>>>>>> $(KVM)/kvm_main.o \ > >>>>>>>>>> $(KVM)/eventfd.o \ > >>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > >>>>>>>>>> index 19aa59b..63f188d 100644 > >>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c > >>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c > >>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm > >>>>>> *kvm, long ext) > >>>>>>>>>> #ifdef CONFIG_PPC_BOOK3S_64 > >>>>>>>>>> case KVM_CAP_SPAPR_TCE: > >>>>>>>>>> case KVM_CAP_SPAPR_TCE_64: > >>>>>>>>>> + case KVM_CAP_SPAPR_TCE_VFIO: > >>>>>>>>>> case KVM_CAP_PPC_ALLOC_HTAB: > >>>>>>>>>> case KVM_CAP_PPC_RTAS: > >>>>>>>>>> case KVM_CAP_PPC_FIXUP_HCALL: > >>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > >>>>>>>>>> index 080ffbf..f1abbea 100644 > >>>>>>>>>> --- a/include/uapi/linux/kvm.h > >>>>>>>>>> +++ b/include/uapi/linux/kvm.h > >>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr { > >>>>>>>>>> #define KVM_DEV_VFIO_GROUP 1 > >>>>>>>>>> #define KVM_DEV_VFIO_GROUP_ADD 1 > >>>>>>>>>> #define KVM_DEV_VFIO_GROUP_DEL 2 > >>>>>>>>>> +#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN 3 > >>>>>>>>>> > >>>>>>>>>> enum kvm_device_type { > >>>>>>>>>> KVM_DEV_TYPE_FSL_MPIC_20 = 1, > >>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type { > >>>>>>>>>> KVM_DEV_TYPE_MAX, > >>>>>>>>>> }; > >>>>>>>>>> > >>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn { > >>>>>>>>>> + __u32 argsz; > >>>>>>>>>> + __s32 fd; > >>>>>>>>>> + __u32 liobn; > >>>>>>>>>> + __u8 pad[4]; > >>>>>>>>>> + __u64 start_addr; > >>>>>>>>>> +}; > >>>>>>>>>> + > >>>>>>>>>> /* > >>>>>>>>>> * ioctls for VM fds > >>>>>>>>>> */ > >>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c > >>>>>>>>>> index 1dd087d..87c771e 100644 > >>>>>>>>>> --- a/virt/kvm/vfio.c > >>>>>>>>>> +++ b/virt/kvm/vfio.c > >>>>>>>>>> @@ -20,6 +20,10 @@ > >>>>>>>>>> #include <linux/vfio.h> > >>>>>>>>>> #include "vfio.h" > >>>>>>>>>> > >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>> +#include <asm/kvm_ppc.h> > >>>>>>>>>> +#endif > >>>>>>>>>> + > >>>>>>>>>> struct kvm_vfio_group { > >>>>>>>>>> struct list_head node; > >>>>>>>>>> struct vfio_group *vfio_group; > >>>>>>>>>> @@ -60,6 +64,22 @@ static void > >>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group) > >>>>>>>>>> symbol_put(vfio_group_put_external_user); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) > >>>>>>>>>> +{ > >>>>>>>>>> + int (*fn)(struct vfio_group *); > >>>>>>>>>> + int ret = -1; > >>>>>>>>> > >>>>>>>>> Should this be -ESOMETHING? > >>>>>>>>> > >>>>>>>>>> + fn = symbol_get(vfio_external_user_iommu_id); > >>>>>>>>>> + if (!fn) > >>>>>>>>>> + return ret; > >>>>>>>>>> + > >>>>>>>>>> + ret = fn(vfio_group); > >>>>>>>>>> + > >>>>>>>>>> + symbol_put(vfio_external_user_iommu_id); > >>>>>>>>>> + > >>>>>>>>>> + return ret; > >>>>>>>>>> +} > >>>>>>>>>> + > >>>>>>>>>> static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group) > >>>>>>>>>> { > >>>>>>>>>> long (*fn)(struct vfio_group *, unsigned long); > >>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct > >>>>>> kvm_device *dev) > >>>>>>>>>> mutex_unlock(&kv->lock); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm, > >>>>>>>>>> + struct vfio_group *vfio_group) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Shouldn't this go in the same patch that introduced the attach > >>>>>>>>> function? > >>>>>>>> > >>>>>>>> Having less patches which touch different maintainers areas is better. I > >>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in > >>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE > >>>>>>>> table". > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> +{ > >>>>>>>>>> + int group_id; > >>>>>>>>>> + struct iommu_group *grp; > >>>>>>>>>> + > >>>>>>>>>> + group_id = kvm_vfio_external_user_iommu_id(vfio_group); > >>>>>>>>>> + grp = iommu_group_get_by_id(group_id); > >>>>>>>>>> + if (grp) { > >>>>>>>>>> + kvm_spapr_tce_detach_iommu_group(kvm, grp); > >>>>>>>>>> + iommu_group_put(grp); > >>>>>>>>>> + } > >>>>>>>>>> +} > >>>>>>>>>> +#endif > >>>>>>>>>> + > >>>>>>>>>> static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg) > >>>>>>>>>> { > >>>>>>>>>> struct kvm_vfio *kv = dev->private; > >>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device > >>>>>> *dev, long attr, u64 arg) > >>>>>>>>>> continue; > >>>>>>>>>> > >>>>>>>>>> list_del(&kvg->node); > >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>> > >>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the > >>>>>>>>> callsite. > >>>>>>>> > >>>>>>>> It is questionable. A x86 reader may decide that > >>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get > >>>>>>>> confused. > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> + kvm_vfio_spapr_detach_iommu_group(dev->kvm, > >>>>>>>>>> + kvg->vfio_group); > >>>>>>>>>> +#endif > >>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group); > >>>>>>>>>> kfree(kvg); > >>>>>>>>>> ret = 0; > >>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device > >>>>>> *dev, long attr, u64 arg) > >>>>>>>>>> kvm_vfio_update_coherency(dev); > >>>>>>>>>> > >>>>>>>>>> return ret; > >>>>>>>>>> + > >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU > >>>>>>>>>> + case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: { > >>>>>>>>>> + struct kvm_vfio_spapr_tce_liobn param; > >>>>>>>>>> + unsigned long minsz; > >>>>>>>>>> + struct kvm_vfio *kv = dev->private; > >>>>>>>>>> + struct vfio_group *vfio_group; > >>>>>>>>>> + struct kvm_vfio_group *kvg; > >>>>>>>>>> + struct fd f; > >>>>>>>>>> + > >>>>>>>>>> + minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn, > >>>>>>>>>> + start_addr); > >>>>>>>>>> + > >>>>>>>>>> + if (copy_from_user(¶m, (void __user *)arg, minsz)) > >>>>>>>>>> + return -EFAULT; > >>>>>>>>>> + > >>>>>>>>>> + if (param.argsz < minsz) > >>>>>>>>>> + return -EINVAL; > >>>>>>>>>> + > >>>>>>>>>> + f = fdget(param.fd); > >>>>>>>>>> + if (!f.file) > >>>>>>>>>> + return -EBADF; > >>>>>>>>>> + > >>>>>>>>>> + vfio_group = kvm_vfio_group_get_external_user(f.file); > >>>>>>>>>> + fdput(f); > >>>>>>>>>> + > >>>>>>>>>> + if (IS_ERR(vfio_group)) > >>>>>>>>>> + return PTR_ERR(vfio_group); > >>>>>>>>>> + > >>>>>>>>>> + ret = -ENOENT; > >>>>>>>>> > >>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU? It's > >>>>>>>>> possible a kernel could be built for a platform supporting multiple > >>>>>>>>> IOMMU types. > >>>>>>>> > >>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a > >>>>>>>> VFIO container property, not a group property and here (KVM) we only have > >>>>>>>> groups. > >>>>>>> > >>>>>>> Which, as mentioned previously, is broken. > >>>>>> > >>>>>> Which I am failing to follow you on this. > >>>>>> > >>>>>> What I am trying to achieve here is pretty much referencing a group so it > >>>>>> cannot be reused. Plus LIOBNs. > >>>>> > >>>>> "Plus LIOBNs" is not a trivial change. You are establishing a linkage > >>>> >from LIOBNs to groups. But that doesn't make sense; if mapping in one > >>>>> (guest) LIOBN affects a group it must affect all groups in the > >>>>> container. i.e. LIOBN->container is the natural mapping, *not* LIOBN > >>>>> to group. > >>>> > >>>> I can see your point but i don't see how to proceed now, I'm totally stuck. > >>>> Pass container fd and then implement new api to lock containers somehow and > >>> > >>> I'm not really understanding what the question is about locking containers. > >>> > >>>> enumerate groups when updating TCE table (including real mode)? > >>> > >>> Why do you need to enumerate groups? The groups within the container > >>> share a TCE table, so can't you just update that once? > >> > >> Well, they share a TCE table but they do not share TCE Kill (TCE cache > >> invalidate) register address, it is still per PE but this does not matter > >> here (pnv_pci_link_table_and_group() does that), just mentioned to complete > >> the picture. > > > > True, you'll need to enumerate the groups for invalidates. But you > > need that already, right. > > > >>>> Plus new API when we remove a group from a container as the result of guest > >>>> PCI hot unplug? > >>> > >>> I assume you mean a kernel internal API, since it shouldn't need > >>> anything else visible to userspace. Won't this happen naturally? > >>> When the group is removed from the container, it will get its own TCE > >>> table instead of the previously shared one. > >>> > >>>>>> Passing a container fd does not make much > >>>>>> sense here as the VFIO device would walk through groups, reference them and > >>>>>> that is it, there is no locking on VFIO containters and so far there was no > >>>>>> need to teach KVM about containers. > >>>>>> > >>>>>> What do I miss now? > >>>>> > >>>>> Referencing the groups is essentially just a useful side effect. The > >>>>> important functionality is informing VFIO of the guest LIOBNs; and > >>>>> LIOBNs map to containers, not groups. > >>>> > >>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There > >>>> can be one or many or none containers per liobn. > >>> > >>> Ah, true. > >> > >> So I need to add new kernel API for KVM to get table(s) from VFIO > >> container(s). And invent some locking mechanism to prevent table(s) (or > >> associated container(s)) from going away, like > >> vfio_group_get_external_user/vfio_group_put_external_user but for > >> containers. Correct? > > > > Well, the container is attached to an fd, so if you get a reference on > > the file* that should do it. > > I am still trying to think of how to implement this suggestion. > > I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right > interface as it knows nothing about KVM. There is VFIO-KVM device but it > does not have idea about containers. > > So I have to: > > Wenever a container is created or removed, notify the VFIO-KVM device by > passing there a container fd. ok. Actually, I don't think the vfio-kvm device is really useful here. It was designed as a hack for a particular problem on x86. It certainly could be extended to cover the information we need here, but I don't think it's a particularly natural way of doing so. The problem is that conveying the information from the vfio-kvm device to the real mode H_PUT_TCE handler, which is what really needs it, isn't particularly simpler than conveying that information from anywhere else. > Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to > what LIOBN so the realmode handlers could do the job. The real mode TCE > handlers get LIOBN, find a guest view table and update it. Now I want to > update the hardware table which is iommu_table attached to LIOBN. > > I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd. > > Now VFIO-KVM device needs to extract iommu_table's from the container. > These iommu_table pointers are stored in "struct tce_container" which is > local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I > cannot export and use that. > > The other way to go would be adding API to VFIO to enumerate IOMMU groups > in a container and use iommu_table pointers stored in iommu_table_group of > each group (in fact the very first group will be enough as multiple groups > in a container share the table). Adding vfio_container_get_groups() when > only first one is needed is quite tricky in terms of maintainers approvals. > > So what would be the right course of action? Thanks. So, from the user side, you need to be able to bind a VFIO backend to a particular guest IOMMU. This suggests a new ioctl() used in conjunction with KVM_CREATE_SPAPR_TCE. Let's call it KVM_SPAPR_TCE_BIND_VFIO. You'd use KVM_CREATE_SPAPR_TCE to make the kernel aware of a LIOBN in the first place, then use KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN. So it would be a VM ioctl, taking a LIOBN and a container fd. You'd need a capability to go with it, and some way to unbind as well. To implement that, the ioctl() would need to use a new vfio (kernel internal) interface - which can be specific to only the spapr TCE type. That would take the container fd, and return the list of iommu_tables in some form or other (or various error conditions, obviously). So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform the kernel of the LIOBN. When the VFIO device is attached to the PHB, it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the LIOBN. The ioctl() implementation uses the new special interface into the spapr_tce vfio backend to get the list of iommu tables, which it stores in some private format. The H_PUT_TCE implementation uses that stored list of iommu tables to translate H_PUT_TCEs from the guest into actions on the host IOMMU tables. And, yes, the special interface to the spapr TCE vfio back end is kind of a hack. That's what you get when you need to link to separate kernel subsystems for performance reasons. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Attachment:
signature.asc
Description: PGP signature