Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE

Alexey Kardashevskiy <aik@xxxxxxxxx> · Tue, 14 Jun 2016 13:30:53 +1000

On 10/06/16 16:50, David Gibson wrote:
> On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
>> On 23/03/16 14:03, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>>>> Uff, lost cc: list. Added back. Some comments below.
>>>>
>>>>
>>>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On March 15, 2016 17:29:26 David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>>>> is added which accepts:
>>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>>>> TCE table;
>>>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>>>
>>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>>>>>> finish.
>>>>>>>>>>>>
>>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>>>> space.
>>>>>>>>>>>>
>>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>>>> link to a KVM module.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>>>
>>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>
>>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>>>
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>>>> tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>>
>>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> -for the VFIO group.
>>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>>>> +			__s32	fd;
>>>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>>>> +		};
>>>>>>>>>>>> +		where
>>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>>>
>>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>>>  	select KVM
>>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>>>  	---help---
>>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>>>
>>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>>>
>>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>>>> itself should be available always.
>>>>>>>>>>
>>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>>>
>>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>>>> +
>>>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>>>> *kvm, long ext)
>>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>>>
>>>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>>>  };
>>>>>>>>>>>>
>>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>>>> +	__s32	fd;
>>>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +
>>>>>>>>>>>>  /*
>>>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>>>   */
>>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>>>> +	int ret = -1;
>>>>>>>>>>>
>>>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>>>
>>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>>>> +	if (!fn)
>>>>>>>>>>>> +		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	return ret;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>>>> kvm_device *dev)
>>>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>>>> function?
>>>>>>>>>>
>>>>>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>>>> table".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int group_id;
>>>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>>>> +	if (grp) {
>>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>>>> +	}
>>>>>>>>>>>> +}
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  				continue;
>>>>>>>>>>>>
>>>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>
>>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>>>> callsite.
>>>>>>>>>>
>>>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>>>> confused.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>>>  			ret = 0;
>>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>>>
>>>>>>>>>>>>  		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>>>> +		struct fd f;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>>>> +				start_addr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>>>> +		if (!f.file)
>>>>>>>>>>>> +			return -EBADF;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>>>> +		fdput(f);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>>>> IOMMU types.
>>>>>>>>>>
>>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>>>> groups.
>>>>>>>>>
>>>>>>>>> Which, as mentioned previously, is broken.
>>>>>>>>
>>>>>>>> Which I am failing to follow you on this.
>>>>>>>>
>>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>>>> cannot be reused. Plus LIOBNs.
>>>>>>>
>>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>>>> to group.
>>>>>>
>>>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>>>>>> Pass container fd and then implement new api to lock containers somehow and
>>>>>
>>>>> I'm not really understanding what the question is about locking containers.
>>>>>
>>>>>> enumerate groups when updating TCE table (including real mode)?
>>>>>
>>>>> Why do you need to enumerate groups?  The groups within the container
>>>>> share a TCE table, so can't you just update that once?
>>>>
>>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>>>> invalidate) register address, it is still per PE but this does not matter
>>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>>>> the picture.
>>>
>>> True, you'll need to enumerate the groups for invalidates.  But you
>>> need that already, right.
>>>
>>>>>> Plus new API when we remove a group from a container as the result of guest
>>>>>> PCI hot unplug?
>>>>>
>>>>> I assume you mean a kernel internal API, since it shouldn't need
>>>>> anything else visible to userspace.  Won't this happen naturally?
>>>>> When the group is removed from the container, it will get its own TCE
>>>>> table instead of the previously shared one.
>>>>>
>>>>>>>> Passing a container fd does not make much
>>>>>>>> sense here as the VFIO device would walk through groups, reference them and
>>>>>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>>>>>> need to teach KVM about containers.
>>>>>>>>
>>>>>>>> What do I miss now?
>>>>>>>
>>>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>>>> LIOBNs map to containers, not groups.
>>>>>>
>>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>>>> can be one or many or none containers per liobn.
>>>>>
>>>>> Ah, true.
>>>>
>>>> So I need to add new kernel API for KVM to get table(s) from VFIO
>>>> container(s). And invent some locking mechanism to prevent table(s) (or
>>>> associated container(s)) from going away, like
>>>> vfio_group_get_external_user/vfio_group_put_external_user but for
>>>> containers. Correct?
>>>
>>> Well, the container is attached to an fd, so if you get a reference on
>>> the file* that should do it.
>>
>> I am still trying to think of how to implement this suggestion.
>>
>> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
>> interface as it knows nothing about KVM. There is VFIO-KVM device but it
>> does not have idea about containers.
>>
>> So I have to:
>>
>> Wenever a container is created or removed, notify the VFIO-KVM device by
>> passing there a container fd. ok.
> 
> Actually, I don't think the vfio-kvm device is really useful here.  It
> was designed as a hack for a particular problem on x86.  It certainly
> could be extended to cover the information we need here, but I don't
> think it's a particularly natural way of doing so.
> 
> The problem is that conveying the information from the vfio-kvm device
> to the real mode H_PUT_TCE handler, which is what really needs it,
> isn't particularly simpler than conveying that information from
> anywhere else.
> 
>> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
>> what LIOBN so the realmode handlers could do the job. The real mode TCE
>> handlers get LIOBN, find a guest view table and update it. Now I want to
>> update the hardware table which is iommu_table attached to LIOBN.
>>
>> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.
>>
>> Now VFIO-KVM device needs to extract iommu_table's from the container.
>> These iommu_table pointers are stored in "struct tce_container" which is
>> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
>> cannot export and use that.
>>
>> The other way to go would be adding API to VFIO to enumerate IOMMU groups
>> in a container and use iommu_table pointers stored in iommu_table_group of
>> each group (in fact the very first group will be enough as multiple groups
>> in a container share the table). Adding vfio_container_get_groups() when
>> only first one is needed is quite tricky in terms of maintainers approvals.
>>
>> So what would be the right course of action? Thanks.
> 
> So, from the user side, you need to be able to bind a VFIO backend to
> a particular guest IOMMU.  This suggests a new ioctl() used in
> conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> kernel aware of a LIOBN in the first place, then use
> KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> need a capability to go with it, and some way to unbind as well.

This is what I had in the first place some years ago. And after 5-6 reviews
I was told that there is a VFIO KVM and I should use it.

> To implement that, the ioctl() would need to use a new vfio (kernel
> internal) interface - which can be specific to only the spapr TCE
> type.  That would take the container fd, and return the list of
> iommu_tables in some form or other (or various error conditions,
> obviously).
> 
> So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> LIOBN.  The ioctl() implementation uses the new special interface into
> the spapr_tce vfio backend to get the list of iommu tables, which it
> stores in some private format.

Getting just a list of IOMMU groups would do too. Pushing such API is a
problem, this is how I ended up with the current design.

> The H_PUT_TCE implementation uses that
> stored list of iommu tables to translate H_PUT_TCEs from the guest
> into actions on the host IOMMU tables.
> 
> And, yes, the special interface to the spapr TCE vfio back end is kind
> of a hack.  That's what you get when you need to link to separate
> kernel subsystems for performance reasons.

One can argue if it is a hack, how is this hack better that the existing
approach? :)

Alex, could you please comment on David's suggestion? Thanks!

-- 
Alexey

Attachment:
signature.asc

Description: OpenPGP digital signature