Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list management

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 24 May 2018 12:20:32 -0600

[Cc +Joerg: AMD-Vi observation towards the end]

On Wed, 18 Apr 2018 12:40:38 +0100
Shameer Kolothum <shameerali.kolothum.thodi@xxxxxxxxxx> wrote:

> This series introduces an iova list associated with a vfio 
> iommu. The list is kept updated taking care of iommu apertures,
> and reserved regions. Also this series adds checks for any conflict
> with existing dma mappings whenever a new device group is attached to
> the domain.
> 
> User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> ioctl capability chains. Any dma map request outside the valid iova
> range will be rejected.

Hi Shameer,

I ran into two minor issues in testing this series, both related to
mdev usage of type1.  First, in patch 5/7 when we try to validate a dma
map request:

> +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> +				dma_addr_t start, dma_addr_t end)
> +{
> +	struct list_head *iova = &iommu->iova_list;
> +	struct vfio_iova *node;
> +
> +	list_for_each_entry(node, iova, list) {
> +		if ((start >= node->start) && (end <= node->end))
> +			return true;
> +	}
> +
> +	return false;
> +}

A container with only an mdev device will have an empty list because it
has not backing iommu to set ranges or reserved regions, so any dma map
will fail.  I think this is resolved as follows:

--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
                        return true;
        }
 
-       return false;
+       return list_empty(&iommu->iova_list);
 }
 
ie. return false only if there was anything to test against.

The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:

+		ret = vfio_iommu_iova_build_caps(iommu, &caps);
+		if (ret)
+			return ret;

And build_caps has:

+	list_for_each_entry(iova, &iommu->iova_list, list)
+		iovas++;
+
+	if (!iovas) {
+		ret = -EINVAL;

Therefore if the iova list is empty, as for mdevs, the use can no
longer even call VFIO_IOMMU_GET_INFO on the container, which is a
regression.  Again, I think the fix is simple:

@@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
                iovas++;
 
        if (!iovas) {
-               ret = -EINVAL;
+               ret = 0;
                goto out_unlock;
        }
 
ie. build_caps needs to handle lack of an iova_list as a non-error.

Also, I wrote a small unit test to validate the iova list for my
systems[1].  With the above changes, my Intel test system gives expected
results:

# ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-438a18bc698f /sys/bus/pci/devices/0000:00:1b.0
---- Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f ----
Initial info struct size: 0x18
No caps
---- Adding device: 0000:00:1b.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x48
New info struct size: 0x48
argsz: 0x48, flags: 0x3, cap_offset: 0x18
	00: 4800 0000 0300 0000 00f0 ffff ffff ffff
	10: 1800 0000 0000 0000 0100 0100 0000 0000
	20: 0200 0000 0000 0000 0000 0000 0000 0000
	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
	40: ffff ffff ffff ff01 
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 01ffffffffffffff

Adding an mdev device to the container results in no iova list, adding
the physical device updates to the expected set with the MSI range
excluded.

I was a little surprised by an AMD system:

# ./vfio-type1-iova-list /sys/bus/pci/devices/0000:01:00.0
---- Adding device: 0000:01:00.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x58
New info struct size: 0x58
argsz: 0x58, flags: 0x3, cap_offset: 0x18
	00: 5800 0000 0300 0000 00f0 ffff 7fff ffff
	10: 1800 0000 0000 0000 0100 0100 0000 0000
	20: 0300 0000 0000 0000 0000 0000 0000 0000
	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
	40: ffff ffff fc00 0000 0000 0000 0001 0000
	50: ffff ffff ffff ffff 
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 000000fcffffffff
	02: 0000010000000000 - ffffffffffffffff

The additional excluded range is the HyperTransport area (which for
99.9+% of use cases isn't going to be a problem) is a rather surprising
limit for the size of a VM we can use on AMD, just under 1TB.  I fully
expect we'll see a bug report from that and we should be thinking about
how to address it.  Otherwise, if the changes above look good to you
I'll incorporate them into their respective patches and push to my next
branch.  Thanks,

Alex

[1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd