Re: [PATCH v2 1/3] iommu/s390: Fix duplicate domain attachments

Niklas Schnelle <schnelle@xxxxxxxxxxxxx> · Tue, 27 Sep 2022 18:33:48 +0200

On Mon, 2022-09-26 at 10:46 -0300, Jason Gunthorpe wrote:
> On Mon, Sep 26, 2022 at 11:00:53AM +0200, Niklas Schnelle wrote:
> 
> > > > +out_unregister_restore:
> > > > +	zpci_unregister_ioat(zdev, 0);
> > > >  out_restore:
> > > > -	if (!zdev->s390_domain) {
> > > > +	zdev->dma_table = NULL;
> > > > +	if (prev_domain)
> > > > +		s390_iommu_attach_device(&prev_domain->domain,
> > > > +					 dev);
> > > 
> > > Huh. That is a surprising thing
> > > 
> > > I think this function needs some re-ordering to avoid this condition
> > > 
> > > The checks for aperture should be earlier, and they are not quite
> > > right. The aperture is only allowed to grow. If it starts out as 0 and
> > > then is set to something valid on first attach, a later attach cannot
> > > then shrink it. There could already be mappings in the domain under
> > > the now invalidated aperture and no caller is prepared to deal with
> > > this.
> > 
> > Ohh I think this is indeed broken. Let me rephrase to see if I
> > understand correctly. You're saying that while we only allow exactly
> > matching apertures on additional attaches, we do allow shrinking if
> > there is temporarily no device attached to the domain. That part is
> > then broken because there could still be mappings outside the new
> > aperture stored in the translation tables?
> 
> Right, go from 0 -> sized apperture on first attach, and then once it
> is sized it doesn't change again.

Ok, some background. This is currently largely a theoretical issue as
all native PCI devices that currently exist on s390 have the same zdev-
>start_dma (0x100000000) and we will set zdev->end_dma to (zdev-
>start_dma + MEMORY_SIZE) in zpci_dma_init_device() to save space for
our IOVA allocation bitmap. So currently no shrinking can occur.

Still I think we should do this properly. Also by the way, there is a
known off-by-one (aperture too small) in the current checks for which I
was about to send a patch as part of the conversion to using dma-
iommu.c.

>  
> > > That leaves the only error case as zpci_register_ioat() - which seems
> > > like it is the actual "attach" operation. Since
> > > __s390_iommu_detach_device() is just internal accounting (and can't
> > > fail) it should be moved after
> > > 
> > > So the logic order should be
> > > 
> > > 1) Attempt to widen the aperture, if this fails the domain is
> > >    incompatible bail immediately
> > 
> > Question. If the widening succeeds but we fail later during the attach
> > e.g. in 2) then the aperture remains widend or would that be rolled
> > back? 
> 
> I'd leave it widened.
> 
> IMHO I don't like this trick of setting the aperture on attach. It is
> logically wrong. The aperture is part of the configuration of the page
> table itself. The domain should know what page table format and thus
> apterture it has the moment it is allocated. Usually this is the
> related to the number of levels in the radix tree.
> 
> It seems to me that the issue here is trying to use the aperture when
> the reserved region is the appropriate tool.
> 
> eg I see that s390_domain_alloc calls dma_alloc_cpu_table() which just
> allocates a 3 level radix tree. This means it has a specific max
> address that can be passed to dma_walk_cpu_trans(). So the aperture
> should be fixed based on the radix tree parameters.

Yes, we even have the ZPCI_TABLE_SIZE_RT constant already that is the
maximum number of translations.

> 
> The device specific start/end should be represented as a reserved
> regions per-device. See patch below..
> 
> This is meaningful because it effects when VFIO can share the domains
> across devices. If devices have different reserved ranges we can still
> share domains, so long as no mapping is placed in the union of the
> reserved ranges. However if you vary the aperture, like is currently
> happening, then the domains become unsharable.

Ok, interesting idea.

As we haven't used these reserved regions so far and I'm not familiar
with what they are usually used for, I had a look at my x86_64 test box
(amd_iommu=on) and see the following:

# cat /sys/bus/pci/devices/*/iommu_group/reserved_regions | sort | uniq 
0x00000000fee00000 0x00000000feefffff msi
0x000000fd00000000 0x000000ffffffffff reserved

Not sure what the non-MSI reservation is for? It does seem like x86_64
also uses this for quite large ranges.

With your patch and the off by one fixed (see below) on an s390x
machine with 32 GiB memory I then get the following:

# cat /sys/bus/pci/devices/*/iommu_group/reserved_regions | sort | uniq
0x0000000000000000 0x00000000ffffffff reserved
0x0000000900000000 0x000003ffffffffff reserved

As the upper limit is shown inclusive this looks correct to me. 

Not all is rosy though, while this seems to work with the current code
and KVM pass-through, it breaks my conversion for using dma-iommu. 

This is because I'm getting a map request for an IOVA in the reserved
region. I pass the zdev->start_dma and zdev->end_dma to
iommu_setup_dma_ops() but while iommu_dma_init_domain() uses the lower
limit for the base_pfn it only checks if the upper limit fits in the
aperture and otherwise ignores it.  Looking at iommu_dma_alloc_iova() I
don't think that avoids the reserved regions either but it does use the
domain->geometry.aperture_end or dev->bus_dma_limit, whichever is
smaller.

With that knowledge setting dev->bus_dma_limit to zdev->end_dma makes
my conversion work again that feels wrong though but at least confirms
the issue.

> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index c898bcbbce118f..ba80325da76cd9 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -51,6 +51,12 @@ static bool s390_iommu_capable(enum iommu_cap cap)
>  	}
>  }
>  
> 
---8<---
>  
> +static void s390_iommu_get_resv_regions(struct device *dev,
> +					struct list_head *list)
> +{
> +	struct zpci_dev *zdev = to_zpci_dev(dev);
> +	struct iommu_resv_region *region;
> +
> +	if (zdev->start_dma) {
> +		region = iommu_alloc_resv_region(0, zdev->start_dma, 0,
> +						 IOMMU_RESV_RESERVED);
> +		if (!region)
> +			return;
> +		list_add_tail(&region->list, list);
> +	}
> +
> +	if (zdev->end_dma < MAX_DMA_TABLE_ADDR) {
> +		region = iommu_alloc_resv_region(
> +			zdev->end_dma, MAX_DMA_TABLE_ADDR - zdev->end_dma, 0,

Not your fault since there is a pre-existing bug in the range check but
I think there is an off-by-one error here as zdev->end_dma is the
highest usable address.

> +			IOMMU_RESV_RESERVED);
> +		if (!region)
> +			return;
> +		list_add_tail(&region->list, list);
> +	}
> +}
> +
>  static void s390_iommu_detach_device(struct iommu_domain *domain,
>  				     struct device *dev)
>  {
> @@ -376,6 +398,7 @@ static const struct iommu_ops s390_iommu_ops = {
>  	.release_device = s390_iommu_release_device,
>  	.device_group = generic_device_group,
>  	.pgsize_bitmap = S390_IOMMU_PGSIZES,
> +	.get_resv_regions = s390_iommu_get_resv_regions,
>  	.default_domain_ops = &(const struct iommu_domain_ops) {
>  		.attach_dev	= s390_iommu_attach_device,
>  		.detach_dev	= s390_iommu_detach_device,