Re: [PATCH v4 1/2] iommu/s390: Fix race with release_device ops

Matthew Rosato <mjrosato@xxxxxxxxxxxxx> · Thu, 1 Sep 2022 16:28:35 -0400

On 9/1/22 5:37 AM, Niklas Schnelle wrote:
> On Thu, 2022-09-01 at 09:56 +0200, Pierre Morel wrote:
>>
>> On 8/31/22 22:12, Matthew Rosato wrote:
>>> With commit fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev
>>> calls") s390-iommu is supposed to handle dynamic switching between IOMMU
>>> domains and the DMA API handling.  However, this commit does not
>>> sufficiently handle the case where the device is released via a call
>>> to the release_device op as it may occur at the same time as an opposing
>>> attach_dev or detach_dev since the group mutex is not held over
>>> release_device.  This was observed if the device is deconfigured during a
>>> small window during vfio-pci initialization and can result in WARNs and
>>> potential kernel panics.
>>>
>>> Handle this by tracking when the device is probed/released via
>>> dev_iommu_priv_set/get().  Ensure that once the device is released only
>>> release_device handles the re-init of the device DMA.
>>>
>>> Fixes: fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev calls")
>>> Signed-off-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx>
>>> ---
>>>   arch/s390/include/asm/pci.h |  1 +
>>>   arch/s390/pci/pci.c         |  1 +
>>>   drivers/iommu/s390-iommu.c  | 39 ++++++++++++++++++++++++++++++++++---
>>>   3 files changed, 38 insertions(+), 3 deletions(-)
>>>
>>>
> ---8<---
>>>   
>>> @@ -206,10 +221,28 @@ static void s390_iommu_release_device(struct device *dev)
>>>
>>
> ---8<---
>>> +		/* Make sure this device is removed from the domain list */
>>>   		domain = iommu_get_domain_for_dev(dev);
>>>   		if (domain)
>>>   			s390_iommu_detach_device(domain, dev);
>>> +		/* Now ensure DMA is initialized from here */
>>> +		mutex_lock(&zdev->dma_domain_lock);
>>> +		if (zdev->s390_domain) {
>>> +			zdev->s390_domain = NULL;
>>> +			zpci_unregister_ioat(zdev, 0);
>>> +			zpci_dma_init_device(zdev);
>>
>> Sorry if it is a stupid question, but two things looks strange to me:
>>
>> - having DMA initialized just after having unregistered the IOAT
>> Is that really all we need to unregister before calling dma_init_device?

This is also how s390-iommu has been handling detach_dev (and still does)

>>
>> - having DMA initialized inside the release_device callback:
>> Why isn't it done in the device_probe ?
> 
> As I understand it iommu_release_device() which calls this code is only
> used when a device goes away. So, I think you're right in that it makes
> little sense to re-initialize DMA at this point, it's going to be torn
> down immediately after anyway. I do wonder if it would be an acceptably
> small change to just set zdev->s390_domain = NULL here and leave DMA
> uninitialized while making zpci_dma_exit_device() deal with that e.g.
> by doing nothing if zdev->dma_table is NULL but I'm not sure.

Right -- since it's a fix, I was trying to keep the changes minimal and this behavior (re-init DMA even on release_device) was existing, it was just always done within s390_iommu_detach_device before.

If you want, I could experiment with setting zdev->dma_table = NULL on the release path only (and checking it in zpci_dma_exit_device())

> 
> Either way I fear this mess really is just a symptom of our current
> design oddity of driving the same IOMMU hardware through both our DMA
> API implementation (arch/s390/pci_dma.c) and the IOMMU driver
> (driver/iommu/s390-iommu.c) and trying to hand off between them
> smoothly where common code instead just layers one atop the other when
> using an IOMMU at all.
> 
> I think the correct medium term solution is to use the common DMA API
> implementation (drivers/iommu/dma-iommu.c) like everyone else. But that
> isn't the minimal fix we need now. 

Agree

> 
> I do have a working prototype of using the common implementation but
> the big problem that I'm still searching a solution for is its
> performance with a virtualized IOMMU where IOTLB flushes (RPCIT on
> s390) are used for shadowing and are expensive and serialized. The
> optimization we used so far for unmap, only doing one global IOTLB
> flush once we run out of IOVA space, is just too much better in that
> scenario to just ignore. As one data point, on an NVMe I get about
> _twice_ the IOPS when using our existing scheme compared to strict
> mode. Which makes sense as IOTLB flushes are known as the bottleneck
> and optimizing unmap like that reduces them by almost half. Queued
> flushing is still much worse likely due to serialization of the
> shadowing, though again it works great on LPAR. To make sure it's not
> due to some bug in the IOMMU driver I even tried converting our
> existing DMA driver to layer on top of the IOMMU driver with the same
> result.
> 
>