On 9/2/22 1:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 02, 2022 at 01:11:09PM -0400, Matthew Rosato wrote:
>> On 9/1/22 4:37 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 01, 2022 at 12:14:24PM -0400, Matthew Rosato wrote:
>>>> On 9/1/22 6:25 AM, Robin Murphy wrote:
>>>>> On 2022-08-31 21:12, Matthew Rosato wrote:
>>>>>> With commit fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev
>>>>>> calls"), s390-iommu is supposed to handle dynamic switching between
>>>>>> IOMMU domains and DMA API handling. However, this commit does not
>>>>>> sufficiently handle the case where the device is released via a call
>>>>>> to the release_device op, as that may occur at the same time as an
>>>>>> opposing attach_dev or detach_dev, since the group mutex is not held
>>>>>> over release_device. This was observed when the device is
>>>>>> deconfigured during a small window of vfio-pci initialization, and
>>>>>> can result in WARNs and potential kernel panics.
>>>>>
>>>>> Hmm, the more I think about it, something doesn't sit right about
>>>>> this whole situation... release_device is called via the notifier
>>>>> from device_del() after the device has been removed from its parent
>>>>> bus and largely dismantled; it should definitely not still have a
>>>>> driver bound by that point, so how is VFIO doing things that manage
>>>>> to race at all?
>>>>>
>>>>> Robin.
>>>>
>>>> So, I generally have seen the issue manifest as one of the calls
>>>> into the iommu core from __vfio_group_unset_container
>>>> (e.g. iommu_detach_group via vfio_iommu_type1) failing with a WARN.
>>>> This happens when the vfio group fd is released, which could be
>>>> coming e.g. from a userspace ioctl VFIO_GROUP_UNSET_CONTAINER.
>>>> AFAICT there's nothing serializing the notion of calling into the
>>>> iommu core here against a device that is simultaneously going
>>>> through release_device (because we don't enter release_device with
>>>> the group mutex held), resulting in unpredictable behavior between
>>>> the dueling attach_dev/detach_dev and the release_device for
>>>> s390-iommu at least.
>>>
>>> Oh, this is a vfio bug.
>>
>> I've been running with your diff applied today on s390 and this
>> indeed fixes the issue by preventing the detach-after-release coming
>> out of vfio.
>
> Heh, I'm shocked it worked at all.
>
> I've been trying to understand Robin's latest remarks, because maybe
> I don't really understand your situation right.
>
> IMHO this is definitely a VFIO bug, because in a single-device group
> we must not allow the domain to remain attached past remove(). Or
> more broadly, we shouldn't be holding ownership of a group without
> also having a driver attached.
>
> But this discussion with Robin about multi-device groups and hotplug
> makes me wonder what your situation is? There is certainly something
> interesting there too, and this can't be a solution to that problem.

It's just a single-device group; that's all we do on s390 today. Of
course, as we move forward to consume s390-iommu for other use cases
(e.g. Niklas' DMA conversion), then yes, I think we will still need
something within s390-iommu to handle competing attach/detach and
release_device calls.

For this particular issue, the scenario is triggered by deconfiguring
that one host device (effectively, pulling the plug) while the device
is passed to QEMU via vfio-pci.

>> Can you send as a patch for review?
>
> After I wrote this I had a better idea: avoid the completion and
> just fully orphan the group fd.
>
> And the patch is kind of messy.
>
> Can you forward me the backtrace you hit also?
It manifests in a few different ways (depends on the timing of the
iommu_detach_group vs release_device), but this is by far the most
common:

[ 402.718355] iommu driver failed to attach the default/blocking domain
[ 402.718380] WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
[ 402.718389] Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
[ 402.718439] CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G W 6.0.0-rc3 #5
[ 402.718442] Hardware name: IBM 3931 A01 782 (LPAR)
[ 402.718443] Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
[ 402.718447]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 402.718450] Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
[ 402.718453]            00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
[ 402.718455]            00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
[ 402.718457]            00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
[ 402.718464] Krnl Code: 000000095bb10d18: c020003d56fc	larl	%r2,000000095c2bbb10
                         000000095bb10d1e: c0e50019d901	brasl	%r14,000000095be4bf20
                        #000000095bb10d24: af000000		mc	0,0
                        >000000095bb10d28: b904002a		lgr	%r2,%r10
                         000000095bb10d2c: ebaff0a00004	lmg	%r10,%r15,160(%r15)
                         000000095bb10d32: c0f4001aa867	brcl	15,000000095be65e00
                         000000095bb10d38: c004002168e0	brcl	0,000000095bf3def8
                         000000095bb10d3e: eb6ff0480024	stmg	%r6,%r15,72(%r15)
[ 402.718532] Call Trace:
[ 402.718534]  [<000000095bb10d28>] iommu_detach_group+0x70/0x80
[ 402.718538] ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
[ 402.718540]  [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
[ 402.718546]  [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
[ 402.718552]  [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
[ 402.718557] pci 0004:00:00.0: Removing from iommu group 4
[ 402.718555]  [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
[ 402.718562]  [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
[ 402.718566]  [<000000095be6c072>] system_call+0x82/0xb0
[ 402.718570] Last Breaking-Event-Address:
[ 402.718571]  [<000000095be4bf80>] __warn_printk+0x60/0x68

The WARN is hit because the attach fails due to the simultaneously
in-flight release_device.

The subject patch was basically attempting to serialize access to the
value that was being stomped over within s390-iommu (zdev->s390_domain)
as a result of the competing iommu_detach_group and release_device. By
ensuring the vfio group is cleaned up before the release, I no longer
see the competing threads; release_device now happens after the group
is cleaned up.

> (Though I'm not sure I can get to this promptly, I have only 4 working
> days before LPC and still many things to do)
>
> Thanks,
> Jason