On Tue, Sep 27, 2022 at 06:24:54PM +0200, Niklas Schnelle wrote:
> On Mon, 2022-09-26 at 10:51 -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 26, 2022 at 03:29:49PM +0200, Niklas Schnelle wrote:
> > > I did miss a problem in my initial answer. While zpci_register_ioat()
> > > is indeed the actual "attach" operation, it assumes that at that
> > > point no DMA address translations are registered. In that state DMA
> > > is blocked of course. With that zpci_register_ioat() needs to come
> > > after the zpci_unregister_ioat() that is part of
> > > __s390_iommu_detach_device() and zpci_dma_exit_device(). If we do
> > > call those though we fundamentally need to restore the previous
> > > domain / DMA API state on any subsequent failure. If we don't
> > > restore we would leave the device detached from any domain with DMA
> > > blocked. I wonder if this could be an acceptable failure state
> > > though? It's safe as no DMA is possible and we could get out of it
> > > with a successful attach.
> >
> > Is this because of that FW call it does?
> >
> > It seems like an FW API misdesign to not allow an unfailable change
> > of translation; if the FW forces an unregister first then there are
> > always error cases you can't correctly recover from.
> >
> > IMHO if the FW fails an attach you are just busted, there is no
> > reason to think it would suddenly accept attaching the old domain
> > just because it has a different physical address, right?
>
> While I can't come up with a case where an immediate retry would
> actually help, there is at least one error case that one should be
> able to recover from. The attach can fail if a firmware driven PCI
> device recovery is in progress. Now if that happens during a switch
> between domains I think one will have to do the equivalent of
>
> echo 0 > /sys/bus/pci/slots/<dev>/power; echo 1 > /.../power
>
> That of course tears down the whole PCI device so we don't have to
> answer the question of whether the old or new domain is the active
> one.
>
> So I think in consequence you're still right, attempting to re-attach
> is a lot of hassle for little or no gain.

I don't know about FW driven PCI device recovery, but if FW can
trigger some behavior that causes the kernel driver to malfunction
(and not having a DMA domain attached is malfunctioning), then that is
a WARN_ON condition, IMHO.

Especially if the FW device recovery is done co-operatively with a
driver, then it is reasonable to demand that no domains change while
recovery is ongoing.

Regardless, I still think this is a bug in the FW - it doesn't make
sense that a domain can be attached when FW device recovery starts,
but that the kernel can't change the domain while the FW device
recovery is ongoing. Presumably FW should block DMA during recovery
and just book-keep what the domain should be post-recovery.

Keep in mind, we already have some WARN_ONs on this path:

void iommu_group_release_dma_owner(struct iommu_group *group)
{
	ret = __iommu_group_set_domain(group, group->default_domain);
	WARN(ret, "iommu driver failed to attach the default domain");

Attach really is not something that is able to fail in normal cases..

> and the device is in a defined and isolated state. Maybe in the
> future we could even make this explicit and attach to the blocking
> domain on failed attach, does that make sense?
This would help somewhat: if the blocking domain is properly recorded
as the current group domain, then things like
iommu_device_use_default_domain() will fail, meaning no drivers can be
attached to the device until it is hotplugged in/out.

However, this is hard to do properly because multi-device groups
cannot tolerate devices within the group having different current
domains.

Overall, changing to the blocking domain or back to the default domain
should never fail, and drivers should be designed with this in mind.
If fixing the FW is not feasible, perhaps the better error recovery is
to sleep/retry until it succeeds (rough sketch below).

Jason
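A minimal, untested sketch of what that sleep/retry could look like at
the zpci level - the retry budget, the delay, and the exact
zpci_register_ioat() calling convention used here are assumptions on
my side, not a real patch:

/*
 * Untested sketch: keep retrying the low-level attach while FW device
 * recovery may be in progress, instead of failing the domain attach
 * outright.  Retry count and delay are made up.
 */
#include <linux/delay.h>	/* msleep() */

static int zpci_register_ioat_retry(struct zpci_dev *zdev, u64 iota)
{
	u8 status;
	int rc, tries;

	for (tries = 0; tries < 20; tries++) {
		rc = zpci_register_ioat(zdev, 0, zdev->start_dma,
					zdev->end_dma, iota, &status);
		if (!rc)
			return 0;
		/* assume the failure is a FW recovery window; back off */
		msleep(100);
	}
	return rc;
}

The caller in the attach path would treat a final failure the same way
it does today, but a transient FW recovery would no longer surface as
a failed attach.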