On Mon, 2022-09-26 at 10:51 -0300, Jason Gunthorpe wrote: > On Mon, Sep 26, 2022 at 03:29:49PM +0200, Niklas Schnelle wrote: > > I did miss a problem in my initial answer. While zpci_register_ioat() > > is indeed the actual "attach" operation, it assumes that at that point > > no DMA address translations are registered. In that state DMA is > > blocked of course. With that zpci_register_ioat() needs to come after > > the zpci_unregister_ioat() that is part of __s390_iommu_detach_device() > > and zpci_dma_exit_device(). If we do call those though we fundamentally > > need to restore the previous domain / DMA API state on any subsequent > > failure. If we don't restore we would leave the device detached from > > any domain with DMA blocked. I wonder if this could be an acceptable > > failure state though? It's safe as no DMA is possible and we could get > > out of it with a successful attach. > > Is this because of that FW call it does? > > It seems like an FW API misdesign to not allow an unfailable change of > translation, if the FW forces an unregister first then there are > always error cases you can't correctly recover from. > > IMHO if the FW fails an attach you are just busted, there is no reason > to think it would suddenly accept attaching the old domain just > because it has a different physical address, right? While I can't come up with a case where an immediate retry would actually help, there is at least one error case that one should be able to recover from. The attach can fail if a firmware driven PCI device recovery is in progress. Now if that happens during a switch between domains I think one will have to do the equivalent of echo 0 > /sys/bus/pci/slots/<dev>/power; echo 1 > /.../power That of course tears down the whole PCI device so we don't have to answer the question if the old or new domain is the active one. So I think in the consequences you're still right, attempting to re- attach is a lot of hassle for little or no gain. > > So make it simple, leave it DMA blocked and throw a WARN_ON.. To me a WARN_ON() isn't appropriate here, as stated above there is at least one error case that is recoverable and doesn't necessarily indicate a progamming bug. Also while not recoverable the device having been surprise unplugged is another case where the attach failing is not necessarily a bug. It would also thus cause false panics for e.g. CI systems with PANIC_ON_WARN. That said yes I think leaving DMA blocked and the device detached from any domain is reasonable. That way the above recover scenario will work and the device is in a defined and isolated state. Maybe in the future we could even make this explicit and attach to the blocking domain on failed attach, does that make sense?