On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote: > > On 17/04/2020 11:26, Russell Currey wrote: > > > > For what it's worth this sounds like a good idea to me, it just sounds > > tricky to implement. You're adding another layer of complexity on top > > of EEH (well, making things look simple to the EEH core and doing your > > own freezing on top of it) in addition to the DMA handling. > > > > If it works then great, just has a high potential to become a new bug > > haven. > > imho putting every PCI function to a separate PE is the right thing to > do here but I've been told it is not that simple, and I believe that. > Reusing slave PEs seems unreliable - the configuration will depend on > whether a PE occupied enough segments to give an unique PE to a PCI > function and my little brain explodes. You're overthinking it. If a bus has no m64 MMIO space we're free to assign whatever PE number we want since the PE for the bus isn't fixed by the m64 segment its BARs were placed in. For those buses we assign a PE number starting from the max and counting down (0xff, 0xfe, etc). For example, with this PHB: # lspci -s 1:: -v | egrep '0001:|Memory at' 0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode]) 0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K] 0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) 0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) 0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca) (prog-if 00 [Normal decode]) 0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device f117 (rev 06) (prog-if 02 [NVM Express]) Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K] Memory at 6004000000000 (64-bit, prefetchable) [size=1M] 0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710/X557-AT 10GBASE-T (rev 02) Memory at 6004048000000 (64-bit, prefetchable) [size=8M] Memory at 600404a000000 (64-bit, prefetchable) [size=32K] (redundant functions removed) We get these PE assignments: 0001:00:00.0 - 0xfe # Root port 0001:01:00.0 - 0xfc # upstream port 0001:02:01.0 - 0xfd # downstream port bus 0001:02:08.0 - 0xfd 0001:02:09.0 - 0xfd 0001:03:00.0 - 0x0 # NVMe 0001:09:00.0 - 0x1 # Ethernet All the switch ports either have 32bit BARs or no BARs so they get assigned PEs starting from the top. The Ethernet and the NVMe have some prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1 respectively since that's where their m64 BARs are located. For our DMA-only slave PEs any MMIO space would remain in their master PE so we can allocate a PE number for the DMA-PE (our iommu context). I think the key thing to realise is that we'd only be using the DMA-PE when a crippled DMA mask is set by the driver. In all other cases we can just use the "native PE" and when the driver unbinds we can de- allocate our DMA-PE and return the device to the PE containing it's MMIO BARs. I think we can keep things relatively sane that way and the real issue is detecting EEH events on the DMA-PE. On P9 we don't have PHB error interrupts enabled in firmware so we're completely reliant on seeing a 0xFF response to an MMIO and manually checking the PE status to see if it's due to a PE freeze. For our DMA- PE it could be frozen (due to a bad DMA) and we'd never notice unless we explicitly check the status of the DMA-PE since there's no corresponding MMIO space to freeze. On P8 we had PHB Error interrupts so you would notice that *something* happened, then go check for frozen PEs, at which point the master-slave grouping logic would see that one PE in the group was frozen and freeze the rest of them. We can re-use that on that, but we still need something to actually notice a freeze occured. A background poller checking for freezes on each PE might do the trick. > So this is not happening soon. Oh ye of little faith. > For the time being, this patchset is good for: > 1. weird hardware which has limited DMA mask (this is why the patchset > was written in the first place) > 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled). Sure, but it's still dependent on having firmware which supports the 4GB hack and I don't think that's in any offical firmware releases yet. Oliver