On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@xxxxxxxxx> wrote:
>
> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
> >
> > Anyone? Is it totally useless or wrong approach? Thanks,
>
> I wouldn't say it's either, but I still hate it.
>
> The 4GB mode being per-PHB makes it difficult to use unless we force
> that mode on 100% of the time, which I'd prefer not to do. Ideally
> devices that actually support 64bit addressing (which is most of them)
> should be able to use no-translate mode when possible since a) it's
> faster, and b) it frees up room in the TCE cache for devices that
> actually need them. I know you've done some testing with 100G NICs
> and found the overhead was fine, but IMO that's a bad test since it's
> pretty much the best-case scenario: all the devices on the PHB are in
> the same PE. The PHB's TCE cache only hits when the TCE matches the
> DMA bus address and the PE number for the device, so in a multi-PE
> environment there's a lot of potential for TCE cache thrashing. If
> there were one or two PEs under that PHB it's probably not going to
> matter, but if you have an NVMe rack with 20 drives it starts to look
> a bit ugly.
>
> That all said, it might be worth doing this anyway since we probably
> want the software infrastructure in place to take advantage of it.
> Maybe expand the command line parameters to allow it to be enabled on
> a per-PHB basis rather than globally.

Since we're on the topic, I've been thinking the real issue we have is
that we're trying to pick an "optimal" IOMMU config at a point where we
don't have enough information to work out what's actually optimal. The
IOMMU config is done on a per-PE basis, but since PEs may contain
devices with different DMA masks (looking at you, weird AMD audio
function) we're always going to have to pick something conservative as
the default config for TVE#0 (64K pages, no bypass mapping), since the
driver only tells us what the device actually supports long after the
IOMMU configuration is done.

What we really want is to be able to have a separate IOMMU context for
each device, or at the very least a separate context for the crippled
devices.

We could allow a per-device IOMMU context by extending the Master /
Slave PE thing to cover DMA in addition to MMIO. Right now we only use
slave PEs when a device's MMIO BARs extend over multiple m64 segments.
When that happens an MMIO error causes the PHB to freeze the PE
corresponding to one of those segments, but not any of the others. To
present a single "PE" to the EEH core we check the freeze status of
each of the slave PEs when the EEH core does a PE status check, and if
any of them are frozen we freeze the rest of them too.

When a driver sets a limited DMA mask we could move that device to a
separate slave PE so that it has its own IOMMU context tailored to its
DMA addressing limits.

Thoughts?

Oliver
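
P.S.

To make the ordering problem concrete, this is the usual probe-time
pattern (generic driver boilerplate, not any particular driver). It's
the earliest point the platform hears about the device's real
addressing limits, long after the PE has been set up and TVE#0 has
been programmed:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* illustrative only */
static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int rc;

	rc = pcim_enable_device(pdev);
	if (rc)
		return rc;

	/*
	 * First time the platform learns what the device can address.
	 * The PE and its default DMA window were configured well
	 * before this runs.
	 */
	rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	if (rc) {
		/* crippled device: fall back to a 32-bit mask */
		rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
		if (rc)
			return rc;
	}

	return 0;
}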
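
And a very rough sketch of what I mean by moving a crippled device
into its own slave PE when the driver sets a limited mask. All the
helper names below are made up purely to illustrate the flow (loosely
modelled on what's already in pci-ioda.c); none of them exist today:

/*
 * Hypothetical: called when a driver sets a DMA mask that the
 * shared PE's TVE#0 config can't satisfy.
 */
static int pnv_ioda_dma_use_slave_pe(struct pci_dev *pdev, u64 mask)
{
	struct pnv_ioda_pe *master, *slave;

	/* hypothetical: look up the PE the device currently lives in */
	master = pnv_ioda_get_pe(pdev);

	/*
	 * hypothetical: allocate a slave PE and tie its freeze state
	 * to the master's, same as we do for the MMIO m64 case
	 */
	slave = pnv_ioda_alloc_slave_pe(master);
	if (!slave)
		return -ENOSPC;

	/* hypothetical: re-point the device's RID at the slave PE in the RTT */
	pnv_ioda_move_dev_to_pe(pdev, slave);

	/*
	 * hypothetical: give the slave PE its own TCE table / TVE sized
	 * to cover 0..mask, so the master PE's window can stay in (or
	 * switch to) bypass mode for the well-behaved devices
	 */
	return pnv_ioda_setup_limited_dma_window(slave, mask);
}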