Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB

"Oliver O'Halloran" <oohall@xxxxxxxxx> · Tue, 21 Apr 2020 00:04:11 +1000

On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
> 
> On 17/04/2020 11:26, Russell Currey wrote:
> > 
> > For what it's worth this sounds like a good idea to me, it just sounds
> > tricky to implement.  You're adding another layer of complexity on top
> > of EEH (well, making things look simple to the EEH core and doing your
> > own freezing on top of it) in addition to the DMA handling.
> > 
> > If it works then great, just has a high potential to become a new bug
> > haven.
> 
> imho putting every PCI function to a separate PE is the right thing to
> do here but I've been told it is not that simple, and I believe that.
> Reusing slave PEs seems unreliable - the configuration will depend on
> whether a PE occupied enough segments to give an unique PE to a PCI
> function and my little brain explodes.

You're overthinking it.

If a bus has no m64 MMIO space we're free to assign whatever PE number
we want since the PE for the bus isn't fixed by the m64 segment its
BARs were placed in. For those buses we assign a PE number starting
from the max and counting down (0xff, 0xfe, etc). For example, with
this PHB:

# lspci -s 1:: -v | egrep '0001:|Memory at'
0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
        Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
f117 (rev 06) (prog-if 02 [NVM Express])
        Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
        Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710/X557-AT 10GBASE-T (rev 02)
        Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
        Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
(redundant functions removed)

We get these PE assignments:

0001:00:00.0 - 0xfe # Root port
0001:01:00.0 - 0xfc # upstream port
0001:02:01.0 - 0xfd # downstream port bus
0001:02:08.0 - 0xfd
0001:02:09.0 - 0xfd
0001:03:00.0 - 0x0  # NVMe
0001:09:00.0 - 0x1  # Ethernet

All the switch ports either have 32bit BARs or no BARs so they get
assigned PEs starting from the top. The Ethernet and the NVMe have some
prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
respectively since that's where their m64 BARs are located. For our
DMA-only slave PEs any MMIO space would remain in their master PE so we
can allocate a PE number for the DMA-PE (our iommu context).

I think the key thing to realise is that we'd only be using the DMA-PE
when a crippled DMA mask is set by the driver. In all other cases we
can just use the "native PE" and when the driver unbinds we can de-
allocate our DMA-PE and return the device to the PE containing it's
MMIO BARs. I think we can keep things relatively sane that way and the
real issue is detecting EEH events on the DMA-PE.

On P9 we don't have PHB error interrupts enabled in firmware so we're
completely reliant on seeing a 0xFF response to an MMIO and manually
checking the PE status to see if it's due to a PE freeze. For our DMA-
PE it could be frozen (due to a bad DMA) and we'd never notice unless
we explicitly check the status of the DMA-PE since there's no
corresponding MMIO space to freeze.

On P8 we had PHB Error interrupts so you would notice that *something*
happened, then go check for frozen PEs, at which point the master-slave 
grouping logic would see that one PE in the group was frozen and freeze
the rest of them. We can re-use that on that, but we still need
something to actually notice a freeze occured. A background poller
checking for freezes on each PE might do the trick.

> So this is not happening soon.

Oh ye of little faith.

> For the time being, this patchset is good for:
> 1. weird hardware which has limited DMA mask (this is why the patchset
> was written in the first place)
> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).

Sure, but it's still dependent on having firmware which supports the
4GB hack and I don't think that's in any offical firmware releases yet.

Oliver