On Thu, Mar 29, 2018 at 04:29:21PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 28, 2018 at 03:01:20PM +0300, Mika Westerberg wrote:
> > On Tue, Mar 27, 2018 at 02:23:18PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Feb 26, 2018 at 04:21:09PM +0300, Mika Westerberg wrote:
> > > > When connecting a Thunderbolt endpoint which itself has an
> > > > internal PCIe switch (not the integrated one) the way we
> > > > currently distribute
> > >
> > > I don't understand this distinction between "internal PCIe switch"
> > > and "integrated one".
> >
> > I tried to explain here that the Thunderbolt endpoint itself has an
> > internal (integrated) PCIe switch.
>
> I was confused by this because I thought you meant "Endpoint" in the
> sense used in the PCIe spec, i.e., something with a type 0 config
> header.  But you're using "endpoint" to mean "some collection of
> devices with a Thunderbolt interface on one side".

Yeah, I think the official term is "Thunderbolt endpoint" but I admit
it is confusing here.

> > > From a PCI point of view, I assume you're adding a switch with 4
> > > NIC endpoints below it.
> >
> > Yes, exactly.
> >
> > The PCIe device in question is the Intel Gigabit ET2 Quad Port
> > Server adapter.
> >
> > > > resources does not always work well because devices below
> > > > non-hotplug bridges might need to have their MMIO resources
> > > > aligned to something other than a 1 MB boundary.  As an
> > > > example, connecting a server grade 4-port GbE add-in card adds
> > > > another PCIe switch leading to GbE devices that want to have
> > > > their MMIO resources aligned to a 2 MB boundary instead.
> > >
> > > Is there something unique about Thunderbolt here?  I assume from a
> > > PCI point of view, these NIC ports have BARs that are 2 MB or
> > > larger and hence need 2 MB or larger alignment?
> >
> > No, this is not unique to Thunderbolt in any way.
>
> I think this is a case of confusing things by mentioning irrelevant
> details, e.g., leading off with "When connecting a Thunderbolt
> endpoint" when the problem can occur when hot-adding any PCIe device
> (or maybe only when adding a bridge?  I don't know).
>
> It's good to mention a specific example, but I think it's best if you
> can describe the situation as generically as possible first, then
> cite the case at hand in a way that makes it obvious this is just one
> of many possibilities.

OK, I'll rework the changelog accordingly.

> > > I don't understand how this can work.  You have this:
> > >
> > >   for_each_pci_bridge(dev, bus) {
> > >     if (!hotplug_bridges && normal_bridges == 1) {
> > >       pci_bus_distribute_available_resources(...);   # block 1
> > >     } else if (dev->is_hotplug_bridge) {
> > >       ...                                            # block 2
> > >       pci_bus_distribute_available_resources(...);
> > >     } else {
> > >       ...                                            # block 3
> > >     }
> > >   }
> > >
> > > This patch adds block 3.  Block 3 doesn't do anything with side
> > > effects, i.e., it only updates local variables like io_start,
> > > remaining_io, mmio_start, remaining_mmio, etc.
> >
> > It updates remaining_* so that the next bridge resource is aligned
> > properly (if there is going to be one).
>
> What do you mean by "aligned properly"?  The spec requires that
> bridge memory windows be 1 MB aligned and I/O windows 4 KB aligned.
> Does the current code do that wrong?

Yes it does.
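To make "aligned properly" concrete: 1 MB and 4 KB are only the spec
minimums.  A bridge window also has to honor the largest alignment
required by any BAR behind the bridge, so a 2 MB NIC BAR below a
bridge forces that bridge's memory window to start on a 2 MB
boundary.  Block 3 is meant to keep the running mmio_start /
remaining_mmio (and io_start / remaining_io) values in sync with
that.  Very roughly, the idea is something like the sketch below (an
illustration only, not the literal patch; picking the non-prefetchable
window index and using pci_resource_alignment() here are just my
shorthand):

	} else {
		/*
		 * Non-hotplug bridge: account for the space its memory
		 * window consumes, including any padding inserted to
		 * satisfy an alignment larger than the 1 MB minimum.
		 */
		res = &dev->resource[PCI_BRIDGE_RESOURCES + 1];
		align = pci_resource_alignment(dev, res);
		remaining_mmio -= ALIGN(mmio_start, align) - mmio_start;
		remaining_mmio -= resource_size(res);
		mmio_start = ALIGN(mmio_start, align) + resource_size(res);
	}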
I'll put this in the changelog, but in short, here are the two
bridges where you can see the alignment:

  pci 0000:39:00.0: BAR 14: assigned [mem 0x53300000-0x6a0fffff]
                                          ^^^^^^^^^^
  pci 0000:3a:01.0: BAR 14: assigned [mem 0x53400000-0x547fffff]
                                          ^^^^^^^^^^

The 3a:01.0 (downstream port) is behind 39:00.0 (upstream port) and
leads to the other PCIe switch with the 4 NICs.  Its starting address
is aligned to 2 MB instead of 1 MB.

Next we try to assign the remaining resources to the hotplug
downstream port 3a:04.0:

  pci 0000:3a:04.0: BAR 14: no space for [mem size 0x15a00000]
  pci 0000:3a:04.0: BAR 14: failed to assign [mem size 0x15a00000]

but since we calculated the remaining MMIO space (0x15a00000) without
taking this alignment into account, it does not fit in the
[mem 0x53300000-0x6a0fffff] bridge window:

  start = 0x54800000
  end   = 0x54800000 + 0x15a00000 - 1 = 0x6a1fffff

And 0x6a1fffff > 0x6a0fffff, which leads to the failure.

> > > Those may be used in subsequent iterations, but it's only useful
> > > if there *are* subsequent iterations, so this is dependent on the
> > > order we iterate through the bridges.  It seems like we want the
> > > same results no matter what the order.
> >
> > Here again we walk over the bridges using recursive calls, so we
> > need to know the alignment upfront before we make the next call to
> > pci_bus_distribute_available_resources(), and that information is
> > only available once the previous bridge has been handled.  I
> > suppose this can be done in a better way but unfortunately I have
> > no idea how :-(  Can you suggest something concrete I could do
> > instead if this is not sufficient?
>
> I don't understand this well enough to have any great suggestions.

OK, thanks anyway.

> I did notice that the "!hotplug_bridges && normal_bridges == 1" case
> doesn't look like it needs to be in the for_each_pci_bridge() loop
> because there's only one bridge (I think "hotplug_bridges == 0 &&
> normal_bridges == 1" would be slightly easier to read).
>
> And I suspect you might want to handle the "hotplug_bridges == 1 &&
> normal_bridges == 0" case identically.  That would mean you could
> pull these "list_is_singular()" cases out of the loop, which makes
> sense because if there's only one bridge, it should get everything.

That's a good idea.  I'll do that in the next revision.
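Something like the following is what I have in mind, i.e. handle the
single-bridge case up front and keep the loop only for the mixed case
(a rough, untested sketch; the parameter names are the ones from my
patch and may not match exactly):

	/*
	 * If there is only one bridge on the bus, whether hotplug
	 * capable or not, it should get all of the extra space.
	 */
	if (hotplug_bridges + normal_bridges == 1) {
		for_each_pci_bridge(dev, bus) {
			if (dev->subordinate)
				pci_bus_distribute_available_resources(
					dev->subordinate, add_list,
					available_io, available_mmio,
					available_mmio_pref);
			break;
		}
		return;
	}

	for_each_pci_bridge(dev, bus) {
		if (dev->is_hotplug_bridge) {
			/* block 2 as before */
		} else {
			/* block 3 as before */
		}
	}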