On Thu, 13 Dec 2012, Bjorn Helgaas wrote:
> On Thu, Dec 13, 2012 at 4:51 AM, Jan Glauber <jang@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, 2012-12-10 at 14:14 -0700, Bjorn Helgaas wrote:
> >> On Wed, Nov 14, 2012 at 2:41 AM, Jan Glauber <jang@xxxxxxxxxxxxxxxxxx> wrote:
> >> > Add PCI support for s390 (only 64 bit mode is supported by hardware):
> >> > - PCI facility tests
> >> > - PCI instructions: pcilg, pcistg, pcistb, stpcifc, mpcifc, rpcit
> >> > - map readb/w/l/q and writeb/w/l/q to pcilg and pcistg instructions
> >> > - pci_iomap implementation
> >> > - memcpy_fromio/toio
> >> > - pci_root_ops using special pcilg/pcistg
> >> > - device, bus and domain allocation
> >> >
> >> > Signed-off-by: Jan Glauber <jang@xxxxxxxxxxxxxxxxxx>
> >>
> >> I think these patches are in -next already, so I just have some general comments & questions.
> >
> > Yes, since the feedback was manageable we decided to give the patches some exposure in -next, and if no one complains we'll just go for the next merge window. BTW, Sebastian & Gerald (on CC:) will continue the work on the PCI code.
> >
> >> My overall impression is that these are exceptionally well done. They're easy to read, well organized, and well documented. It's a refreshing change from a lot of the stuff that's posted.
> >
> > Thanks Björn!
> >
> >> As with other arches that run on top of hypervisors, you have arch-specific code that enumerates PCI devices using hypervisor calls, and you hook into the PCI core at a lower level than pci_scan_root_bus(). That is, you call pci_create_root_bus(), pci_scan_single_device(), pci_bus_add_devices(), etc., directly from the arch code. This is the typical approach, but it does create more dependencies between the arch code and the PCI core than I'd like.
> >>
> >> Did you consider hiding any of the hypervisor details behind the PCI config accessor interface? If that could be done, the overall structure might be more similar to other architectures.
> >
> > You mean pci_root_ops? I'm not sure I understand how that can be used to hide hypervisor details.
>
> The object of doing this would be to let you use pci_scan_root_bus(), so you wouldn't have to call pci_scan_single_device() and pci_bus_add_resources() directly. The idea is to make pci_read() and pci_write() smart enough that the PCI core can use them as though you have a normal PCI implementation. When pci_scan_root_bus() enumerates devices on the root bus using pci_scan_child_bus(), it does config reads on the VENDOR_ID of bb:00.0, bb:01.0, ..., bb:1f.0. Your pci_read() would return an error or 0xffffffff data for everything except bb:00.0 (I guess it actually already does that).
>
> Then some of the other init, e.g., zpci_enable_device(), could be done by the standard pcibios hooks such as pcibios_add_device() and pcibios_enable_device(), which would remove some of the PCI grunge from zpci_scan_devices() and the s390 hotplug driver.
>
> > One reason why we use the lower level functions is that we need to create the root bus (and its resources) much earlier than the pci_dev. For instance pci_hp_register wants a pci_bus to create the PCI slot, and the slot can exist without a pci_dev.
>
> There must be something else going on here; a bus is *always* created before any of the pci_devs on the bus.
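
As a rough sketch, the suggestion could look something like this on the s390 side (the zpci_setup_bus_resources() and to_zpci() helpers are made-up names for illustration, not functions from the posted patches):

#include <linux/pci.h>

/*
 * Sketch only: let pci_scan_root_bus() drive enumeration through the
 * pci_root_ops config accessors, and do the zPCI-specific per-device
 * setup from the standard pcibios hooks instead of zpci_scan_devices().
 * struct zpci_dev is the per-function data from the patch set.
 */
static int zpci_scan_bus(struct zpci_dev *zdev)
{
        LIST_HEAD(resources);
        struct pci_bus *bus;

        /* made-up helper: add the bus resources for this function */
        zpci_setup_bus_resources(zdev, &resources);

        /*
         * The core probes bb:00.0 .. bb:1f.0 via pci_root_ops; config
         * reads for functions the hypervisor did not present fail, so
         * only the one real function is created.
         */
        bus = pci_scan_root_bus(NULL, 0, &pci_root_ops, zdev, &resources);
        if (!bus)
                return -EIO;

        zdev->bus = bus;
        return 0;
}

int pcibios_add_device(struct pci_dev *pdev)
{
        struct zpci_dev *zdev = to_zpci(pdev);  /* made-up lookup */

        return zpci_enable_device(zdev);
}
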
> One thing that looks a little strange is that zpci_list seems to be sort of a cross between a list of PCI host bridges and a list of PCI devices. Understandable, since you usually have a one-to-one correspondence between them, but for your hotplug stuff, you can have the host bridge and slot without the pci_dev.
>
> The hotplug slot support is really a function of the host bridge support, so it doesn't seem quite right to me to split it into a separate module (although that is the way most PCI hotplug drivers are currently structured). One reason is that it makes hot-add of a host bridge awkward: you have to have the "if (hotplug_ops.create_slot)" test in zpci_create_device().
>
> >> The current config accessors only work for dev/fn 00.0 (they fail when "devfn != ZPCI_DEVFN"). Why is that? It looks like it precludes multi-function devices and basically prevents you from building an arbitrary PCI hierarchy.
> >
> > Our hypervisor does not support multi-function devices. In fact the hypervisor will limit the reported PCI devices to a hand-picked selection, so we can be sure that there will be no unsupported devices. The PCI hierarchy is hidden by the hypervisor. We only use the PCI domain number; bus and devfn are always zero. So it looks like every function is directly attached to a PCI root complex.
> >
> > That was the reason for the sanity check, but thinking about it I could remove it, since even though we don't support multi-function devices I think the s390 code should be more agnostic about these limitations.
>
> The config accessor interface should be defined so it works for all PCI devices that exist, and fails for devices that do not exist. The approach you've taken so far is to prevent the PCI core from even attempting access to non-existent devices, which requires you to use some lower-level PCI interfaces. The alternative I'm raising as a possibility is to allow the PCI core to attempt accesses to non-existent devices, but have the accessor be smart enough to use the hypervisor or other arch-specific data to determine whether the device exists or not, and act accordingly.
>
> Basically the idea is that if we put more smarts in the config accessors, we can make the interface between the PCI core and the architecture thinner.
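
A rough sketch of what such a "smart" accessor could look like, just to check my understanding (zpci_find_device() and the zpci_cfg_load()/zpci_cfg_store() helpers are placeholder names for a lookup in the hypervisor-provided device list and for the pcilg/pcistg based config routines):

static int zpci_read(struct pci_bus *bus, unsigned int devfn,
                     int where, int size, u32 *val)
{
        /* placeholder lookup against the hypervisor-provided list */
        struct zpci_dev *zdev = zpci_find_device(bus, devfn);

        if (!zdev) {
                /* function was not presented by the hypervisor */
                *val = 0xffffffff;
                return PCIBIOS_DEVICE_NOT_FOUND;
        }

        /* placeholder for the pcilg-based config read */
        return zpci_cfg_load(zdev, where, val, size);
}

static int zpci_write(struct pci_bus *bus, unsigned int devfn,
                      int where, int size, u32 val)
{
        struct zpci_dev *zdev = zpci_find_device(bus, devfn);

        if (!zdev)
                return PCIBIOS_DEVICE_NOT_FOUND;

        /* placeholder for the pcistg-based config write */
        return zpci_cfg_store(zdev, where, val, size);
}

static struct pci_ops pci_root_ops = {
        .read  = zpci_read,
        .write = zpci_write,
};
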
> >> zpci_map_resources() is very unusual. The struct pci_dev resource[] table normally contains CPU physical addresses, but zpci_map_resources() fills it with virtual addresses. I suspect this has something to do with the "BAR spaces are not disjunctive on s390" comment. It almost sounds like that's describing host bridges where the PCI bus address is not equal to the CPU physical address -- in that case, device A and device B may have the same values in their BARs, but they can still be distinct if they're under host bridges that apply different CPU-to-bus address translations.
> >
> > Yeah, you've found it... I've had 3 or 4 tries on different implementations but all failed. If we use the resources as they are we cannot map them to the instructions (and ioremap does not help because there we cannot find out which device the resource belongs to). If we change the BARs on the card MMIO stops working. I don't know about host bridges - if we were to use a host bridge, at which point in the translation process would it kick in?
>
> Here's how it works. The PCI host bridge claims regions of the CPU physical address space and forwards transactions in those regions to the PCI root bus. Some host bridges can apply an offset when forwarding, so the address on the bus may be different from the address from the CPU. The bus address is what matches a PCI BAR.
>
> You can tell the PCI core about this translation by using pci_add_resource_offset() instead of pci_add_resource(). When the PCI core reads a BAR, it applies the offset to convert the BAR value into a CPU physical address.
>
> For example, let's say you have two host bridges:
>
> PCI host bridge to bus 0000:00
> pci_bus 0000:00: root bus resource [mem 0x100000000-0x1ffffffff] (bus address [0x00000000-0xffffffff])
> PCI host bridge to bus 0001:00
> pci_bus 0001:00: root bus resource [mem 0x200000000-0x2ffffffff] (bus address [0x00000000-0xffffffff])
>
> Both bridges use the same bus address range, and BARs of devices on bus 0000:00 can have the same values as those of devices on bus 0001:00. But there's no ambiguity because a CPU access to 0x1_0000_0000 will be claimed by the first bridge and translated to bus address 0x0 on bus 0000:00, while a CPU access to 0x2_0000_0000 will be claimed by the second bridge and translated to bus address 0x0 on bus 0001:00.
>
> The pci_dev resources will contain CPU physical addresses, not the BAR values themselves. These addresses implicitly identify the host bridge (and, in your case, the pci_dev, since you have at most one pci_dev per host bridge).

Hi Bjorn,

thanks for the explanation. I'll look into it.

Regards,
Sebastian
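
P.S. To check that I've understood the offset idea, this is roughly how I would expect the root bus setup to look if we described the translation to the core (the bar_res/mmio_base/mmio_size fields on zdev are invented for the example; pci_add_resource_offset() and pci_create_root_bus() are the interfaces you mentioned):

static struct pci_bus *zpci_create_bus(struct zpci_dev *zdev)
{
        LIST_HEAD(resources);
        struct resource *res = &zdev->bar_res;  /* invented field */

        /* CPU window for this function, e.g. starting at 0x100000000 */
        res->start = zdev->mmio_base;
        res->end   = res->start + zdev->mmio_size - 1;
        res->flags = IORESOURCE_MEM;

        /* offset = CPU address - bus address; bus addresses start at 0 */
        pci_add_resource_offset(&resources, res, res->start);

        return pci_create_root_bus(NULL, 0, &pci_root_ops, zdev, &resources);
}

If that works out, the resource[] table would end up holding CPU physical addresses and we could probably drop the virtual-address trick in zpci_map_resources().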