On Thu, 13 Dec 2012, Bjorn Helgaas wrote:
> On Thu, Dec 13, 2012 at 4:51 AM, Jan Glauber <jang@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, 2012-12-10 at 14:14 -0700, Bjorn Helgaas wrote:
> >> On Wed, Nov 14, 2012 at 2:41 AM, Jan Glauber <jang@xxxxxxxxxxxxxxxxxx> wrote:
> >> > Add PCI support for s390 (only 64 bit mode is supported by hardware):
> >> > - PCI facility tests
> >> > - PCI instructions: pcilg, pcistg, pcistb, stpcifc, mpcifc, rpcit
> >> > - map readb/w/l/q and writeb/w/l/q to pcilg and pcistg instructions
> >> > - pci_iomap implementation
> >> > - memcpy_fromio/toio
> >> > - pci_root_ops using special pcilg/pcistg
> >> > - device, bus and domain allocation
> >> >
> >> > Signed-off-by: Jan Glauber <jang@xxxxxxxxxxxxxxxxxx>
> >>
> >> I think these patches are in -next already, so I just have some general comments & questions.
> >
> > Yes, since the feedback was manageable we decided to give the patches some exposure in -next, and if no one complains we'll just go for the next merge window. BTW, Sebastian & Gerald (on CC:) will continue the work on the PCI code.
> >
> >> My overall impression is that these are exceptionally well done. They're easy to read, well organized, and well documented. It's a refreshing change from a lot of the stuff that's posted.
> >
> > Thanks Björn!
> >
> >> As with other arches that run on top of hypervisors, you have arch-specific code that enumerates PCI devices using hypervisor calls, and you hook into the PCI core at a lower level than pci_scan_root_bus(). That is, you call pci_create_root_bus(), pci_scan_single_device(), pci_bus_add_devices(), etc., directly from the arch code. This is the typical approach, but it does create more dependencies between the arch code and the PCI core than I'd like.
> >>
> >> Did you consider hiding any of the hypervisor details behind the PCI config accessor interface? If that could be done, the overall structure might be more similar to other architectures.
> >
> > You mean pci_root_ops? I'm not sure I understand how that can be used to hide hypervisor details.
>
> The object of doing this would be to let you use pci_scan_root_bus(), so you wouldn't have to call pci_scan_single_device() and pci_bus_add_resources() directly. The idea is to make pci_read() and pci_write() smart enough that the PCI core can use them as though you have a normal PCI implementation. When pci_scan_root_bus() enumerates devices on the root bus using pci_scan_child_bus(), it does config reads on the VENDOR_ID of bb:00.0, bb:01.0, ..., bb:1f.0. Your pci_read() would return an error or 0xffffffff data for everything except bb:00.0 (I guess it actually already does that).
>
> Then some of the other init, e.g., zpci_enable_device(), could be done by the standard pcibios hooks such as pcibios_add_device() and pcibios_enable_device(), which would remove some of the PCI grunge from zpci_scan_devices() and the s390 hotplug driver.
>
> > One reason why we use the lower level functions is that we need to create the root bus (and its resources) much earlier than the pci_dev. For instance pci_hp_register wants a pci_bus to create the PCI slot, and the slot can exist without a pci_dev.
>
> There must be something else going on here; a bus is *always* created before any of the pci_devs on the bus.
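
As a rough sketch, the suggestion could look something like this on the s390 side (the zpci_setup_bus_resources() and to_zpci() helpers are made-up names for illustration, not functions from the posted patches):

#include <linux/pci.h>

/*
 * Sketch only: let pci_scan_root_bus() drive enumeration through the
 * pci_root_ops config accessors, and do the zPCI-specific per-device
 * setup from the standard pcibios hooks instead of zpci_scan_devices().
 * struct zpci_dev is the per-function data from the patch set.
 */
static int zpci_scan_bus(struct zpci_dev *zdev)
{
        LIST_HEAD(resources);
        struct pci_bus *bus;

        /* made-up helper: add the bus resources for this function */
        zpci_setup_bus_resources(zdev, &resources);

        /*
         * The core probes bb:00.0 .. bb:1f.0 via pci_root_ops; config
         * reads for functions the hypervisor did not present fail, so
         * only the one real function is created.
         */
        bus = pci_scan_root_bus(NULL, 0, &pci_root_ops, zdev, &resources);
        if (!bus)
                return -EIO;

        zdev->bus = bus;
        return 0;
}

int pcibios_add_device(struct pci_dev *pdev)
{
        struct zpci_dev *zdev = to_zpci(pdev);  /* made-up lookup */

        return zpci_enable_device(zdev);
}
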
> One thing that looks a little strange is that zpci_list seems to be sort of a cross between a list of PCI host bridges and a list of PCI devices. Understandable, since you usually have a one-to-one correspondence between them, but for your hotplug stuff, you can have the host bridge and slot without the pci_dev.
>
> The hotplug slot support is really a function of the host bridge support, so it doesn't seem quite right to me to split it into a separate module (although that is the way most PCI hotplug drivers are currently structured). One reason is that it makes hot-add of a host bridge awkward: you have to have the "if (hotplug_ops.create_slot)" test in zpci_create_device().
>
> >> The current config accessors only work for dev/fn 00.0 (they fail when "devfn != ZPCI_DEVFN"). Why is that? It looks like it precludes multi-function devices and basically prevents you from building an arbitrary PCI hierarchy.
> >
> > Our hypervisor does not support multi-function devices. In fact the hypervisor will limit the reported PCI devices to a hand-picked selection, so we can be sure that there will be no unsupported devices. The PCI hierarchy is hidden by the hypervisor. We only use the PCI domain number; bus and devfn are always zero. So it looks like every function is directly attached to a PCI root complex.
> >
> > That was the reason for the sanity check, but thinking about it I could remove it, since even though we don't support multi-function devices I think the s390 code should be more agnostic about these limitations.
>
> The config accessor interface should be defined so it works for all PCI devices that exist, and fails for devices that do not exist. The approach you've taken so far is to prevent the PCI core from even attempting access to non-existent devices, which requires you to use some lower-level PCI interfaces. The alternative I'm raising as a possibility is to allow the PCI core to attempt accesses to non-existent devices, but have the accessor be smart enough to use the hypervisor or other arch-specific data to determine whether the device exists or not, and act accordingly.
>
> Basically the idea is that if we put more smarts in the config accessors, we can make the interface between the PCI core and the architecture thinner.
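
A rough sketch of what such a "smart" accessor could look like, just to check my understanding (zpci_find_device() and the zpci_cfg_load()/zpci_cfg_store() helpers are placeholder names for a lookup in the hypervisor-provided device list and for the pcilg/pcistg based config routines):

static int zpci_read(struct pci_bus *bus, unsigned int devfn,
                     int where, int size, u32 *val)
{
        /* placeholder lookup against the hypervisor-provided list */
        struct zpci_dev *zdev = zpci_find_device(bus, devfn);

        if (!zdev) {
                /* function was not presented by the hypervisor */
                *val = 0xffffffff;
                return PCIBIOS_DEVICE_NOT_FOUND;
        }

        /* placeholder for the pcilg-based config read */
        return zpci_cfg_load(zdev, where, val, size);
}

static int zpci_write(struct pci_bus *bus, unsigned int devfn,
                      int where, int size, u32 val)
{
        struct zpci_dev *zdev = zpci_find_device(bus, devfn);

        if (!zdev)
                return PCIBIOS_DEVICE_NOT_FOUND;

        /* placeholder for the pcistg-based config write */
        return zpci_cfg_store(zdev, where, val, size);
}

static struct pci_ops pci_root_ops = {
        .read  = zpci_read,
        .write = zpci_write,
};
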
> >> zpci_map_resources() is very unusual. The struct pci_dev resource[] table normally contains CPU physical addresses, but zpci_map_resources() fills it with virtual addresses. I suspect this has something to do with the "BAR spaces are not disjunctive on s390" comment. It almost sounds like that's describing host bridges where the PCI bus address is not equal to the CPU physical address -- in that case, device A and device B may have the same values in their BARs, but they can still be distinct if they're under host bridges that apply different CPU-to-bus address translations.
> >
> > Yeah, you've found it... I've had 3 or 4 tries on different implementations but all failed. If we use the resources as they are we cannot map them to the instructions (and ioremap does not help because there we cannot find out which device the resource belongs to). If we change the BARs on the card MMIO stops working. I don't know about host bridges - if we were to use a host bridge, at which point in the translation process would it kick in?
>
> Here's how it works. The PCI host bridge claims regions of the CPU physical address space and forwards transactions in those regions to the PCI root bus. Some host bridges can apply an offset when forwarding, so the address on the bus may be different from the address from the CPU. The bus address is what matches a PCI BAR.
>
> You can tell the PCI core about this translation by using pci_add_resource_offset() instead of pci_add_resource(). When the PCI core reads a BAR, it applies the offset to convert the BAR value into a CPU physical address.
>
> For example, let's say you have two host bridges:
>
> PCI host bridge to bus 0000:00
> pci_bus 0000:00: root bus resource [mem 0x100000000-0x1ffffffff] (bus address [0x00000000-0xffffffff])
> PCI host bridge to bus 0001:00
> pci_bus 0001:00: root bus resource [mem 0x200000000-0x2ffffffff] (bus address [0x00000000-0xffffffff])
>
> Both bridges use the same bus address range, and BARs of devices on bus 0000:00 can have the same values as those of devices on bus 0001:00. But there's no ambiguity because a CPU access to 0x1_0000_0000 will be claimed by the first bridge and translated to bus address 0x0 on bus 0000:00, while a CPU access to 0x2_0000_0000 will be claimed by the second bridge and translated to bus address 0x0 on bus 0001:00.
>
> The pci_dev resources will contain CPU physical addresses, not the BAR values themselves. These addresses implicitly identify the host bridge (and, in your case, the pci_dev, since you have at most one pci_dev per host bridge).

Hi Bjorn,

thanks for the explanation. I'll look into it.

Regards,
Sebastian
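
P.S. To check that I've understood the offset idea, this is roughly how I would expect the root bus setup to look if we described the translation to the core (the bar_res/mmio_base/mmio_size fields on zdev are invented for the example; pci_add_resource_offset() and pci_create_root_bus() are the interfaces you mentioned):

static struct pci_bus *zpci_create_bus(struct zpci_dev *zdev)
{
        LIST_HEAD(resources);
        struct resource *res = &zdev->bar_res;  /* invented field */

        /* CPU window for this function, e.g. starting at 0x100000000 */
        res->start = zdev->mmio_base;
        res->end   = res->start + zdev->mmio_size - 1;
        res->flags = IORESOURCE_MEM;

        /* offset = CPU address - bus address; bus addresses start at 0 */
        pci_add_resource_offset(&resources, res, res->start);

        return pci_create_root_bus(NULL, 0, &pci_root_ops, zdev, &resources);
}

If that works out, the resource[] table would end up holding CPU physical addresses and we could probably drop the virtual-address trick in zpci_map_resources().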