On 02/07/2017 13:29, Bjorn Helgaas wrote: [snip] >> >> However, IP27 is completely different in this regard. Instead of using >> ioremapped addresses for I/O, IP27 has a dedicated address range, >> 0x92xxxxxxxxxxxxxx, that is used for all I/O access. Since this is >> uncached physical address space, the generic MIPS PCI code will not >> probe it correctly and thus, the original behavior of PCI_PROBE_ONLY >> needs to be restored only for the IP27 platform to bypass this logic >> and have working PCI, at least for the IO6/IO6G board that houses the >> base devices, until a better solution is found. > > It sounds like there's something different about how ioremap() works > on these platforms and PCI probing is tripping over that. I'd really > like to understand more about this difference to see if we can > converge that instead of adding back the PCI_PROBE_ONLY usage. > > Drivers shouldn't know whether they're running on IP27 or IP30, and > they should be using ioremap() in both cases. Does ioremap() work > differently on IP27 and IP30? Does this have something to do with > plat_ioremap() or fixup_bigphys_addr()? Okay, I think I have a rough idea on the differences with the I/O addresses on IP27 and IP30. It all boils down to the return value of a macro called NODE_SWIN_BASE, which operates differently on each platform. I don't fuilly understand //why// SGI did it a certain way, but HUB in IP27 gives you two sets of "windows" into Crosstalk space, "Big" (BWIN) and "Little" (LWIN). IP30's HEART chip has three sets of these windows (as I recall now from and appendix in the BRIDGE docs), Big (BWIN), medium (MWIN) and small (SWIN). So on IP27, HUB has seven BWIN spaces at 512MB each, and one LWIN space at 256MB (with a second LWIN that's aliased to the first). The LWIN is broken into sixteen 16MB spaces, which, I think, is what we're calling "Small windows" or SWIN's (Ralf, correct me if wrong). IP30's HEART allocates 64G of space per Xtalk widget for BWIN, 2G/widget for MWIN, and 16MB/widget for SWIN. As such, since HUB doesn't have an MWIN, you only find definitions of NODE_BWIN_BASE and NODE_SWIN_BASE in either IP27 or IP30's sources. Specifically, our focus is really only on NODE_SWIN_BASE, because the BWIN spaces need to be mapped into translation tables somewhere, and Linux lacks that capability. There is also a corresponding RAW_NODE_SWIN_BASE for both platforms that gives back the full address in uncached physical address space. IP30's definition of NODE_SWIN_BASE is: #define NODE_SWIN_BASE(nasid, widget) \ (0x0000000010000000 | (((unsigned long)(widget)) << 24)) And IP30's RAW_NODE_SWIN_BASE: #define RAW_NODE_SWIN_BASE(nasid, widget) \ (0x9000000010000000 | (((unsigned long)(widget)) << 24)) Contrast that to IP27's NODE_SWIN_BASE (for the __ASSEMBLY__ case) in arch/mips/include/asm/sn/sn0/addrs.h: #define NODE_SWIN_BASE(nasid, widget) \ (NODE_IO_BASE(nasid) + (UINT64_CAST(widget) << SWIN_SIZE_BITS)) (IP27's RAW_NODE_SWIN_BASE is the same as above) SWIN_SIZE_BITS is '24', so it matches IP30's. The difference is the value of NODE_IO_BASE that takes a nasid (NUMA address space identifier or node id) and returns the correct address in IP27's memory space to get to a specific node's I/O Base, which is 0x9200000000000000 plus the node ID. Merged with the widget ID shifted by 24 bits, and that's why we're getting the 0x92xx addresses. Basically, IP30 is returning the widget offset //without// the I/O base factored in, and that's why RAW_NODE_SWIN_BASE is provided to get the widget address in uncached physical address space. Whereas IP27's NODE_SWIN_BASE is an alias to its RAW_NODE_SWIN_BASE definition and turning 0x92xxx addresses. --- Then we look in arch/mips/include/asm/io.h at the function __ioremap_mode and see this construct near the top: if (cpu_has_64bit_addresses) { u64 base = UNCAC_BASE; /* * R10000 supports a 2 bit uncached attribute therefore * UNCAC_BASE may not equal IO_BASE. */ if (flags == _CACHE_UNCACHED) base = (u64) IO_BASE; return (void __iomem *) (unsigned long) (base + offset); For IP30, when we set up the BRIDGE parameters, we use NODE_SWIN_BASE to set the BRIDGE's MEM and IO start/end ranges and thus get back an address where the upper bits are all zero's. Which the PCI layer then fiddles with to yield an address like 0x000000f100000000. When that gets run through ioremap() by a driver's probe, we should be getting something like 0x900000f100000000, because IP30 uses the generic definition of IO_BASE defined in MIPS asm-generic/spaces.h. And the devices probe correctly. For IP27, I program BRIDGE the same way, but because IP27's NODE_SWIN_BASE is returning the 0x92xxx addresses in uncached space, my guess is ioremap(), as called by the drivers, is doubling that, yielding addresses like 0x24xxx, which unprotected, should generate an unhandled kernel unaligned access or such. Properly protected, attempts to access such an address probably just fail outright. --- So my thinking, which I'll test out over the weekend, is whether we can change IP27's NODE_SWIN_BASE to behave more like IP30's and just return the widget id shifted by 24 bits, then let ioremap() calls actually do the conversion to the 0x92xxx IO_BASE address space. There's only a handful of places in IP27's code and headers where NODE_SWIN_BASE is actually used, so I'll have to see which ones need to become RAW_NODE_SWIN_BASE (e.g., for direct HUB or XBOW access), and which can stay as NODE_SWIN_BASE. That would then allow the correct operation of ioremap() to return the 0x92xxx address space for drivers and I guess PCI to probe correctly. We'll find out! --J