Based on Bjorn's latest response it looks as if pci_acpi_scan_root() is only used as a fallback when the platform's BIOS has not put a proper ACPI SRAT table and/or _PXM method in place. It will be interesting to see if the 'dmesg' log and 'acpidump' information ends up proving this out. Regardless of the above, if I'm understanding your patches - specifically 1/3 - it looks as if the arch/x86/pci/amd_bus.c changes remove a lot of the CPU family specifics and yield a generally more generic approach that won't have to be constantly updated as new AMD CPU families are introduced going forward (exactly the situation we are now in). Today, before patch 1/3 here, amd_bus.c only handles AMD CPU families K8, 0x10, and 0x11 (ref: static struct pci_hostbridge_probe pci_probes[] __initdata = { ...) and as such, when the 0x15 family was introduced, the response supplied upstream was to add to an existing quirk - commit f62ef5f. I believe with the approach supplied here - patch 1/3 - the 'quirk_amd_nb_node()' quirk becomes obsolete and can be removed. Do you agree? Myron On Wed, Mar 12, 2014 at 3:13 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote: > On Tue, Mar 11, 2014 at 12:12 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote: >> On Thu, Mar 6, 2014 at 1:03 PM, Suravee Suthikulpanit >> <suravee.suthikulpanit@xxxxxxx> wrote: >>> On 3/6/2014 11:40 AM, Bjorn Helgaas wrote: >>>> >>>> [+cc Yinghai, sorry I didn't think of it before] >>>> >>>> On Wed, Mar 5, 2014 at 11:30 PM, Suravee Suthikulpanit >>>> <suravee.suthikulpanit@xxxxxxx> wrote: >>>>> >>>>> On 3/5/2014 8:13 PM, Suravee Suthikulanit wrote: >>>>>> >>>>>> >>>>>> On 3/5/2014 3:24 PM, Bjorn Helgaas wrote: >>>>>>> >>>>>>> >>>>>>> [+cc linux-acpi] >>>>>>> >>>>>>> On Wed, Mar 5, 2014 at 2:06 PM, <suravee.suthikulpanit@xxxxxxx> wrote: >>>>>>>> >>>>>>>> >>>>>>>> From: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx> >>>>>>>> >>>>>>>> The current code only supports upto AMD hostbridge for family11h. >>>>>>>> This causes PCI numa_node information to be reported incorrectly >>>>>>>> for newer family with multi sockets. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Where is the incorrect reporting? In ACPI tables? Is this patch a >>>>>>> way to cover up firmware defects in the ACPI description? Or is this >>>>>>> for machines without ACPI (it seems unlikely that machines with new >>>>>>> AMD processors would not have ACPI)? >>>>>> >>>>>> >>>>>> >>>>>> This is incorrectly reported in the sysfs for each PCI device (e.g. >>>>>> /devices/pci0000:50/0000:50:00.2/numa_node). Without the patch, they >>>>>> return -1. >>>>>> >>>>>> In file arch/x86/pci/acpi.c, in function pci_acpi_scan_root(), it is >>>>>> queries the node information as following: >>>>>> >>>>>> #ifdef CONFIG_ACPI_NUMA >>>>>> pxm = acpi_get_pxm(device->handle); >>>>>> if (pxm >= 0) >>>>>> node = pxm_to_node(pxm); >>>>>> if (node != -1) >>>>>> set_mp_bus_to_node(busnum, node); >>>>>> else >>>>>> #endif >>>>>> node = get_mp_bus_to_node(busnum); >>>>>> >>>>>> In this case, I see that the acpi_get_pxm() returns -1. Therefore, it >>>>>> falls back to using the node information in mp_bus_to_node[]. So, >>>>>> without this patch, it would also returning -1. >>>>>> >>>>>> Also, the spec mentioned that the _PXM is optional, so I am not sure if >>>>>> this is a firmware bug. >>>>> >>>>> >>>>> I am not quite familiar with the ACPI for this part. However, after >>>>> taking >>>>> a look at the code (in driver/acpi/pci_root.c: acpi_pci_root_add()), I >>>>> believe it's trying to locate _PXM method in the DSDT table, in which I >>>>> don't see any _PXM methods. >>>> >>>> >>>> This sure looks like a firmware bug. True, _PXM is optional, but if >>>> the firmware doesn't provide it, nobody should be surprised that the >>>> OS thinks everything is in the same proximity domain. >>>> >>>> I would not endorse extending amd_bus.c for new CPUs. That just >>>> covers up firmware problems like this, and if you ever run a different >>>> OS on the box, you'll trip over them again. And I don't think a patch >>>> like this will even be a possibility for Windows. >>> >>> I understand and am trying to verify this with the BIOS engineers. However, >>> this is currently affecting family15h servers out in the field. We can try >>> to fix ACPI for newer generation of machines, but it won't be practical to >>> push this BIOS fix to all the BIOS vendors and system vendors for older >>> platforms, as they tend to. >>> >>> What if I localize the extension to the changes to access node information >>> in the hostbridge for just the famil15h which is mostly used in our main >>> server products? Would that be acceptable? >> >> I assume the system is fully functional even without these patches, >> right? The only effect of these changes should be a performance >> improvement. >> >> So the choices are: >> >> 1) Change the BIOS to provide _PXM >> 2) Change Linux with your patches >> >> Either way, the customer has to upgrade something. Choice 1) gets you >> the performance improvement on all Linux and Windows releases, even >> the ones that are already in the field. Choice 2) fixes future Linux >> distros and whatever old releases you can convince the distro to >> backport to, and doesn't help Windows at all. It makes work for all >> the distros and we (i.e., I) have to worry about maintaining this in >> the future. >> >> I'm curious to see what your BIOS folks say. It seems strange that >> after all these years, they wouldn't provide something relatively >> simple like _PXM, so I wonder if there's more to the story. > > I looked at this a bit more. Your patches fill in the > mp_bus_to_node[] table, and pci_acpi_scan_root() uses > get_mp_bus_to_node() to get that information back out. But > pci_acpi_scan_root() only uses get_mp_bus_to_node() if acpi_get_pxm() > fails, or if pxm_to_node() returns -1. > > If acpi_get_pxm() failed, that's pretty good indication that the _PXM > method is missing or broken. If pxm_to_node() failed, it looks like > that could mean the SRAT table is missing or broken, and we didn't > fill in the relevant pxm_to_node_map[] entry. So it's possible that > we do have a _PXM method, but there's something wrong with the SRAT. > > Can you collect an acpidump and complete dmesg log from a system with > the problem? They might be too big for the mailing list; if so, you > can attach them to a kernel.org bugzilla and just include the URL > here. > > Bjorn > -- > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html