On Tue, Mar 11, 2014 at 12:12 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote: > On Thu, Mar 6, 2014 at 1:03 PM, Suravee Suthikulpanit > <suravee.suthikulpanit@xxxxxxx> wrote: >> On 3/6/2014 11:40 AM, Bjorn Helgaas wrote: >>> >>> [+cc Yinghai, sorry I didn't think of it before] >>> >>> On Wed, Mar 5, 2014 at 11:30 PM, Suravee Suthikulpanit >>> <suravee.suthikulpanit@xxxxxxx> wrote: >>>> >>>> On 3/5/2014 8:13 PM, Suravee Suthikulanit wrote: >>>>> >>>>> >>>>> On 3/5/2014 3:24 PM, Bjorn Helgaas wrote: >>>>>> >>>>>> >>>>>> [+cc linux-acpi] >>>>>> >>>>>> On Wed, Mar 5, 2014 at 2:06 PM, <suravee.suthikulpanit@xxxxxxx> wrote: >>>>>>> >>>>>>> >>>>>>> From: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx> >>>>>>> >>>>>>> The current code only supports upto AMD hostbridge for family11h. >>>>>>> This causes PCI numa_node information to be reported incorrectly >>>>>>> for newer family with multi sockets. >>>>>> >>>>>> >>>>>> >>>>>> Where is the incorrect reporting? In ACPI tables? Is this patch a >>>>>> way to cover up firmware defects in the ACPI description? Or is this >>>>>> for machines without ACPI (it seems unlikely that machines with new >>>>>> AMD processors would not have ACPI)? >>>>> >>>>> >>>>> >>>>> This is incorrectly reported in the sysfs for each PCI device (e.g. >>>>> /devices/pci0000:50/0000:50:00.2/numa_node). Without the patch, they >>>>> return -1. >>>>> >>>>> In file arch/x86/pci/acpi.c, in function pci_acpi_scan_root(), it is >>>>> queries the node information as following: >>>>> >>>>> #ifdef CONFIG_ACPI_NUMA >>>>> pxm = acpi_get_pxm(device->handle); >>>>> if (pxm >= 0) >>>>> node = pxm_to_node(pxm); >>>>> if (node != -1) >>>>> set_mp_bus_to_node(busnum, node); >>>>> else >>>>> #endif >>>>> node = get_mp_bus_to_node(busnum); >>>>> >>>>> In this case, I see that the acpi_get_pxm() returns -1. Therefore, it >>>>> falls back to using the node information in mp_bus_to_node[]. So, >>>>> without this patch, it would also returning -1. >>>>> >>>>> Also, the spec mentioned that the _PXM is optional, so I am not sure if >>>>> this is a firmware bug. >>>> >>>> >>>> I am not quite familiar with the ACPI for this part. However, after >>>> taking >>>> a look at the code (in driver/acpi/pci_root.c: acpi_pci_root_add()), I >>>> believe it's trying to locate _PXM method in the DSDT table, in which I >>>> don't see any _PXM methods. >>> >>> >>> This sure looks like a firmware bug. True, _PXM is optional, but if >>> the firmware doesn't provide it, nobody should be surprised that the >>> OS thinks everything is in the same proximity domain. >>> >>> I would not endorse extending amd_bus.c for new CPUs. That just >>> covers up firmware problems like this, and if you ever run a different >>> OS on the box, you'll trip over them again. And I don't think a patch >>> like this will even be a possibility for Windows. >> >> I understand and am trying to verify this with the BIOS engineers. However, >> this is currently affecting family15h servers out in the field. We can try >> to fix ACPI for newer generation of machines, but it won't be practical to >> push this BIOS fix to all the BIOS vendors and system vendors for older >> platforms, as they tend to. >> >> What if I localize the extension to the changes to access node information >> in the hostbridge for just the famil15h which is mostly used in our main >> server products? Would that be acceptable? > > I assume the system is fully functional even without these patches, > right? The only effect of these changes should be a performance > improvement. > > So the choices are: > > 1) Change the BIOS to provide _PXM > 2) Change Linux with your patches > > Either way, the customer has to upgrade something. Choice 1) gets you > the performance improvement on all Linux and Windows releases, even > the ones that are already in the field. Choice 2) fixes future Linux > distros and whatever old releases you can convince the distro to > backport to, and doesn't help Windows at all. It makes work for all > the distros and we (i.e., I) have to worry about maintaining this in > the future. > > I'm curious to see what your BIOS folks say. It seems strange that > after all these years, they wouldn't provide something relatively > simple like _PXM, so I wonder if there's more to the story. I looked at this a bit more. Your patches fill in the mp_bus_to_node[] table, and pci_acpi_scan_root() uses get_mp_bus_to_node() to get that information back out. But pci_acpi_scan_root() only uses get_mp_bus_to_node() if acpi_get_pxm() fails, or if pxm_to_node() returns -1. If acpi_get_pxm() failed, that's pretty good indication that the _PXM method is missing or broken. If pxm_to_node() failed, it looks like that could mean the SRAT table is missing or broken, and we didn't fill in the relevant pxm_to_node_map[] entry. So it's possible that we do have a _PXM method, but there's something wrong with the SRAT. Can you collect an acpidump and complete dmesg log from a system with the problem? They might be too big for the mailing list; if so, you can attach them to a kernel.org bugzilla and just include the URL here. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html