==== Quick summary ====

This series adds kernel support for the Heterogeneous Memory Attribute
Table (HMAT), newly defined in ACPI 6.2:

http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

The HMAT, in concert with the existing System Resource Affinity Table
(SRAT), provides users with information about memory initiators and
memory targets in the system.  A "memory initiator" in this case is any
device such as a CPU or a separate memory I/O device that can initiate a
memory request.  A "memory target" is a CPU-accessible physical address
range.  The HMAT provides performance information (expected latency and
bandwidth, etc.) for various (initiator,target) pairs.  This is mostly
motivated by the need to optimally use performance-differentiated DRAM,
but it also allows us to describe the performance characteristics of
persistent memory.

The purpose of this RFC is to gather feedback on the different options
for enabling the HMAT in the kernel and in userspace.

==== Lots of details ====

The HMAT only covers CPU-addressable memory types, not on-device memory
like what we have with Jerome Glisse's HMM series:

https://lkml.org/lkml/2017/5/24/731

One major conceptual change in ACPI 6.2 related to this work is that
proximity domains no longer need to contain a processor.  We can now have
memory-only proximity domains, which means that we can now have
memory-only Linux NUMA nodes.

Here is an example configuration where we have a single processor, one
range of regular memory and one range of High Bandwidth Memory (HBM):

 +---------------+   +----------------+
 | Processor     |   | Memory         |
 | prox domain 0 +---+ prox domain 1  |
 | NUMA node 1   |   | NUMA node 2    |
 +-------+-------+   +----------------+
         |
 +-------+----------+
 | HBM              |
 | prox domain 2    |
 | NUMA node 0      |
 +------------------+

This gives us one initiator (the processor) and two targets (the two
memory ranges).  Each of these three has its own ACPI proximity domain
and associated Linux NUMA node.  Note also that while there is a 1:1
mapping from each proximity domain to each NUMA node, the numbers don't
necessarily match up.  Additionally we can have extra NUMA nodes that
don't map back to ACPI proximity domains.

The above configuration could also have the processor and one of the two
memory ranges sharing a proximity domain and NUMA node, but for the
purposes of the HMAT the two memory ranges will always need to be
separated.

The overall goal of this series and of the HMAT is to allow users to
identify memory using its performance characteristics.  This can broadly
be done in one of two ways:

Option 1: Provide the user with a way to map between proximity domains
and NUMA nodes and a way to access the HMAT directly (probably via
/sys/firmware/acpi/tables).  Then, through possibly a library and a
daemon, provide an API so that applications can either request
information about memory ranges, or request memory allocations that meet
a given set of performance characteristics.

Option 2: Provide the user with HMAT performance data directly in sysfs,
allowing applications to directly access it without the need for the
library and daemon.

The kernel work for option 1 is started by patches 1-4.  These just
surface the minimal amount of information in sysfs to allow userspace to
map between proximity domains and NUMA nodes so that the raw data in the
HMAT table can be understood.  Patches 5 and 6 enable option 2, adding
performance information from the HMAT to sysfs.
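To make the option 1 flow a bit more concrete, here is a minimal
userspace sketch that walks the NUMA nodes in sysfs, prints each node's
ACPI proximity domain, and then would hand the raw HMAT (read from
/sys/firmware/acpi/tables) to a library or daemon for decoding.  The
per-node "proximity_domain" attribute name below is only an assumption
made for illustration; it is not necessarily the name this series uses:

/*
 * Sketch of an option 1 consumer: map Linux NUMA nodes to ACPI
 * proximity domains via sysfs.  The "proximity_domain" attribute
 * name is hypothetical, used here only to show the shape of the
 * interface.
 */
#include <stdio.h>

#define MAX_NODES 1024

int main(void)
{
	char path[128];
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		FILE *f;
		int pxm;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/proximity_domain",
			 node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node absent or attribute missing */
		if (fscanf(f, "%d", &pxm) == 1)
			printf("NUMA node %d -> proximity domain %d\n",
			       node, pxm);
		fclose(f);
	}

	/*
	 * The raw HMAT would then be read from
	 * /sys/firmware/acpi/tables/HMAT and parsed in userspace,
	 * using the proximity domain numbers printed above to tie
	 * its entries back to NUMA nodes.
	 */
	return 0;
}

With this mapping in hand, all of the actual HMAT interpretation (and any
allocation policy built on top of it) lives in the library/daemon rather
than in the kernel.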
The second option is complicated by the amount of HMAT data that could be
present in very large systems, so in this series we only surface
performance information for local (initiator,target) pairings.  The
changelog for patch 6 discusses this in detail.

==== Next steps ====

There is still a lot of work to be done on this series, but the overall
goal of this RFC is to gather feedback on which of the two options we
should pursue, or whether some third option is preferred.  After that is
done and we have a solid direction we can add support for ACPI hot add,
test more complex configurations, etc.

So, for applications that need to differentiate between memory ranges
based on their performance, which option would work best for you?  Is the
local (initiator,target) performance provided by patch 6 enough, or do
you require performance information for all possible (initiator,target)
pairings?

If option 1 looks best, do we have ideas on what the userspace API would
look like?

For option 2, Dan Williams suggested that it may be worthwhile to allow
multiple memory initiators to be listed as "local" if they all have the
same performance, even if the HMAT's Memory Subsystem Address Range
Structure table only defines a single local initiator.  Do others agree?

What other things should we consider, or what needs do you have that
aren't being addressed?

Ross Zwisler (6):
  ACPICA: add HMAT table definitions
  acpi: add missing include in acpi_numa.h
  acpi: HMAT support in acpi_parse_entries_array()
  hmem: add heterogeneous memory sysfs support
  sysfs: add sysfs_add_group_link()
  hmem: add performance attributes

 MAINTAINERS                         |   5 +
 drivers/acpi/Kconfig                |   1 +
 drivers/acpi/Makefile               |   1 +
 drivers/acpi/hmem/Kconfig           |   7 +
 drivers/acpi/hmem/Makefile          |   2 +
 drivers/acpi/hmem/core.c            | 679 ++++++++++++++++++++++++++++++++++++
 drivers/acpi/hmem/hmem.h            |  56 +++
 drivers/acpi/hmem/initiator.c       |  61 ++++
 drivers/acpi/hmem/perf_attributes.c | 158 +++++++++
 drivers/acpi/hmem/target.c          |  97 ++++++
 drivers/acpi/numa.c                 |   2 +-
 drivers/acpi/tables.c               |  52 ++-
 fs/sysfs/group.c                    |  30 +-
 include/acpi/acpi_numa.h            |   1 +
 include/acpi/actbl1.h               | 119 +++++++
 include/linux/sysfs.h               |   2 +
 16 files changed, 1254 insertions(+), 19 deletions(-)
 create mode 100644 drivers/acpi/hmem/Kconfig
 create mode 100644 drivers/acpi/hmem/Makefile
 create mode 100644 drivers/acpi/hmem/core.c
 create mode 100644 drivers/acpi/hmem/hmem.h
 create mode 100644 drivers/acpi/hmem/initiator.c
 create mode 100644 drivers/acpi/hmem/perf_attributes.c
 create mode 100644 drivers/acpi/hmem/target.c

-- 
2.9.4