Re: [PATCH 3/4] ACPI/MRRM: Add "node" symlink to /sys/devices/system/memory/rangeX

On 11.02.25 19:05, Luck, Tony wrote:
>>> What is going to remove this symlink if the memory goes away?  Or do
>>> these never get removed?
>>>
>>> symlinks in sysfs created like this always worry me.  What is going to
>>> use it?
>>
>> On top of that, we seem to be building a separate hierarchy here.
>>
>> /sys/devices/system/memory/ operates in memory block granularity.

> What defines the memory blocks? I'd initially assumed some connection
> to the ACPI SRAT table. But on my test system there are only three
> entries in SRAT that define non-zero sized memory blocks (two on
> socket/node 0 and one on socket/node 1), yet there are:
>      memory0 .. memory32 directories
> in /sys/devices/system/memory.

Each memory block is the same size (e.g., 128 MiB .. 2 GiB on x86-64).

The default is memory section granularity (e.g., 128 MiB on x86-64), but some configs allow for increasing it: see arch/x86/mm/init_64.c:memory_block_size_bytes(), and in particular probe_memory_block_size().

They define the granularity in which we can online/offline/add/remove physical memory managed by the buddy.

We create these blocks during boot/during hotplug, and link them to the relevant nodes.

They do not reflect the HW state, but the state in which Linux manages that memory (through the buddy).
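
To illustrate that granularity, a minimal userspace sketch (my example, not from the patch) that onlines/offlines a single block through the documented memoryX/state attribute; it needs root, and offlining only succeeds if the block is actually offlinable:

#include <stdio.h>

int main(int argc, char *argv[])
{
	char path[128];
	FILE *f;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <block id> <online|offline>\n", argv[0]);
		return 1;
	}

	/* Each memoryX block is onlined/offlined as one unit. */
	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%s/state", argv[1]);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	/* The write fails (e.g., -EBUSY) if the block cannot be offlined. */
	if (fprintf(f, "%s\n", argv[2]) < 0 || fclose(f) != 0) {
		perror(path);
		return 1;
	}
	return 0;
}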


> The phys_device and phys_index files aren't helping me figure out
> what each of them mean.

Yes, see Documentation/admin-guide/mm/memory-hotplug.rst

phys_device is a legacy thing for s390x, and phys_index is just the memory block ID.

You can derive the address range corresponding to a memory block using the ID.

/sys/devices/system/memory/block_size_bytes tells you the size of each block.

Address range of block X:
  [ X*block_size_bytes .. (X+1)*block_size_bytes )
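
For example, a quick userspace sketch of that calculation (block_size_bytes is exported as a hex value):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	unsigned long long block_size, id;
	FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");

	if (argc != 2 || !f || fscanf(f, "%llx", &block_size) != 1) {
		fprintf(stderr, "usage: %s <memory block id>\n", argv[0]);
		return 1;
	}
	fclose(f);

	/* [ X*block_size_bytes .. (X+1)*block_size_bytes ) */
	id = strtoull(argv[1], NULL, 0);
	printf("memory%llu: 0x%llx .. 0x%llx\n",
	       id, id * block_size, (id + 1) * block_size);
	return 0;
}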


Now, the whole interface here is designed for handling memory hotplug:

obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o

It's worth noting that

1) Blocks might not be all-memory (e.g., memory holes). In that case,
   offlining/unplug is not supported.
2) Blocks might span multiple NUMA nodes (e.g., node ends / starts in
   the middle of a block). Similarly, in that case
   offlining/unplug is not supported.

I assume 1) is not a problem. I assume 2) could be a problem for your use case.
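
In case it helps, a rough sketch to check for 2) on a given system, assuming that blocks overlapping several nodes get multiple nodeY links (they are registered under each node they span):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/devices/system/memory";
	struct dirent *de;
	DIR *dir = opendir(base);

	if (!dir) {
		perror(base);
		return 1;
	}

	while ((de = readdir(dir)) != NULL) {
		char path[PATH_MAX];
		struct dirent *be;
		int id, nid, nodes = 0;
		DIR *blk;

		/* Only the memoryX entries are memory block devices. */
		if (sscanf(de->d_name, "memory%d", &id) != 1)
			continue;

		snprintf(path, sizeof(path), "%s/%s", base, de->d_name);
		blk = opendir(path);
		if (!blk)
			continue;

		/* Count the nodeY links from this block to NUMA nodes. */
		while ((be = readdir(blk)) != NULL)
			if (sscanf(be->d_name, "node%d", &nid) == 1)
				nodes++;
		closedir(blk);

		if (nodes > 1)
			printf("memory%d spans %d nodes\n", id, nodes);
	}
	closedir(dir);
	return 0;
}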


>> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>>
>> Why is the memory-block granularity insufficient, and why do we have to
>> squeeze in another range API here?

> If an MRRM range consists of some set of memory blocks (making
> sure that no memory block spans across MRRM range boundaries),
> then I could add the {local,remote}_region_id files into the memory
> block directories.
>
> This could work now while the region assignments are done by the
> BIOS. But in the future, when the OS gets the opportunity to change
> them, it might be weird if an MRRM range consists of multiple memory
> block ranges, since the region_ids in each all update together.

What about memory ranges not managed by the buddy (e.g., dax/pmem ranges not exposed to the buddy through dax/kmem driver, memory hidden from Linux using mem=X etc.)?


> /sys/devices/system/memory seemed like a logical place for
> memory ranges. But should I jump up a level and make a new
> /sys/devices/system/memory_regions directory to expose these
> ranges?

Let's take one step back. We do have

1) /proc/iomem to list physical device ranges, without a notion of nodes / other information. Maybe we could extend it, but it might be hard, depending on *what* information we need to expose and for which memory.

/proc/iomem also doesn't indicate "System RAM" for memory not managed by the buddy.

2) /sys/devices/system/memory/memoryX and /sys/devices/system/node/

Again, the memory part is more hotplug focused, and we treat individual memory blocks as "memory block devices".


Reading:

"
The MRRM solution is to tag physical address ranges with "region IDs"
so that platform firmware[1] can indicate the type of memory for each
range (with separate tags available for local vs. remote access to
each range).

The region IDs will be used to provide separate event counts for each
region for "perf" and for the "resctrl" file system to monitor and
control memory bandwidth in each region.

Users will need to know the address range(s) that are part of each
region."

A couple of questions:

a) How volatile is that information at runtime? Can ranges / IDs change?
   I read above that user space might in the future be able to
   reconfigure the ranges.

b) How is hotplug/unplug handled?

c) How are memory ranges not managed by Linux handled?

It might make sense to expose what you need in a more specialized, acpi/MRRM/perf specific form, and not as generic as you currently envision it.

--
Cheers,

David / dhildenb




