Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

Hi Gregory,

On 3/5/2025 1:29 AM, Gregory Price wrote:
On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
Actually, we're aware of this issue and currently trying to fix this.
In our system, we've attached 4ch of CXL memory for each socket as
follows.

         node0             node1
       +-------+   UPI   +-------+
       | CPU 0 |-+-----+-| CPU 1 |
       +-------+         +-------+
       | DRAM0 |         | DRAM1 |
       +---+---+         +---+---+
           |                 |
       +---+---+         +---+---+
       | CXL 0 |         | CXL 4 |
       +---+---+         +---+---+
       | CXL 1 |         | CXL 5 |
       +---+---+         +---+---+
       | CXL 2 |         | CXL 6 |
       +---+---+         +---+---+
       | CXL 3 |         | CXL 7 |
       +---+---+         +---+---+
         node2             node3

The 4ch of CXL memory are detected as a single NUMA node in each socket,
but it shows as follows with the current N_POSSIBLE loop.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1 node2 node3 node4 node5
node6 node7 node8 node9 node10 node11

This is insufficient information for me to assess the correctness of the
configuration. Can you please show the contents of your CEDT/CFMWS and
SRAT/Memory Affinity structures?

mkdir acpi_data && cd acpi_data
acpidump -b
iasl -d *
cat cedt.dsl  <- find all CFMWS entries
cat srat.dsl  <- find all Memory Affinity entries

I'm not able to provide all the details as srat.dsl has too much info.

  $ wc -l srat.dsl
  25229 srat.dsl

Instead, I can show you that there are 4 different proximity domains
with "Enabled : 1", using the following filtered output from srat.dsl.

$ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 'N;s/\n//' | sort | uniq
         Enabled : 0       Enabled : 0
  Proximity Domain : 00000000       Enabled : 0
  Proximity Domain : 00000000       Enabled : 1
  Proximity Domain : 00000001       Enabled : 1
  Proximity Domain : 00000006       Enabled : 1
  Proximity Domain : 00000007       Enabled : 1

We don't actually have to use those complicated commands to check this
as dmesg clearly prints the SRAT and node numbers as follows.

  [    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
  [    0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
  [    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
  [    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
  [    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
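
For reference, the node-to-PXM mapping can be pulled out of these messages
with a small filter. This is only a sketch: srat_node_pxm is a made-up
helper name, and it assumes the "ACPI: SRAT: Node N PXM M [mem ...]" line
format shown above.

```shell
# Hypothetical helper: read dmesg-style lines on stdin and print each
# distinct "node pxm" pair once, in the order it first appears.
srat_node_pxm() {
    # Keep only the SRAT lines and reduce them to "<node> <pxm>".
    sed -n 's/.*ACPI: SRAT: Node \([0-9][0-9]*\) PXM \([0-9][0-9]*\) .*/\1 \2/p' |
        awk '!seen[$0]++'   # drop repeated pairs (one node may own several ranges)
}
```

Running "dmesg | srat_node_pxm" against the five lines above should give the
pairs "0 0", "1 1", "2 6" and "3 7".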

The memoryless nodes are printed as follows, right after those
"ACPI: SRAT: Node N PXM M" messages.

  [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
  [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
  [    0.010992] Initmem setup node 2 as memoryless
  [    0.011055] Initmem setup node 3 as memoryless
  [    0.011115] Initmem setup node 4 as memoryless
  [    0.011177] Initmem setup node 5 as memoryless
  [    0.011238] Initmem setup node 6 as memoryless
  [    0.011299] Initmem setup node 7 as memoryless
  [    0.011361] Initmem setup node 8 as memoryless
  [    0.011422] Initmem setup node 9 as memoryless
  [    0.011484] Initmem setup node 10 as memoryless
  [    0.011544] Initmem setup node 11 as memoryless

This is why the current N_POSSIBLE loop creates sysfs knobs for all 12
nodes, including the memoryless ones.
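
The gap between possible nodes and nodes that currently have memory can
also be checked from userspace. The following is a rough sketch, not part
of the patch: expand_nodelist and nodes_only_in_first are made-up helper
names, and it assumes the node lists use the usual "0-1,4" range format of
/sys/devices/system/node/possible and /sys/devices/system/node/has_memory.

```shell
# Hypothetical helper: expand a node list such as "0-1,4" into one id per line.
expand_nodelist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue      # skip empty entries (e.g. an empty list)
        seq "$lo" "${hi:-$lo}"        # a single id "4" expands to just 4
    done
}

# Hypothetical helper: print the ids that are in the first list but not
# in the second, one per line.
nodes_only_in_first() {
    a=$(expand_nodelist "$1")
    b=$(expand_nodelist "$2")
    for n in $a; do
        case " $(echo $b) " in
            *" $n "*) ;;              # present in both lists, skip it
            *) echo "$n" ;;           # only in the first list
        esac
    done
}
```

For example:

  $ nodes_only_in_first "$(cat /sys/devices/system/node/possible)" \
                        "$(cat /sys/devices/system/node/has_memory)"

lists the possible nodes that have no memory at the moment, which are the
ones that should not get a weight knob.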


Basically I need to know:
1) Is each CXL device on a dedicated Host Bridge?
2) Is inter-host-bridge interleaving configured?
3) Is intra-host-bridge interleaving configured?
4) Do SRAT entries exist for all nodes?

Are there some simple commands I can use to get that info?

5) Why are there 12 nodes but only 10 sources? Are there additional
    devices left out of your diagram? Are there 2 CFMWS and 8 Memory
    Affinity records - resulting in 10 nodes? This is strange.

My blind guess is that there could be a logical node that combines the 4ch
of CXL memory, so there are 5 nodes per socket.  Adding 2 nodes for
local CPU/DRAM makes 12 nodes in total.


By default, Linux creates a node for each proximity domain ("PXM")
detected in the SRAT Memory Affinity tables. If SRAT entries for a
memory region described in a CFMWS are absent, it will also create a
node for that CFMWS.

Your reported configuration and results lead me to believe you have
a combination of CFMWS/SRAT configurations that are unexpected.

~Gregory

Not sure about this part, but our approach with hotplug_memory_notifier()
resolves this problem. Rakie will submit an initial working patchset soonish.

Thanks,
Honggyu



