Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

Joshua Hahn <joshua.hahnjy@xxxxxxxxx> · Mon, 3 Mar 2025 13:56:36 -0800

On Thu, 27 Feb 2025 12:20:03 +0900 Honggyu Kim <honggyu.kim@xxxxxx> wrote:

Hi Honggyu, thank you for taking time to review my patch, as always!
I thought I had sent this, but it seems like it was left in my draft
without being sent. 

I will follow Gregory's advice and we will drop the patch from this series,
and send the first patch only (with Yunjeong's changes). Thanks again!

> 
> On 2/27/2025 11:32 AM, Honggyu Kim wrote:
> > Hi Joshua,
> > 
> > On 2/27/2025 6:35 AM, Joshua Hahn wrote:
> >> We should never try to allocate memory from a memoryless node. Creating a
> >> sysfs knob to control its weighted interleave weight does not make sense,
> >> and can be unsafe.
> >>
> >> Only create weighted interleave weight knobs for nodes with memory.
> >>
> >> Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> >> ---
> >>   mm/mempolicy.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >> index 4cc04ff8f12c..50cbb7c047fa 100644
> >> --- a/mm/mempolicy.c
> >> +++ b/mm/mempolicy.c
> >> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct 
> >> kobject *root_kobj)
> >>           return err;
> >>       }
> >> -    for_each_node_state(nid, N_POSSIBLE) {
> > 
> > Actually, we're aware of this issue and currently trying to fix this.
> > In our system, we've attached 4ch of CXL memory for each socket as
> > follows.
> > 
> >          node0             node1
> >        +-------+   UPI   +-------+
> >        | CPU 0 |-+-----+-| CPU 1 |
> >        +-------+         +-------+
> >        | DRAM0 |         | DRAM1 |
> >        +---+---+         +---+---+
> >            |                 |
> >        +---+---+         +---+---+
> >        | CXL 0 |         | CXL 4 |
> >        +---+---+         +---+---+
> >        | CXL 1 |         | CXL 5 |
> >        +---+---+         +---+---+
> >        | CXL 2 |         | CXL 6 |
> >        +---+---+         +---+---+
> >        | CXL 3 |         | CXL 7 |
> >        +---+---+         +---+---+
> >          node2             node3
> > 
> > The 4ch of CXL memory are detected as a single NUMA node in each socket,
> > but it shows as follows with the current N_POSSIBLE loop.
> > 
> > $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
> > node0 node1 node2 node3 node4 node5
> > node6 node7 node8 node9 node10 node11

I see. For my education, would you mind explaining how the numbering works
here? I am not very familiar with this setup, and not sure how you would
figure out what node is which, just by looking at the numbering.

> >> +    for_each_node_state(nid, N_MEMORY) {
> 
> Thinking it again, we can leave it as a separate patch but add our patch 
> on top of it.

That sounds good to me. 

> The only concern I have is having only N_MEMORY patch hides weight
> setting knobs for CXL memory and it makes there is no way to set weight 
> values to CXL memory in my system.

You can use weighted interleave auto-tuning : -)
In all seriousness, this makes sense. It seems pretty problematic that
the knobs aren't created for the CXL channels, and I'm not sure that hiding
it is the correct approach here (it was not my intent, either).

> IMHO, this and our patch is better to be submitted together.

That sounds good. We can hold off on this patch then, and just consider
the first patch of this series. Thank you for letting me know!

Thank you for always reviewing my patches. Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)