Re: [PATCH -V10 RESEND 2/6] NUMA balancing: optimize page placement for memory tiering system

"Huang, Ying" <ying.huang@xxxxxxxxx> · Wed, 08 Dec 2021 11:16:28 +0800

Hasan Al Maruf <hasan3050@xxxxxxxxx> writes:

> Hi Huang,
>
>>+void set_numabalancing_state(bool enabled)
>>+{
>>+	if (enabled)
>>+		sysctl_numa_balancing_mode = NUMA_BALANCING_NORMAL;
>>+	else
>>+		sysctl_numa_balancing_mode = NUMA_BALANCING_DISABLED;
>>+	__set_numabalancing_state(enabled);
>>+}
>>+
>
> One of the properties of optimized NUMA Balancing for tiered memory is we
> are not going to scan top-tier nodes as promotion doesn't make sense there
> (implemented in the next patch [3/6]). However, if a system has only
> single memory node with CPU, does it make sense to run
> `NUMA_BALANCING_NORMAL` mode there? What do you think about downgrading to
> `NUMA_BALANCING_MEMORY_TIERING` mode if a user setup NUMA Balancing on
> the default mode of `NUMA_BALANCING_NORMAL` on a single toptier memory
> node?

Consider a system with only 1 NUMA node and no PMEM, should we refuse
NUMA balancing to be enabled at all?

Per my understanding, the philosophy behind is to keep thing as small as
possible instead of as smart as possible.  Do you agree?

>>diff --git a/mm/vmscan.c b/mm/vmscan.c
>>index c266e64d2f7e..5edb5dfa8900 100644
>>--- a/mm/vmscan.c
>>+++ b/mm/vmscan.c
>>@@ -56,6 +56,7 @@
>>
>> #include <linux/swapops.h>
>> #include <linux/balloon_compaction.h>
>>+#include <linux/sched/sysctl.h>
>>
>> #include "internal.h"
>>
>>@@ -3919,6 +3920,12 @@ static bool pgdat_watermark_boosted(pg_data_t *pgdat, int highest_zoneidx)
>> 	return false;
>> }
>>
>>+/*
>>+ * Keep the free pages on fast memory node a little more than the high
>>+ * watermark to accommodate the promoted pages.
>>+ */
>>+#define NUMA_BALANCING_PROMOTE_WATERMARK	(10UL * 1024 * 1024 >> PAGE_SHIFT)
>>+
>> /*
>>  * Returns true if there is an eligible zone balanced for the request order
>>  * and highest_zoneidx
>>@@ -3940,6 +3947,15 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>> 			continue;
>>
>> 		mark = high_wmark_pages(zone);
>>+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
>>+		    numa_demotion_enabled &&
>>+		    next_demotion_node(pgdat->node_id) != NUMA_NO_NODE) {
>>+			unsigned long promote_mark;
>>+
>>+			promote_mark = min(NUMA_BALANCING_PROMOTE_WATERMARK,
>>+					   pgdat->node_present_pages >> 6);
>>+			mark += promote_mark;
>>+		}
>> 		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>> 			return true;
>> 	}
>
> This can be moved to a different patch. I think, this patch [2/6] can be
> splitted into two basic patches -- 1. NUMA Balancing interface for tiered
> memory and 2. maintaining a headroom for promotion.

Johannes has taught me that, if we introduce a new function, variable,
or interface, it's better to introduce its user together.  So that we
can determine whether it's necessary to do that, whether the definition
is suitable, etc.  I think that makes sense.  So I try to do that in
this patchset too.

As in [2/5] of your patchset below, another possibility is to make
1. NUMA balancing interface for tiered memory and 2. skip scanning top
tier memory in NUMA balancing in one patch.  One concern is that
although this is an optimization, there's almost no measurable
performance difference.  This makes it hard to justify to extend the
user space interface.  Do you have better data to support this?

> Instead of having a static value for `NUMA_BALANCING_PROMOTE_WATERMARK`
> what about decoupling the allocation and reclamation and add a user-space
> interface for controling them?

This means to add a new user space ABI.  Because we may need to support
the new ABI forever, we should have strong justification to add it.
I am not against to add an ABI to adjust promotion watermark in
general.  I think that the path could be,

- Have a simplest solution that works without introducing new ABI, like
  something in this patch, or revised.

- Then try to add a new ABI in a separate patch with enough
  justification, for example, with much improved performance data.

Do you agree?

> Do you think patch [2/5] and [3/5] of this series can be merged to your
> current patchset?
>
> https://lore.kernel.org/all/cover.1637778851.git.hasanalmaruf@xxxxxx/

Best Regards,
Huang, Ying