Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> writes: > On 04/07/2024 02:23, Huang, Ying wrote: >> Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> writes: >> >>> On 03/07/2024 08:57, Huang, Ying wrote: >>>> Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> writes: >>>> >>>>> On 03/07/2024 06:28, Huang, Ying wrote: >>>>>> Tvrtko Ursulin <tursulin@xxxxxxxxxx> writes: >>>>>> >>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> >>>>>>> >>>>>>> Since balancing mode was added in >>>>>>> bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes"), >>>>>>> it was possible to set this mode but it wouldn't be shown in >>>>>>> /proc/<pid>/numa_maps since there was no support for it in the >>>>>>> mpol_to_str() helper. >>>>>>> >>>>>>> Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it >>>>>>> would be displayed as 'default' due a workaround introduced a few years >>>>>>> earlier in >>>>>>> 8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps"). >>>>>>> >>>>>>> To tidy this up we implement two changes: >>>>>>> >>>>>>> First we introduce a new internal flag MPOL_F_KERNEL and with it mark the >>>>>>> kernel's internal default and fallback policies (for tasks and/or VMAs >>>>>>> with no explicit policy set). By doing this we generalise the current >>>>>>> special casing and replace the incorrect 'default' with the correct >>>>>>> 'bind'. >>>>>>> >>>>>>> Secondly, we add a string representation and corresponding handling for >>>>>>> MPOL_F_NUMA_BALANCING. We do this by adding a sparse mapping array of >>>>>>> flags to names. With the sparseness being the downside, but with the >>>>>>> advantage of generalising and removing the "policy" from flags display. >>>>>> Please split these 2 changes into 2 patches. Because we will need >>>>>> to >>>>>> back port the first one to -stable kernel. >>>>> >>>>> Why two? AFAICT there wasn't a issue until bda420b98505, and to fix it >>>>> all changes from this patch are needed. >>>> After bda420b98505, MPOL_BIND with MPOL_F_NUMA_BALANCING will be >>>> shown >>>> as "default", which is a bug. While it's a new feature to show >>>> "balancing". The first fix should be back-ported to -stable kernel >>>> after bda420b98505. While we don't need to do that for the second one. >>> >>> You lost me but it could be I am not at my best today so if you could >>> please explain more precisely what you mean? >>> >>> When bda420b98505 got in, it added MPOL_F_NUMA_BALANCING. But there >>> was no "balancing" in mpol_to_str(). That's one fix for bda420b98505. >> IMO, it's not a big issue to miss "balancing" in mpol_to_str(). >> It's >> not absolutely necessary to backport this part. >> >>> But also it did not change the pre-existing check for MPOL_F_MORON >>> added in 8790c71a18e5, many years before it, which was the thing >>> causing bind+balancing to be printed as default. So that's the second >>> part of the fix. But also AFAICS to tag as fixes bda420b98505. >>> >>> Making 8790c71a18e5 target of Fixes: does not IMO make sense though >>> because *at the time* of that patch it wasn't broken. What am I >>> missing? >> Yes, we should use "Fixes: bda420b98505 ..." for this part. This is >> a >> big issue, because "default" will be shown for MPOL_BIND, which is > > Not for MPOL_BIND, only when MPOL_F_NUMA_BALANCING is also set. Plain > bind works fine. > >> totally wrong. We need to backport this fix. It's good for backporting >> to keep it small and focused. > > I agree in principle fixes should be kept small but question is what > this is fixing. Is "fixing" bind+balancing to display as only "bind" > _the_ fix we want? Is it worth doing for 5.12 to 6.10, while only 6.11 > potentially to actually show the accurate information? Yes. This is my opinion. >>>>>>> End result: >>>>>>> >>>>>>> $ numactl -b -m 0-1,3 cat /proc/self/numa_maps >>>>>>> 555559580000 bind=balancing:0-1,3 file=/usr/bin/cat mapped=3 active=0 N0=3 kernelpagesize_kB=16 >>>>>>> ... >>>>>>> >>>>>>> v2: >>>>>>> * Fully fix by introducing MPOL_F_KERNEL. >>>>>>> >>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> >>>>>>> Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") >>>>>>> References: 8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps") >>>>>>> Cc: Huang Ying <ying.huang@xxxxxxxxx> >>>>>>> Cc: Mel Gorman <mgorman@xxxxxxx> >>>>>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> >>>>>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx> >>>>>>> Cc: Rik van Riel <riel@xxxxxxxxxxx> >>>>>>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> >>>>>>> Cc: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> >>>>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxx> >>>>>>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx> >>>>>>> Cc: Michal Hocko <mhocko@xxxxxxxx> >>>>>>> Cc: David Rientjes <rientjes@xxxxxxxxxx> >>>>>>> --- >>>>>>> include/uapi/linux/mempolicy.h | 1 + >>>>>>> mm/mempolicy.c | 44 ++++++++++++++++++++++++---------- >>>>>>> 2 files changed, 32 insertions(+), 13 deletions(-) >>>>>>> >>>>>>> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h >>>>>>> index 1f9bb10d1a47..bcf56ce9603b 100644 >>>>>>> --- a/include/uapi/linux/mempolicy.h >>>>>>> +++ b/include/uapi/linux/mempolicy.h >>>>>>> @@ -64,6 +64,7 @@ enum { >>>>>>> #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ >>>>>>> #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ >>>>>>> #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ >>>>>>> +#define MPOL_F_KERNEL (1 << 5) /* Kernel's internal policy */ >>>>>>> /* >>>>>>> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl >>>>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>>>>>> index aec756ae5637..8ecc6d9f100a 100644 >>>>>>> --- a/mm/mempolicy.c >>>>>>> +++ b/mm/mempolicy.c >>>>>>> @@ -134,6 +134,7 @@ enum zone_type policy_zone = 0; >>>>>>> static struct mempolicy default_policy = { >>>>>>> .refcnt = ATOMIC_INIT(1), /* never free it */ >>>>>>> .mode = MPOL_LOCAL, >>>>>>> + .flags = MPOL_F_KERNEL, >>>>>>> }; >>>>>>> static struct mempolicy preferred_node_policy[MAX_NUMNODES]; >>>>>>> @@ -3095,7 +3096,7 @@ void __init numa_policy_init(void) >>>>>>> preferred_node_policy[nid] = (struct mempolicy) { >>>>>>> .refcnt = ATOMIC_INIT(1), >>>>>>> .mode = MPOL_PREFERRED, >>>>>>> - .flags = MPOL_F_MOF | MPOL_F_MORON, >>>>>>> + .flags = MPOL_F_MOF | MPOL_F_MORON | MPOL_F_KERNEL, >>>>>>> .nodes = nodemask_of_node(nid), >>>>>>> }; >>>>>>> } >>>>>>> @@ -3150,6 +3151,12 @@ static const char * const policy_modes[] = >>>>>>> [MPOL_PREFERRED_MANY] = "prefer (many)", >>>>>>> }; >>>>>>> +static const char * const policy_flags[] = { >>>>>>> + [ilog2(MPOL_F_STATIC_NODES)] = "static", >>>>>>> + [ilog2(MPOL_F_RELATIVE_NODES)] = "relative", >>>>>>> + [ilog2(MPOL_F_NUMA_BALANCING)] = "balancing", >>>>>>> +}; >>>>>>> + >>>>>>> #ifdef CONFIG_TMPFS >>>>>>> /** >>>>>>> * mpol_parse_str - parse string to mempolicy, for tmpfs mpol mount option. >>>>>>> @@ -3293,17 +3300,18 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) >>>>>>> * @pol: pointer to mempolicy to be formatted >>>>>>> * >>>>>>> * Convert @pol into a string. If @buffer is too short, truncate the string. >>>>>>> - * Recommend a @maxlen of at least 32 for the longest mode, "interleave", the >>>>>>> - * longest flag, "relative", and to display at least a few node ids. >>>>>>> + * Recommend a @maxlen of at least 42 for the longest mode, "weighted >>>>>>> + * interleave", the longest flag, "balancing", and to display at least a few >>>>>>> + * node ids. >>>>>>> */ >>>>>>> void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) >>>>>>> { >>>>>>> char *p = buffer; >>>>>>> nodemask_t nodes = NODE_MASK_NONE; >>>>>>> unsigned short mode = MPOL_DEFAULT; >>>>>>> - unsigned short flags = 0; >>>>>>> + unsigned long flags = 0; >>>>>>> - if (pol && pol != &default_policy && !(pol->flags & >>>>>>> MPOL_F_MORON)) { >>>>>>> + if (!(pol->flags & MPOL_F_KERNEL)) { >>>>>> Can we avoid to introduce a new flag? Whether the following code >>>>>> work? >>>>>> if (pol && pol != &default_policy && !(pol->mode != >>>>>> MPOL_PREFERRED) && !(pol->flags & MPOL_F_MORON)) >>>>>> But I think that this is kind of fragile. A flag is better. But >>>>>> personally, I don't think MPOL_F_KERNEL is a good name, maybe >>>>>> MPOL_F_DEFAULT? >>>>> >>>>> I thought along the same lines, but as you have also shown we need to >>>>> exclude both default and preferred fallbacks so naming the flag >>>>> default did not feel best. MPOL_F_INTERNAL? MPOL_F_FALLBACK? >>>>> MPOL_F_SHOW_AS_DEFAULT? :)) >>>>> >>>>> What I dislike about the flag more is the fact internal flags are for >>>>> some reason in the uapi headers. And presumably we cannot zap them. >>>>> >>>>> But I don't think we can check for MPOL_PREFERRED since it can be a >>>>> legitimate user set policy. >>>> It's not legitimate (yet) to use MPOL_PREFERRED + >>>> MPOL_F_NUMA_BALANCING. >>>> >>>>> >>>>> We could check for the address of preferred_node_policy[] members with >>>>> a loop covering all possible nids? If that will be the consensus I am >>>>> happy to change it. But flag feels more elegant and robust. >>>> Yes. I think that this is doable. >>>> (unsigned long)addr >= (unsigned >>>> long)(preferred_node_policy) && \ >>>> (unsigned long)addr < (unsigned long)(preferred_node_policy) + \ >>>> sizeof(preferred_node_policy) >>> >>> Not the prettiest but at least in the spirit of the existing >>> &default_policy check. I can do that, no problem. If someone has a >>> different opinion please shout soon. >>> >>>>>>> mode = pol->mode; >>>>>>> flags = pol->flags; >>>>>>> } >>>>>>> @@ -3328,15 +3336,25 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) >>>>>>> p += snprintf(p, maxlen, "%s", policy_modes[mode]); >>>>>>> if (flags & MPOL_MODE_FLAGS) { >>>>>>> - p += snprintf(p, buffer + maxlen - p, "="); >>>>>>> + unsigned int bit, cnt = 0; >>>>>>> - /* >>>>>>> - * Currently, the only defined flags are mutually exclusive >>>>>>> - */ >>>>>>> - if (flags & MPOL_F_STATIC_NODES) >>>>>>> - p += snprintf(p, buffer + maxlen - p, "static"); >>>>>>> - else if (flags & MPOL_F_RELATIVE_NODES) >>>>>>> - p += snprintf(p, buffer + maxlen - p, "relative"); >>>>>>> + for_each_set_bit(bit, &flags, ARRAY_SIZE(policy_flags)) { >>>>>>> + if (bit <= ilog2(MPOL_F_KERNEL)) >>>>>>> + continue; >>>>>>> + >>>>>>> + if (cnt == 0) >>>>>>> + p += snprintf(p, buffer + maxlen - p, "="); >>>>>>> + else >>>>>>> + p += snprintf(p, buffer + maxlen - p, ","); >>>>>>> + >>>>>>> + if (WARN_ON_ONCE(!policy_flags[bit])) >>>>>>> + p += snprintf(p, buffer + maxlen - p, "bit%u", >>>>>>> + bit); >>>>>>> + else >>>>>>> + p += snprintf(p, buffer + maxlen - p, >>>>>>> + policy_flags[bit]); >>>>>>> + cnt++; >>>>>>> + } >>>>>> Please refer to commit 2291990ab36b ("mempolicy: clean-up >>>>>> mpol-to-str() >>>>>> mempolicy formatting") for the original format. >>>>> >>>>> That was in 2008 so long time ago and in the meantime there were no >>>>> bars. The format in this patch tries to align with the input format >>>>> and I think it manages, apart from deciding to print unknown flags as >>>>> bit numbers (which is most probably an irrelevant difference). Why do >>>>> you think the pre-2008 format is better? >>>> If you think that your format is better, please explain why you not >>>> use >>>> the original format in the patch description. You can also show >>>> examples to compare. >>> >>> Because there is no "old" format? If you refer to the one which ended >>> in 2008. Or if you refer to the one this patch replaces, then it is >>> effectively the same format for a single flag. And for multiple flags >>> before this patch that wasn't a possibility. So I am not sure what I >>> would include as a comparison. Broken "default" vs >>> "bind=balancing:0-1"? Am I missing something? >> In the old format (not in the old code), it is, >> bind=relative|balancing:0-1 >> while in your format, >> bind=relative,balancing:0-1 >> Please explain why you make the change. > > I can explain the format in the commit message more, if that what you > are suggesting? Thanks, please do that. > But as for "making the change" we may be talking past each other I > just don't know how. Again, could you please explain how am I changing > the format? Where is the old format? AFAICS it does not exist in > reality. The bar formatting ended in 2.6.26. OK. One choice is to follow the old bar format, another choice is to invent a new format. -- Best Regards, Huang, Ying