Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

"Huang, Ying" <ying.huang@xxxxxxxxx> · Tue, 22 Oct 2024 14:36:56 +0800

MengEn Sun <mengensun88@xxxxxxxxx> writes:

> Thank you for your suggestion. I understand and am ready to make
> some changes  
>
>> 
>> Have verified the issue with some test?  If not, I suggest you to do
>> that.
>> 
>
> I have conducted tests:
> Applying this patch or not does not have a significant impact on the
> test results.

I don't expect some measurable performance difference with the patch.
If we can observe that the PCP size isn't tuned down to high_min before
and is after, that should be a valid test result to show the value of
the patch.  Can you try that?  The PCP size can be observed via
/proc/zoneinfo.

> perhaps my testing was not thorough enough. #^_^
>
> But, the logic of the code is like following:
> CPU0                              CPUx
> ----                              -----
>                                   T0: vmstat_work is pending
> T1: vmstat_shepherd
>     check vmstat_work
>     and do nothing
>                                   T2: vmstat_work is in unpending state.
>
>                                   T3: alloc many pages
>                                   T4: free all the pages allocated at T3
>                                   T5: entry NOHZ, flushing all zonestats
>                                       and nodestats
> T6: next vmstat_shepherd fired
>
> In my opinion, there are indeed some issues. I'm not sure if there's
> something I haven't understood?
>
>
> By the way,
> There are two other questions for me:
> Q1:
> Vmstat_work is a **deferreable work** So, It may be delayed for a long time
> by NOHZ. As a result, "vmstat_update() may not be executed once every
> second in the above scenario. Therefore, I'm not sure if using a deferrable
> work to reduce pcp->high is appropriate. In my tests, if I don't use
> deferrable work, it takes about a minute to reduce high to high_min, but
> using deferrable work may take several minutes to reduce high to high_min.

It's not a big issue to take longer time to decay pcp->high.

> Q2:
> On a big machine, for example, with 1TB of memory, the default maximum
> memory on PCP can be 1TB * 0.125.
> This portion of memory is not accounted for in MemFree in /proc/meminfo.
> Users can see this portion of memory from /proc/zoneinfo, but the memory
> reported by the `free` command is reduced.
> can we include the PCP memory in the MemFree statistic in /proc/meminfo?

This has been discussed before.

https://lore.kernel.org/linux-mm/20220816084426.135528-1-wangkefeng.wang@xxxxxxxxxx/
https://lore.kernel.org/linux-mm/20240830014453.3070909-1-mawupeng1@xxxxxxxxxx/

>> > While, This seems to be fine:
>> > - if freeing and allocating memory occur later, it may the
>> >   high_max may be adjust automatically
>> > - If memory is tight, the memory reclamation process will
>> >   release the pcp
>> 
>> This could be a real issue for me.
>
> Thanks, I will test more carefully for those issue
>
>> 
>> > Whatever, we make vmstat_shepherd to checking whether we need
>> > decay pcp high_max, and fire pcp_decay_high early if we need.
>> >
>> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
>> > Reviewed-by: Jinliang Zheng <alexjlzheng@xxxxxxxxxxx>
>> > Signed-off-by: MengEn Sun <mengensun@xxxxxxxxxxx>
>> > ---
>> > changelog:
>> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@xxxxxxxxxxxxxxxxxxxx/T/#t
>> > v2: Make the commit message clearer by adding some comments.
>> > ---
>> >  mm/vmstat.c | 9 +++++++++
>> >  1 file changed, 9 insertions(+)
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index 1917c034c045..07b494b06872 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>> >  
>> >  	for_each_populated_zone(zone) {
>> >  		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>> > +		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> >  		struct per_cpu_nodestat *n;
>> >  
>> > +		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
>> > +		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
>> > +		 * high_{min,max} to determine whether it is necessary to run
>> > +		 * decay_pcp_high on the corresponding CPU
>> > +		 */
>> 
>> Please follow the comments coding style.
>> 
>>                 /*
>>                  * comments line 1
>>                  * comments line 2
>>                  */
>> 
>
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>> > +		if (pcp->high_max > pcp->high_min)
>> > +			return true;
>> > +
>> 
>> We don't tune pcp->high_max/min in fact.  Instead, we tune pcp->high.
>> Your code may make need_update() return true in most cases.
>
> You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min?
>
>> 
>> >  		/*
>> >  		 * The fast way of checking if there are any vmstat diffs.
>> >  		 */

--
Best Regards,
Huang, Ying