Re: [PATCH v2 3/3] mm: don't account memmap per node

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Thu, 8 Aug 2024 00:04:00 -0400

On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
>
>
>
> > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> >
> > Currently, when memory is hot-plugged or hot-removed the accounting is
> > done based on the assumption that memmap is allocated from the same node
> > as the hot-plugged/hot-removed memory, which is not always the case.
> >
> > In addition, there are challenges with keeping the node id of the memory
> > that is being remove to the time when memmap accounting is actually
> > performed: since this is done after remove_pfn_range_from_zone(), and
> > also after remove_memory_block_devices(). Meaning that we cannot use
> > pgdat nor walking though memblocks to get the nid.
> >
> > Given all of that, account the memmap overhead system wide instead.
>
> Hi Pasha,
>
> You've changed it to vm event mechanism. But I found a comment (below) say
> "CONFIG_VM_EVENT_COUNTERS". I do not know why it has such a rule
> sice 2006. Now the rule should be changed, is there any effect to users of
> /proc/vmstat?

There should not be any effect on the users of the /proc/vmstat, the
values for nr_memap and nr_memmap_boot before and after are still in
/proc/vmstat under the same names.

>
> /*
>  * Light weight per cpu counter implementation.
>  *
>  * Counters should only be incremented and no critical kernel component
>  * should rely on the counter values.
>  *
>  * Counters are handled completely inline. On many platforms the code
>  * generated will simply be the increment of a global address.

Thank you for noticing this. Based on my digging, it looks like this
comment means that the increment only produces the most efficient code
on some architectures (i.e. i386, ia64):

Here is the original commit message from 6/30/06:
f8891e5e1f93a1 [PATCH] Lightweight event counters

 Relevant information:
  The implementation of these counters is through inline code that hopefully
  results in only a single instruction increment instruction being emitted
  (i386, x86_64) or in the increment being hidden though instruction
  concurrency (EPIC architectures such as ia64 can get that done).

My patch does not change anything in other places where vm_events are
used, so it won't introduce performance regression anywhere. Memmap,
increment and decrement can happen based on the value of delta. I have
tested, and it works correctly. Perhaps we should update the comment.

Pasha