On Wed, Aug 7, 2024 at 9:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
> >
> >
> > > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> > >
> > > Currently, when memory is hot-plugged or hot-removed, the accounting is
> > > done based on the assumption that the memmap is allocated from the same
> > > node as the hot-plugged/hot-removed memory, which is not always the case.
> > >
> > > In addition, there are challenges with keeping the node id of the memory
> > > that is being removed until the time when the memmap accounting is
> > > actually performed: it happens after remove_pfn_range_from_zone() and
> > > also after remove_memory_block_devices(), meaning that we can use
> > > neither pgdat nor a walk through memblocks to get the nid.
> > >
> > > Given all of that, account the memmap overhead system-wide instead.
> >
> > Hi Pasha,
> >
> > You've changed it to the vm event mechanism. But I found a comment
> > (below) under "CONFIG_VM_EVENT_COUNTERS". I do not know why it has had
> > such a rule since 2006. Now that the rule is being changed, is there any
> > effect on users of /proc/vmstat?
>
> There should not be any effect on the users of /proc/vmstat; the
> values for nr_memmap and nr_memmap_boot before and after are still in
> /proc/vmstat under the same names.
>
> >
> > /*
> >  * Light weight per cpu counter implementation.
> >  *
> >  * Counters should only be incremented and no critical kernel component
> >  * should rely on the counter values.
> >  *
> >  * Counters are handled completely inline. On many platforms the code
> >  * generated will simply be the increment of a global address.
>
> Thank you for noticing this. Based on my digging, it looks like this
> comment means that the increment only produces the most efficient code
> on some architectures (i.e.
> i386, ia64):
>
> Here is the original commit message from 6/30/06:
> f8891e5e1f93a1 [PATCH] Lightweight event counters
>
> Relevant information:
> The implementation of these counters is through inline code that
> hopefully results in only a single increment instruction being emitted
> (i386, x86_64), or in the increment being hidden through instruction
> concurrency (EPIC architectures such as ia64 can get that done).
>
> My patch does not change anything in other places where vm_events are
> used, so it won't introduce a performance regression anywhere. For
> memmap, an increment or a decrement can happen based on the value of
> delta. I have tested this, and it works correctly.

Perhaps we should update the comment.

I think there may be a semantic inconsistency here. I am not so sure
about this code, but for memcg stats, there is a semantic distinction
between stats (or state) and events. Per-memcg events (which are a
subset of NR_VM_EVENT_ITEMS) basically count the number of times a
certain event happened (e.g. PGFAULT). Such a counter naturally cannot
be decremented, because the number of page faults that have happened
cannot decrease. Per-memcg state represents the current state of the
system (e.g. NR_SWAPCACHE) and can naturally go up or down.

It seems like the code here follows the same semantics, and this change
breaks that.

Also, these stats now depend on CONFIG_VM_EVENT_COUNTERS. Looking at
NR_VMSTAT_ITEMS, it looks like it's composed of:

NR_VM_ZONE_STAT_ITEMS,
NR_VM_NUMA_EVENT_ITEMS,
NR_VM_NODE_STAT_ITEMS,
NR_VM_WRITEBACK_STAT_ITEMS,
NR_VM_EVENT_ITEMS (with CONFIG_VM_EVENT_COUNTERS)

Semantically, the memmap stats do not fit into any of the above
categories if we do not want them to be per-node. Maybe they should have
their own category like NR_VM_WRITEBACK_STAT_ITEMS, or maybe we should
consolidate both of them into a global stat items category (e.g.
NR_VM_STAT_ITEMS)?