Re: [PATCH v2 3/3] mm: don't account memmap per node

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Thu, 8 Aug 2024 10:37:48 -0400



On Thu, Aug 8, 2024 at 1:36 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Aug 7, 2024 at 9:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> >
> > On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > > > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> > > >
> > > > Currently, when memory is hot-plugged or hot-removed the accounting is
> > > > done based on the assumption that memmap is allocated from the same node
> > > > as the hot-plugged/hot-removed memory, which is not always the case.
> > > >
> > > > In addition, there are challenges with keeping the node id of the memory
> > > > that is being remove to the time when memmap accounting is actually
> > > > performed: since this is done after remove_pfn_range_from_zone(), and
> > > > also after remove_memory_block_devices(). Meaning that we cannot use
> > > > pgdat nor walking though memblocks to get the nid.
> > > >
> > > > Given all of that, account the memmap overhead system wide instead.
> > >
> > > Hi Pasha,
> > >
> > > You've changed it to vm event mechanism. But I found a comment (below) say
> > > "CONFIG_VM_EVENT_COUNTERS". I do not know why it has such a rule
> > > sice 2006. Now the rule should be changed, is there any effect to users of
> > > /proc/vmstat?
> >
> > There should not be any effect on the users of the /proc/vmstat, the
> > values for nr_memap and nr_memmap_boot before and after are still in
> > /proc/vmstat under the same names.
> >
> > >
> > > /*
> > >  * Light weight per cpu counter implementation.
> > >  *
> > >  * Counters should only be incremented and no critical kernel component
> > >  * should rely on the counter values.
> > >  *
> > >  * Counters are handled completely inline. On many platforms the code
> > >  * generated will simply be the increment of a global address.
> >
> > Thank you for noticing this. Based on my digging, it looks like this
> > comment means that the increment only produces the most efficient code
> > on some architectures (i.e. i386, ia64):
> >
> > Here is the original commit message from 6/30/06:
> > f8891e5e1f93a1 [PATCH] Lightweight event counters
> >
> >  Relevant information:
> >   The implementation of these counters is through inline code that hopefully
> >   results in only a single instruction increment instruction being emitted
> >   (i386, x86_64) or in the increment being hidden though instruction
> >   concurrency (EPIC architectures such as ia64 can get that done).
> >
> > My patch does not change anything in other places where vm_events are
> > used, so it won't introduce performance regression anywhere. Memmap,
> > increment and decrement can happen based on the value of delta. I have
> > tested, and it works correctly. Perhaps we should update the comment.
>
> I think there may be a semantic inconsistency here.
>
> I am not so sure about this code, but for memcg stats, there is a
> semantic distinction between stat (or state) and event.
>
> Per-memcg events (which are a subset of NR_VM_EVENT_ITEMS) are
> basically counting the number of times a certain event happened (e.g.
> PGFAULT). This naturally cannot be decremented because the number of
> page faults that happened cannot decrease.