On Thu, Aug 8, 2024 at 1:36 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > > On Wed, Aug 7, 2024 at 9:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote: > > > > On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote: > > > > > > > > > > > > > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote: > > > > > > > > Currently, when memory is hot-plugged or hot-removed the accounting is > > > > done based on the assumption that memmap is allocated from the same node > > > > as the hot-plugged/hot-removed memory, which is not always the case. > > > > > > > > In addition, there are challenges with keeping the node id of the memory > > > > that is being remove to the time when memmap accounting is actually > > > > performed: since this is done after remove_pfn_range_from_zone(), and > > > > also after remove_memory_block_devices(). Meaning that we cannot use > > > > pgdat nor walking though memblocks to get the nid. > > > > > > > > Given all of that, account the memmap overhead system wide instead. > > > > > > Hi Pasha, > > > > > > You've changed it to vm event mechanism. But I found a comment (below) say > > > "CONFIG_VM_EVENT_COUNTERS". I do not know why it has such a rule > > > sice 2006. Now the rule should be changed, is there any effect to users of > > > /proc/vmstat? > > > > There should not be any effect on the users of the /proc/vmstat, the > > values for nr_memap and nr_memmap_boot before and after are still in > > /proc/vmstat under the same names. > > > > > > > > /* > > > * Light weight per cpu counter implementation. > > > * > > > * Counters should only be incremented and no critical kernel component > > > * should rely on the counter values. > > > * > > > * Counters are handled completely inline. On many platforms the code > > > * generated will simply be the increment of a global address. > > > > Thank you for noticing this. Based on my digging, it looks like this > > comment means that the increment only produces the most efficient code > > on some architectures (i.e. i386, ia64): > > > > Here is the original commit message from 6/30/06: > > f8891e5e1f93a1 [PATCH] Lightweight event counters > > > > Relevant information: > > The implementation of these counters is through inline code that hopefully > > results in only a single instruction increment instruction being emitted > > (i386, x86_64) or in the increment being hidden though instruction > > concurrency (EPIC architectures such as ia64 can get that done). > > > > My patch does not change anything in other places where vm_events are > > used, so it won't introduce performance regression anywhere. Memmap, > > increment and decrement can happen based on the value of delta. I have > > tested, and it works correctly. Perhaps we should update the comment. > > I think there may be a semantic inconsistency here. > > I am not so sure about this code, but for memcg stats, there is a > semantic distinction between stat (or state) and event. > > Per-memcg events (which are a subset of NR_VM_EVENT_ITEMS) are > basically counting the number of times a certain event happened (e.g. > PGFAULT). This naturally cannot be decremented because the number of > page faults that happened cannot decrease.