On Wed, Aug 7, 2024 at 9:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
> >
> >
> > > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> > >
> > > Currently, when memory is hot-plugged or hot-removed, the accounting is
> > > done based on the assumption that the memmap is allocated from the same
> > > node as the hot-plugged/hot-removed memory, which is not always the case.
> > >
> > > In addition, there are challenges with keeping the node id of the memory
> > > that is being removed until the time when the memmap accounting is
> > > actually performed: it happens after remove_pfn_range_from_zone() and
> > > also after remove_memory_block_devices(), meaning that we can use
> > > neither pgdat nor a walk through memblocks to get the nid.
> > >
> > > Given all of that, account the memmap overhead system-wide instead.
> >
> > Hi Pasha,
> >
> > You've changed it to the vm event mechanism. But I found a comment
> > (below) under "CONFIG_VM_EVENT_COUNTERS". I do not know why it has had
> > such a rule since 2006. Now that the rule is being changed, is there any
> > effect on users of /proc/vmstat?
>
> There should not be any effect on the users of /proc/vmstat; the
> values for nr_memmap and nr_memmap_boot before and after are still in
> /proc/vmstat under the same names.
>
> >
> > /*
> >  * Light weight per cpu counter implementation.
> >  *
> >  * Counters should only be incremented and no critical kernel component
> >  * should rely on the counter values.
> >  *
> >  * Counters are handled completely inline. On many platforms the code
> >  * generated will simply be the increment of a global address.
>
> Thank you for noticing this. Based on my digging, it looks like this
> comment means that the increment only produces the most efficient code
> on some architectures (i.e.
> i386, ia64):
>
> Here is the original commit message from 6/30/06:
> f8891e5e1f93a1 [PATCH] Lightweight event counters
>
> Relevant information:
> The implementation of these counters is through inline code that
> hopefully results in only a single increment instruction being emitted
> (i386, x86_64), or in the increment being hidden through instruction
> concurrency (EPIC architectures such as ia64 can get that done).
>
> My patch does not change anything in other places where vm_events are
> used, so it won't introduce a performance regression anywhere. For
> memmap, an increment or a decrement can happen based on the value of
> delta. I have tested this, and it works correctly.

Perhaps we should update the comment.

I think there may be a semantic inconsistency here. I am not so sure
about this code, but for memcg stats, there is a semantic distinction
between stats (or state) and events. Per-memcg events (which are a
subset of NR_VM_EVENT_ITEMS) basically count the number of times a
certain event happened (e.g. PGFAULT). Such a counter naturally cannot
be decremented, because the number of page faults that have happened
cannot decrease. Per-memcg state represents the current state of the
system (e.g. NR_SWAPCACHE) and can naturally go up or down.

It seems like the code here follows the same semantics, and this change
breaks that.

Also, these stats now depend on CONFIG_VM_EVENT_COUNTERS. Looking at
NR_VMSTAT_ITEMS, it looks like it's composed of:

NR_VM_ZONE_STAT_ITEMS,
NR_VM_NUMA_EVENT_ITEMS,
NR_VM_NODE_STAT_ITEMS,
NR_VM_WRITEBACK_STAT_ITEMS,
NR_VM_EVENT_ITEMS (with CONFIG_VM_EVENT_COUNTERS)

Semantically, the memmap stats do not fit into any of the above
categories if we do not want them to be per-node. Maybe they should have
their own category like NR_VM_WRITEBACK_STAT_ITEMS, or maybe we should
consolidate both of them into a global stat items category (e.g.
NR_VM_STAT_ITEMS)?