On Mon, Jan 16, 2023 at 10:51:40AM +0100, Christoph Lameter wrote: > On Wed, 11 Jan 2023, Marcelo Tosatti wrote: > > > OK, can replace this_cpu operations with this_cpu_ptr + standard C operators > > (and in fact can do that for interrupt disabled functions as well, that > > is CONFIG_HAVE_CMPXCHG_LOCAL not defined). > > > > Is that it? > > No that was hyperthetical. > > I do not know how to get out of this dilemma. We surely want to keep fast > vmstat operations working. Honestly, to me, there is no dilemma: * There is a requirement from applications to be uninterrupted by operating system activities. Examples include radio access network software, software defined PLCs for industrial automation (1). * There exists vm-statistics counters (which count the number of pages on different states, for example, number of free pages, locked pages, pages under writeback, pagetable pages, file pages, etc). To reduce number of accesses to the global counters, each CPU maintains its own delta relative to the global VM counters (which can be cached in the local processor cache, therefore fast). The per-CPU deltas are synchronized to global counters: 1) If the per-CPU counters exceed a given threshold. 2) Periodically, with a low frequency compared to CPU events (every number of seconds). 3) Upon an event that requires accurate counters. * The periodic synchronization interrupts a given CPU, in case it contains a counter delta relative to the global counters. To avoid this interruption, due to [1], the proposed patchset synchronizes any pending per-CPU deltas to global counters, for nohz_full= CPUs, when returning to userspace (which is a very fast path). Since return to userspace is a very fast path, synchronizing per-CPU counter deltas by reading their contents is undesired. Therefore a single bit is introduced to compact the following information: does this CPU contain any delta relative to the global counters that should be written back? This bit is set when a per-CPU delta is increased. This bit is cleared when the per-CPU deltas are written back to the global counters. Since for the following two operations: modify per-CPU delta (for current CPU) of counter X by Y set bit (for current CPU) indicating the per-CPU delta exists "current CPU" can change, it is necessary to disable CPU preemption when executing the pair of operations. vmstat operations still perform their main goal which is to maintain accesses local to the CPU when incrementing the counters (for most of counter modifications). The preempt_disable/enable pair is also a per-CPU variable. Now you are objecting to this patchset because: It increases the number of cycles to execute the function to modify the counters by 6. Can you mention any benchmark where this increase is significant? By searching for mod_zone_page_state/mode_node_page_state one can see the following: the codepaths that call them are touching multiple pages and other data structures, so the preempt_enable/preempt_disable pair should be a very small contribution (in terms of percentage) to any meaningful benchmark. > The fundamental issue that causes the vmstat discrepancies is likely that > the fast this_cpu ops can increment the counter on any random cpu and that > this is the reason you get vmstat discrepancies. Yes. > Give up the assumption that an increment of a this_cpu counter on a > specific cpu means that something occurred on that specific cpu. Maybe > that will get you on a path to resolve the issues you are seeing. But it can't. To be able to condense the information "does a delta exist on this CPU" from a number of cacheline reads to a single cacheline read, one bit can be used. And the write to that bit and to the counters is not atomic. Alternatives: 1) Disable periodic synchronization for nohz_full CPUs. 2) Processor instructions which can modify more than one address in memory. 3) Synchronize the per-CPU stats remotely (which increases per-CPU and per-node accesses).