Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Mon, 16 Jan 2023 13:11:40 -0300

On Mon, Jan 16, 2023 at 10:51:40AM +0100, Christoph Lameter wrote:
> On Wed, 11 Jan 2023, Marcelo Tosatti wrote:
> 
> > OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
> > (and in fact can do that for interrupt disabled functions as well, that
> > is CONFIG_HAVE_CMPXCHG_LOCAL not defined).
> >
> > Is that it?
> 
> No that was hyperthetical.
> 
> I do not know how to get out of this dilemma. We surely want to keep fast
> vmstat operations working.

Honestly, to me, there is no dilemma:

* There is a requirement from applications to be uninterrupted
by operating system activities. Examples include radio access
network software, software defined PLCs for industrial automation (1).

* There exists vm-statistics counters (which count
the number of pages on different states, for example, number of 
free pages, locked pages, pages under writeback, pagetable pages,
file pages, etc). 
To reduce number of accesses to the global counters, each CPU maintains
its own delta relative to the global VM counters
(which can be cached in the local processor cache, therefore fast).

The per-CPU deltas are synchronized to global counters:
	1) If the per-CPU counters exceed a given threshold.
	2) Periodically, with a low frequency compared to 
	   CPU events (every number of seconds).
	3) Upon an event that requires accurate counters.

* The periodic synchronization interrupts a given CPU, in case
it contains a counter delta relative to the global counters.

To avoid this interruption, due to [1], the proposed patchset
synchronizes any pending per-CPU deltas to global counters,
for nohz_full= CPUs, when returning to userspace
(which is a very fast path).

Since return to userspace is a very fast path, synchronizing
per-CPU counter deltas by reading their contents is undesired.

Therefore a single bit is introduced to compact the following
information: does this CPU contain any delta relative to the
global counters that should be written back?

This bit is set when a per-CPU delta is increased.
This bit is cleared when the per-CPU deltas are written back to 
the global counters.

Since for the following two operations:

	modify per-CPU delta (for current CPU) of counter X by Y
	set bit (for current CPU) indicating the per-CPU delta exists

"current CPU" can change, it is necessary to disable CPU preemption
when executing the pair of operations.

vmstat operations still perform their main goal which is to 
maintain accesses local to the CPU when incrementing the counters
(for most of counter modifications).
The preempt_disable/enable pair is also a per-CPU variable.

Now you are objecting to this patchset because:

It increases the number of cycles to execute the function to modify 
the counters by 6. Can you mention any benchmark where this 
increase is significant?

By searching for mod_zone_page_state/mode_node_page_state one can see
the following: the codepaths that call them are touching multiple pages
and other data structures, so the preempt_enable/preempt_disable
pair should be a very small contribution (in terms of percentage) to any
meaningful benchmark.

> The fundamental issue that causes the vmstat discrepancies is likely that
> the fast this_cpu ops can increment the counter on any random cpu and that
> this is the reason you get vmstat discrepancies.

Yes. 

> Give up the assumption that an increment of a this_cpu counter on a
> specific cpu means that something occurred on that specific cpu. Maybe
> that will get you on a path to resolve the issues you are seeing.

But it can't. To be able to condense the information "does a delta exist
on this CPU" from a number of cacheline reads to a single cacheline
read, one bit can be used. And the write to that bit and to the counters
is not atomic.

Alternatives:
	1) Disable periodic synchronization for nohz_full CPUs.
	2) Processor instructions which can modify more than
	   one address in memory.
	3) Synchronize the per-CPU stats remotely (which
	   increases per-CPU and per-node accesses).