Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Wed, 19 Apr 2023 13:35:12 -0300

Hi Michal,

On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> [...]
> > > 2) Run critical code
> > > 3) Optionally do something once you're done
> > > 
> > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > clueless about those matters), 
> > 
> > I am mostly clueless too, but i don't see a problem with the proposed
> > patch (and no one has pointed any problem either).
> 
> I really hate to repeat myself again. The biggest pushback has been on
> a) justification and b) single purpose solution which is very likely
> incomplete. For a) we are getting the story piece by piece which doesn't
> speed up the process. You are proposing a non-trivial change to an
> already convoluted code so having a solid justification is something
> that shouldn't be all that surprising.

The justification is simple and concise:

 2. With a task that busy loops on a given CPU,
    the kworker interruption to execute vmstat_update
    is undesired and may exceed latency thresholds
    for certain applications.

Performance details for the kworker interruption:   

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

       	oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable 
latency threshold for certain applications.

---

An additional use-case is what has been noted by Andrew Theurer:

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
as measured by cyclictest & oslat.  We cannot achieve under 20 usec with 
the vmstats interruption.

---

It seems to me this is solid justification (it seems you want 
particular microsecond values, but those depend on application
and the CPU). The point is there are several applications which do not
want to be interrupted (we can ignore the specifics and focus on
that fact).

Moreover, unrelated interruptions might occur close in time
(for example, random function call IPIs generated by other
kernel subsystems), which renders the "lets just consider this 
one application, running on this CPU, to decide what to do" 
short sighted.

> b) is what concerns me more though. There are other per-cpu specific
> things going on that require some regular flushing. Just to mention
> another one that your group has been brought up was the memcg pcp
> caches. Again with a non-trivial proposal to deal with that problem
> [1]. 

Yes.

> It has turned out that we can do a simpler thing [2]. 

For the particular memcg case, there was a simpler fix.

For the vmstat_update case, i don't see a simpler fix. 

> I do not think it is a stretch to expect that similar things will pop
> out every now and then

Agree.

> and rather than dealing with each one in its own way it
> kinda makes sense to come up with a more general concept so that all
> those cases can be handled at a single place at least. 

I can understand where you are coming from. Unfortunately,
for some cases it is appropriate to perform the work from a
remote CPU (and i think this is one such case).

> All I hear about
> that is that the code of those special applications would need to be
> changed to use that. 

This is a burden for application writers and for system configuration.

Or it could be done automatically (from outside of the application).
Which is what is described and implemented here:

https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/

"Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose."

But not only that, the second thing is:

"> Another important point is this: if an application dirties                                                                          
> its own per-CPU vmstat cache, while performing a system call,                                                                       

Or while handling a VM-exit from a vCPU.

This are, in my mind, sufficient reasons to discard the "flush per-cpu
caches" idea. This is also why i chose to abandon the prctrl interface
patchset.

> and a vmstat sync event is triggered on a different CPU, you'd have to:                                                             
>                                                                                                                                     
> 1) Wait for that CPU to return to userspace and sync its stats                                                                      
> (unfeasible).                                                                                                                       
>                                                                                                                                     
> 2) Queue work to execute on that CPU (undesirable, as that causes                                                                   
> an interruption).                                                                                                                   
>                                                                                                                                     
> 3) Remotely sync the vmstat for that CPU."

So the only option is to remotely sync vmstat for the CPU
(unless you have a better suggestion).

> Well, true but is that bar so impractical that we
> are going to grow kernel complexity and therefore a maintenance burden?

Honestly, this patchset is just using cmpxchg to transfer data from
per-CPU counters to global counters. I don't see why its that 
complicated.

If you have a suggestion on how to reduce the apparent complexity,
that would be great.

> Everything for a very specialized workloads?

Well the kernel has been increasing in complexity, and the maintenance
burden has increased as a side-effect, to accomodate more workloads
than it was initially designed for.

> [1] http://lkml.kernel.org/r/20221102020243.522358-1-leobras@xxxxxxxxxx
> [2] http://lkml.kernel.org/r/20230317134448.11082-1-mhocko@xxxxxxxxxx
> -- 
> Michal Hocko
> SUSE Labs
> 
>