Re: [PATCH] mm: introduce sysctl file to flush per-cpu vmstat statistics

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Thu, 03 Dec 2020 04:17:36 +0100

On Wed, Dec 02 2020 at 17:43, Christoph Lameter wrote:
> On Wed, 2 Dec 2020, Thomas Gleixner wrote:
>
>> prctl() is the right thing to do.
>
> Ok great consensus on that one.

That's the easy part :)

>> The current CPU isolation is a best effort approach and I agree that for
>> more strict isolation modes we need to be able to enforce that and hunt
>> down offenders and think about them one by one.
>
> There are actually two approaches to make the OS quiet. One is the best
> effort approach which is more like the current NOHZ one with additional
> actions to flush things. The other is the strict approach where one wants
> a guarantee that the OS does not do anything at all.

And here the consensus stops again :)

The point is that there is a huge difference between the relaxed best
effort / heuristics based scenario and the 'user space task asks for
absolute silence' scenario:

  Is this really a black and white decision?

  Definitely not. That would again be an imposed policy decision, which
  is wrong to begin with. We burnt ourselves with that over and over, so
  can we please, even if it's just for this particular problem, learn
  from history?

  The kernel provides mechanisms but does not impose policies unless
  there is no other choice.

  And as we know that there are quite some shades of grey, there is lots
  of choice and we need to come up with solutions for delegating the
  policy decision to the user/admin and not just provide an on/off knob.

This 'isolate either perhaps or everything' approach is just wrong. The
partisan thinking is obviously popular in the US, but it has no business
in making technically sensible decisions.

>> So you say some code can tolerate a few interrupts, then comes Alex and
>> says 'no disturbance' at all.
>
> Yes that is the current NOHZ approach.  You switch it on and after the OS
> detects a polling loop it will quiet things down. Instead of detecting
> it we are actively telling the OS to quiet down now.

Kinda. We want to provide mechanisms to quiet certain aspects of the OS
and to enable enforcement of that, but again, that's not on/off; it has
to be configurable / selectable.
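
To make "configurable / selectable" concrete, here is a minimal userspace
sketch of what a selection could look like as data rather than a boolean.
Everything in it is hypothetical: the QUIESCE_* bits, struct isol_config
and both helpers are made up for illustration and do not correspond to
any existing kernel or prctl() interface. It only assumes that quiescing
and enforcement are chosen per aspect, and that enforcing something which
was never quiesced makes no sense.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, NOT an existing kernel ABI. */
enum {
	QUIESCE_VMSTAT = 1u << 0, /* flush per-cpu vmstat before going quiet */
	QUIESCE_TICK   = 1u << 1, /* stop the periodic tick */
	QUIESCE_IPI    = 1u << 2, /* keep remote IPIs away (assumed aspect) */
};

/* Hypothetical per-task selection: what to quiet and what to enforce. */
struct isol_config {
	uint32_t quiesce; /* aspects the task wants quieted */
	uint32_t enforce; /* subset of 'quiesce' where a violation is an error */
};

/* A selection is only sane if every enforced aspect is also quiesced. */
static bool isol_config_valid(const struct isol_config *c)
{
	return (c->enforce & ~c->quiesce) == 0;
}

/* True if a disturbance from 'source' is tolerated, i.e. not enforced. */
static bool isol_tolerates(const struct isol_config *c, uint32_t source)
{
	return (c->enforce & source) == 0;
}
```

With quiesce = QUIESCE_VMSTAT | QUIESCE_TICK and enforce = QUIESCE_TICK, a
late vmstat update is tolerated while a tick is not. That is one of the
many shades of grey an on/off knob cannot express.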

Again: I fundamentally disagree with the approach of the proposed task
isolation patches as they leave no choice at all.

There is a reasonable middle ground where an application is willing to
pay the price (delay) until the requested quiescing has taken place in
order to run undisturbed (hint: cache ...) and also is willing to take
the additional overhead of an occasional syscall in the slow path without
tripping some OS imposed isolation safeguard.

Aside from that, such a granular approach does not necessarily require the
application to be aware of it. If the admin knows the computational
pattern of the application, e.g.

 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement.
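
The admin's judgement boils down to simple arithmetic: the time gained
in compute_set() has to outweigh the delay paid for quiescing. A rough
sketch of that check follows. The helper names and the numbers in the
note below are assumptions for illustration only, and the boost is
simplified to "fraction of compute time saved", expressed in units of
0.1% so that 0.9% is 9.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Time saved per iteration of compute_set().
 * boost_permille is in units of 0.1%, so a 0.9% boost is 9.
 */
static int64_t isol_gain_ns(int64_t compute_ns, int64_t boost_permille)
{
	return compute_ns * boost_permille / 1000;
}

/*
 * Isolation is worth inflicting on the task only if the gain in the
 * compute phase exceeds the delay paid while waiting for quiescing.
 */
static bool isol_pays_off(int64_t compute_ns, int64_t boost_permille,
			  int64_t quiesce_delay_ns)
{
	return isol_gain_ns(compute_ns, boost_permille) > quiesce_delay_ns;
}
```

For an assumed 10 second compute phase, a 0.9% boost saves 90 ms per
iteration, which easily covers an assumed 1 ms quiescing delay. For a
1 ms compute phase the same delay eats the gain many times over.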

That's what we need to think about, and once we have figured out how to
do that, it gives Marcelo the mechanism to solve his 'run virt
undisturbed by vmstat or whatever' problem and it allows Alex to build
his stuff on it.

Summary: The problem to be solved cannot be restricted to

    self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the last line of defence, policy
is defined by the user/admin via mechanisms provided by the kernel.

Emphasis on "mechanisms provided by the kernel", aka user API.

Just in case, I hope that I don't have to explain what level of scrutiny
and thought this requires.

Thanks,

        tglx