Re: [EXT] Re: [PATCH] mm: introduce sysctl file to flush per-cpu vmstat statistics

Alex Belits <abelits@xxxxxxxxxxx> · Fri, 4 Dec 2020 01:43:28 +0000

On Wed, 2020-12-02 at 16:57 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 09:31, Christoph Lameter wrote:
> > On Fri, 27 Nov 2020, Marcelo Tosatti wrote:
> > 
> > > Decided to switch to prctl interface, and then it starts
> > > to become similar to "task mode isolation" patchset API.
> > 
> > Right I think that was a good approach.
> 
> prctl() is the right thing to do.
> 
> > > In addition to quiescing pending activities on the CPU, it would
> > > also be useful to assign a per-task attribute (which is then
> > > assigned
> > > to a per-CPU attribute), indicating whether that CPU is running
> > > an isolated task or not.
> > 
> > Sounds good but what would this do? Give a warning like the
> > isolation
> > patchset?
> 

Isolation patch right now produces log warnings because there is no
application interface for it yet. I am trying to make something
suitable for applications, so they can react to it, collect statistics
and produce human-readable logs. It is still not entirely clear, what
should be included and in which cases (other than isolation breaking)
monitored events should be recorded, so for now and for the purposes of
task isolation alone I have kept ugly-looking messages.

Two things that I want to add are:

1. "Cause" that describes what kind of event happened (generic IPI,
specific call, exception, page fault, timer, interrupt, syscall entry)
with some relevant data. Or default "we have entered kernel and now
exiting, however we have no idea what exactly was called in between"
cause, to catch everything that is not tracked yet. The "relevant data"
part is what can make this feature useful (handlers of specific events
always know what they are), however it should be chosen carefully.

2. Optional "remote cause" that describes why did someone sent an IPI
to this CPU (to be picked when IPI is being processed). This may
include something like description like a normal cause, or a backtrace.
If we know that we are about to kick a task out of isolation, we can
just as well include an explanation why.

> This all needs a lot more thought about the overall picture. We
> already
> have too many knobs and ad hoc hooks which fiddle with isolation.
> 
> The current CPU isolation is a best effort approach and I agree that
> for
> more strict isolation modes we need to be able to enforce that and
> hunt
> down offenders and think about them one by one.
> 
> > > To be called before real time loop, one would have:
> 
> Can we please agree in the first place, that "real time" is
> absolutely
> the wrong term here?

We can call it "isolated mode" or "isolated user mode", so it would be
clear that it is the opposite to anything related to interrupts and
preemption.

> It's about running undisturbed CPU bound computations whatever nature
> they are. It does not matter whether that loop does busy polling ala
> DPDK, whether it runs a huge math computation on a data set or
> whatever people come up with.
> 
> > > 	prctl(PR_SET_TASK_ISOLATION, ISOLATION_ENABLE) [1]
> > > 	real time loop
> > > 	prctl(PR_SET_TASK_ISOLATION, ISOLATION_DISABLE)
> > > 
> > > (with the attribute also being cleared on task exit).
> > > 
> > > The general description would be:
> > > 
> > > "Set task isolated mode for a given task, returning an error
> > > if the task is not pinned to a single CPU.
> 
> Plus returning an error if the task has no permissions to request
> this. This should not be an unprivileged prctl ever.

If we want a consistent interface, it can also return corresponding
error if the task has permissions but can't because, say, there are
multiple tasks all bound to the same CPU.

> > > In this mode, the kernel will avoid interruptions to isolated
> > > CPUs when possible."
> > > 
> > > Any objections against such an interface ?
> > 
> > Maybe do both like in the isolation patchset?
> 
> We really want to define the scopes first. And here you go:
> 
> > Often code can tolerate a few interruptions (in some code branches
> > regular syscalls may be needed) but one wants the thread to be
> > as quiet as possible.
> 
> So you say some code can tolerate a few interrupts, then comes Alex
> and
> says 'no disturbance' at all.

I agree. I would prefer a shared interface to closely related
functionality even if parts of implementation vary, or may not even be
present / enabled at the same time.

> 
> The point is that all of this shares the mechanisms to quiesce
> certain
> parts of the kernel so this wants to build common infrastructure and
> the
> prctl(ISOLATION, MODE) mode argument defines the scope of isolation
> which the task asks for and the infrastructure decides whether it can
> be
> granted and if so orchestrates the operation and provides a common
> infrastructure for instrumentation, violation monitoring etc.
> 
> We really need to stop to look at particular workloads and defining
> adhoc solutions tailored to their particular itch if we don't want to
> end up with an uncoordinated and unmaintainable zoo of interfaces,
> hooks
> and knobs.
> 
> Just looking at the problem at hand as an example. NOHZ already
> issues
> quiet_vmstat(), but it does not cancel already scheduled work. Now
> Marcelo wants a new mechanism which is supposed to cancel the work
> and
> then Alex want's to prevent it from being rescheduled. If that's not
> properly coordinated this goes down the drain very fast.
> 
> So can we please come up with a central place to handle this prctl()
> with a future proof argument list so the various isolation needs can
> be
> expressed as required?
> 
> That allows Marcelo to start tackling the vmstat side and Alex can
> utilize that and build the other parts into it piece by piece.

Right. I think, we should choose, which parameters should be handled as
flags and which as mutually exclusive modes, if any. For example,
enabling diagnostics / recording causes can be a flag, so if we really
want, we can collect interrupt logging while our task is running even
without isolation (but then there would be a large amount of noise
there). Maybe someone would want to do that on tasks that it
ptrace()'s?

Recording may have filtering, for example, "remote cause" may be
disabled, or some class of events, such as page faults, ignored or
minimally recorded only with cause type.

Since the original interface uses signals, maybe we can keep signal
number as a common feature for notification of isolation breaking or
exceeding some parameters if such a thing applies.

We may want to set a separate flag to enable automatic re-entering
isolation when broken and notification by signal. For some isolated
tasks either of those features can be useful or not. "Best effort"
modes where interrupts are expected, would by default remain active
until turned off, and with no signals.

Since there is a great variety of those, and more may be added, we
probably should have a "query" call that will return a mask of what is
supported on the system or available to the given task.

-- 
Alex