Re: [EXT] Re: [PATCH] mm: introduce sysctl file to flush per-cpu vmstat statistics

On Mon, 2020-11-30 at 15:18 -0300, Marcelo Tosatti wrote:

> Two questions:
> 
> What is the reason for not allowing _any_ interruption? Because 
> it might be the case (and it is in the vRAN use cases our customers
> target), that some amount of interruption is tolerable.
> 
> For example, for one use case 20us (of interruptions) every 1ms of 
> elapsed time is tolerable.

I agree that for many applications that would be sufficient, and if so,
we can avoid the rest of the mechanism. However, I know of three types
of applications where full isolation is absolutely necessary.

1. Physical process control. A program continuously polls data from
sensors, processes it, determines the state of the controlled device or
process, and issues commands or tunes parameters to control it. An
unpredictable delay caused by an interrupt or other
performance-affecting event can break the timing of this mechanism.

Even taking into account that safety requirements usually demand that
some low-latency reactions be performed without CPU involvement, some
of the control logic may still require a CPU, usually within a closed
control loop. For example, tracking the state of an RF connection and
adjusting the signal, or analyzing measurements and controlling a
mechanical system, including elasticity and vibration in industrial
and transportation applications, all require a large amount of
calculation and benefit from reduced buffering and delays. In the case
of RF this makes it possible to maintain more reliable connections.
Mechanical systems can achieve greater precision and lower stress on
their components, because such a system can easily implement damping
and compensation for deformation and deterioration.

Usually when such processing is necessary, the "fast" part of it is
performed on a microcontroller, a separate CPU core built into an
ASIC, or even a "soft CPU" implemented on an FPGA, while the "slow"
calculations are performed by a large CPU or SoC. This, however,
places severe limitations on the software design and requires
engineers to waste "expensive" resources, such as including PCIe links
in microcontrollers, building ASICs, or implementing soft CPU cores on
an FPGA, while the CPU cores of modern CPUs and SoCs remain
under-used. Sometimes a large processing mechanism is built out of the
DSP blocks of an FPGA and its built-in static RAM, while the more
capable SIMD units of a CPU sit idle.

2. Audio processing. A program implements a set of filters and mixers,
processing multiple audio streams in a live performance. Processing
should be synchronized not only between individual microphones,
speakers and processors but, when used as part of a sound
reinforcement system, also with the propagation of sound through the
environment, which bypasses the audio system yet is subject to the
acoustics of the building. The amount of data is large, but the
calculations are always the same, so it is possible to tune all delays
to match the projected or measured timing. The program runs at the
sample rate of the signals, with the processing algorithms already
introducing delays, so any additional buffering would push the total
delay outside the acceptable range.

This can be seen as a subset of controlling a mechanical system,
except that the control loop, if present, is usually much slower (such
as adapting to changing acoustical properties), while the processing
of incoming data involves very large amounts of calculation that have
to be performed on a full-featured CPU. The total amount of data that
a CPU core needs at any given time can usually fit into its cache, and
if sufficient cache allocation or partitioning is available, all
necessary calculations (digital filters, mixing, etc.) can be
performed reliably with minimal latency. Interrupts break this model
by causing unpredictable cache invalidation, so the design has to take
into account not just the time spent processing the interrupt but also
the delay of re-filling the caches with data that normally stays
there.

3. Parallel handling of network packets. Modern network-oriented SoCs
have hardware scheduling mechanisms that can distribute packet
processing, including processing of packets within the same flow, to
multiple CPU cores, form processing pipelines, etc. Flow-related
state, if present, can be kept within the scheduling mechanism rather
than handled by CPU cores. When the CPUs behave predictably, this
allows fast processing of network traffic regardless of how the data
is distributed between flows or other details, except for situations
that warrant some kind of additional per-flow processing that can't be
performed in a stateless manner (and even that may be offloaded to
dedicated "slow path" CPU cores). This is fundamentally different from
the model in which each flow is kept on a single CPU for stateful
processing and buffering.

When an interrupt happens, a core is temporarily excluded from this
orchestrated process, and the packets it handles stay in a buffer,
possibly keeping whole flows waiting to be reordered before output.
Every time this happens, buffers end up holding more data than they
would without the interruption, so the system as a whole can sustain
less throughput before it has to drop packets. Handling of network
buffers and CPU access to them is usually extremely optimized in such
systems: the CPU typically accesses only a subset of the header data
in network buffers, then makes whatever decisions are not already made
by the complex arrangement of hardware header parsers, CAMs and queues
that precedes and follows this CPU-based processing.

In many cases all three types of applications suffer from
insufficiently flexible cache management; however, many recent CPUs
and SoCs are built specifically for the purposes mentioned above, and
more complex cache management can easily be added to them. When
implementing it, it's much easier to allow optimization for running an
application with predictable limits on access, falling back to an
un-optimized state when the predictable pattern is broken. This means
that designers of both hardware and software can breathe much easier
if the OS can guarantee an interrupt-free environment for
timing-critical tasks. Outside of those tasks it would be completely
harmless for the cache to fall out of the optimized state as long as
consistency is maintained by hardware, so there is no reason to expect
any incompatibility or violation of specifications that would require
special software support. However, as a person working on the software
side of a hardware company, I can't request such a feature when I know
that whatever they build will have to crash down like a Jenga tower
every second, let alone every millisecond.

To support these kinds of tasks, various companies have used either
full-featured RTOSes or "bare metal environments", which are basically
specialized, extremely reduced OSes designed to run up to one
application per CPU core with some minimal initialization and resource
allocation. While I agree that this is not the place to evaluate the
merits of those approaches in general, I can say that my recent
experience with both developers and users, including but not limited
to myself in both roles, shows that many would prefer Linux to provide
this environment for some tasks after initialization in normal mode.
If a task can go back and forth between isolated and non-isolated
mode, that greatly simplifies re-configuration, and any mechanism for
communicating with the outside from the isolated environment is a
bonus; however, isolation itself is the important part.

This was not the case in the early 2000s, when embedded systems were
built on single-core SoCs and microcontrollers. Back then either
everything had to be integrated into a single task under "bare metal",
or everything depended on interrupt processing provided by the OS.
Now, when network- and embedded-oriented chips are built with tens of
cores, it is sometimes better if the OS prepares everything and then
leaves the task, along with its core and the devices it handles, alone
and undisturbed. And if this can be done entirely within Linux,
consistently and integrated with other related features and the design
of its subsystems, I think it is justified.

> 
> > Since I first and foremost care about eliminating all disturbances
> > for
> > a running userspace task,
> 
> Why?

Mostly because I work on (3) from the list above, though experience
with the other uses mentioned contributed to that, too. I think some
very useful work has already been done in Linux to limit interrupts
and disturbances in general, and on isolated CPUs in particular. One
additional step that would provide a disturbance-free environment, and
greatly expand the possible applications of the Linux kernel, looks to
me like a worthwhile place to spend further effort.

> 
> > my approach is to allow disabling everything
> > including "unavoidable" synchronization IPIs, and make kernel entry
> > procedure recognize that some delayed synchronization is necessary
> > while avoiding race conditions. As far as I can tell, not everyone
> > wants to go that far, 
> 
> Suppose it depends on how long each interruption takes. Upon the 
> suggestion from Thomas and Frederic, i've been thinking it should 
> be possible, with the tracing framework, to record the length of 
> all interruptions to a given CPU and, every second check how many 
> have happened (and whether the sum of interruptions exceeds the 
> acceptable threshold).
> 
> Its not as nice as the task isolation patch (in terms of spotting
> the culprit, since one won't get a backtrace on the originating
> CPU in case of an IPI for example), but it should be possible 
> to verify that the latency threshold is not broken (which is
> what the application is interested in, isnt it?).

With the current version of the isolation patch I have completely
removed the diagnostics part of the code. I want to re-introduce a
much cleaner implementation of "interrupt cause diagnostics" on top of
it, mostly for development purposes -- at least I think it will give
me a better tool to chase the root causes of the interrupts that task
isolation can see.

Now that I have functions through which all kernel entry and exit
passes for isolated tasks, and an easily readable indication of an
isolated CPU, it should be easy to collect both "we are going to send
an IPI to a task in this special state because..." and "we are doing
something specific in the kernel, and it happens that our task was in
this special state". This might be achievable with tracing alone as
well; however, if we expand the conditions for cause recording beyond
fully isolated tasks, a separate recording mechanism would be able to
collect useful information and statistics with less of its own impact
on latency and performance. At least I think it makes sense to try
this separately from the task isolation implementation itself, which
no longer depends on knowing why the kernel was entered.

As for measuring latency, as I have mentioned before, it's not just
the time spent in the kernel processing the interrupt that matters;
it's also all the memory that is accessed in the process, and whatever
performance-affecting changes, such as TLB flushes or expected delayed
work, made the call necessary in the first place. And for non-isolated
tasks there is also the possibility of preemption, which can further
complicate the timing picture.

> 
> > and it may make sense to allow "almost isolated
> > tasks" that still receive normal interrupts, including IPIs and
> > page
> > faults. 
> > 
> > While that would be useless for the purposes that task
> > isolation patch was developed for, I recognize that some might
> > prefer
> > that to be one of the options set by the same prctl call. This
> > still
> > remains close enough to the design of task isolation -- same idea
> > of
> > something that affects CPU but being tied to a given task (and
> > dying
> > with it), same model of handling attributes, etc.
> > 
> > Maybe there can be a mask of what we do and don't want to avoid for
> > the
> > task. Say, some may want to only allow page faults or syscalls. Or
> > re-
> > enter isolation on breaking without notifying the userspace.
> 
> OK, i will try to come up with an interface that allows additional 
> attributes - please review and let me know if it works for task
> isolation patchset.

I think that, in addition to enable/disable and configuring a signal,
we will need some flags that enable and disable features such as full
isolation, avoidance of specific events and causes, logging (if
supported), etc.
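
For illustration, a minimal sketch of how such a flag-based prctl
interface could look from userspace. All constant names below are
hypothetical placeholders in the spirit of the task isolation
patchset, not a proposed ABI:

    #include <sys/prctl.h>
    #include <signal.h>

    /* Hypothetical constants, for illustration only. */
    #define PR_SET_TASK_ISOLATION          48
    #define PR_TASK_ISOLATION_ENABLE       (1 << 0) /* enter isolation      */
    #define PR_TASK_ISOLATION_FULL         (1 << 1) /* no interrupts at all */
    #define PR_TASK_ISOLATION_ALLOW_PF     (1 << 2) /* tolerate page faults */
    #define PR_TASK_ISOLATION_LOG          (1 << 3) /* record break causes  */
    /* Encode the signal delivered on isolation breaking. */
    #define PR_TASK_ISOLATION_SET_SIG(sig) ((sig) << 8)

    static int enter_isolation(void)
    {
            return prctl(PR_SET_TASK_ISOLATION,
                         PR_TASK_ISOLATION_ENABLE |
                         PR_TASK_ISOLATION_FULL |
                         PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
                         0, 0, 0);
    }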

> 
> Can you talk a little about the signal handling part? What type
> of applications are expected to perform once isolation is broken?

First and foremost, a task may be killed because a signal was sent to
it by another task, a terminal, a socket or a pipe. Then it's
completely normal signal processing, possibly followed by userspace
cleanup, reloading, etc., except that the isolation breaking happens
on kernel entry.

Second, an application may encounter an early exit from isolation
immediately after entering it, because some kernel event that was
supposed to cause IPIs to all non-isolated tasks happened while it was
entering. Then all it has to do is re-enter, as in the sketch below.
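
In terms of the hypothetical enter_isolation() wrapper sketched above,
and assuming (again, purely for illustration) that a benign early exit
is reported as EAGAIN, the handling can be a simple retry loop;
handle_fatal_error() and run_timing_critical_loop() stand in for the
application's own code:

    #include <errno.h>

    /* Sketch only: retry entering isolation after a benign early exit. */
    static void isolated_main(void)
    {
            while (enter_isolation() != 0) {
                    if (errno != EAGAIN)  /* assumed "retry" status */
                            handle_fatal_error();
                    /* A kernel-wide event raced with our entry;
                     * simply try again. */
            }
            run_timing_critical_loop();
    }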

Third, and this is what I do in libtmc, isolation breaking is handled
by notifying a separate isolation manager thread or process, so it can
check the state of tasks and timers and decide if and when it should
attempt to re-enter isolation, ask the task to leave and re-enter the
CPU, etc. I would be happier if those actions became unnecessary;
however, for now it's the easiest way to check whether anything, and
what in particular, may prevent the task from entering isolation.
Maybe it will make sense to have libtmc use ptrace() on a task from
the isolation manager, so the isolation manager will handle that
signal when necessary and then notify the task.
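
The notification pattern itself is simple; here is a rough sketch, not
libtmc's actual interface. The signal handler in the isolated task
does the absolute minimum, and the manager thread, running on a
non-isolated CPU, does the real checking:

    #include <unistd.h>

    static int notify_fd;  /* write end of a pipe to the manager */

    /* Runs in the isolated task when the isolation-breaking signal
     * arrives: write() is async-signal-safe, so just queue a token
     * and let the manager thread (on a non-isolated CPU) inspect
     * task and timer state at its leisure. */
    static void isolation_broken(int sig)
    {
            static const char token = 1;

            (void)sig;
            write(notify_fd, &token, 1);
    }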

As an alternative, I have added a simple readable isolation mask that
can be checked by processes (ones that are not isolated) such as the
isolation manager, and if a separate mechanism for reporting isolation
breaking appears, it may be used instead of the signal (and then the
isolation manager will be able to communicate with the task using
shared memory, a signal or other means).
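
Checking the mask from the manager is then a matter of reading one
file; a sketch, with the path being a placeholder rather than the
actual location the patch uses:

    #include <stdio.h>

    /* Sketch: read the mask of CPUs running isolated tasks.
     * The path below is a placeholder, not the patch's real file. */
    static int read_isolation_mask(char *buf, size_t len)
    {
            FILE *f = fopen("/sys/devices/system/cpu/isolated_tasks", "r");

            if (!f)
                    return -1;
            if (!fgets(buf, len, f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);
            return 0;
    }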

The original patch used a signal, and I can't predict what else users
may want to do on isolation breaking, so I have kept that interface
unchanged.

-- 
Alex