Re: [RFC] tentative prctl task isolation interface

On 2/1/21 02:48, Christoph Lameter wrote:
Notifications:
-------------

Notification mode of isolation breakage can be configured as follows:

- None (default): No notification is performed by the kernel on isolation
   breakage.

- Syslog: Isolation breakage is reported to syslog.

Syslog is intended for humans and isn't useful for userspace software processing. Since there are at least some cases where isolation breaking is unavoidable on startup (a benign race between entering isolation and an isolation-breaking event, or a register-mapping page fault), I would rather allow completely automated processing of those events. The signal interface does that now; however, I think it would help to associate software-handled events with either a software-identifiable "cause type" (ex: "scheduling timer" or "page fault") or a more verbose, human-readable "cause description" (ex: an IPI was received, and here is the sender CPU's stack dump that led to this IPI being sent).

The former ("cause") may be important for software (for example, it may want to have special processing of page faults for device registers), while the latter ("description") is more useful when it can be associated with a particular event in userspace without manual comparison of log timing and guesswork.
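For illustration, here is roughly what automated processing could look like on the userspace side, assuming a purely hypothetical interface where the kernel delivers a machine-readable cause code in siginfo (the signal number, the cause constants, and the use of si_code are all assumptions here, not an existing API):

#include <signal.h>

enum isol_break_cause {            /* hypothetical cause types */
    ISOL_BREAK_TIMER,
    ISOL_BREAK_PAGE_FAULT,
    ISOL_BREAK_IPI,
};

static void isol_break_handler(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    switch (si->si_code) {         /* assumed to carry the cause */
    case ISOL_BREAK_PAGE_FAULT:
        /* e.g. benign register-mapping fault: re-enter isolation */
        break;
    case ISOL_BREAK_TIMER:
    case ISOL_BREAK_IPI:
    default:
        /* unexpected: hand off to the manager, or abort */
        break;
    }
}

static void install_handler(void)
{
    struct sigaction sa = { 0 };

    sa.sa_sigaction = isol_break_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL); /* signal number chosen by the user */
}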



- Abort with core dump

I would use the existing signal interface for that, with a user-defined signal. The user can choose to handle the signal, ignore it, or let it kill the task with or without a core dump.

Oh, and if the user wants, they can use ptrace() to delegate this signal to some other process.
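The existing signal machinery already provides all of those modes. Assuming, purely for illustration, that the user picked SIGSYS as the breakage signal (any user-chosen signal works the same way, but SIGSYS's default action happens to include a core dump):

#include <signal.h>

static void set_disposition(void)
{
    signal(SIGSYS, SIG_IGN);   /* ignore breakage entirely */
    /* or: */
    signal(SIGSYS, SIG_DFL);   /* default SIGSYS action: kill + core dump */
    /* or: install a handler with sigaction(), or let a tracer
     * intercept the signal via ptrace(). */
}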


This is useful for debugging and for hard-core bare-metalers who never
want any interrupts.

One particular issue is page faults. One would have to prefault the
binary's executable functions in order to avoid "interruptions" through page
faults. Are these proper interruptions of the code? Certainly major faults
are, but minor faults may be OK? Dunno.

In practice, what I have often seen in such apps is that there is a "warm-up"
mode where all critical functions are executed, all important variables
are touched and dummy I/Os are performed in order to populate the caches
and prefault all the data. I guess one would run these without isolation
first and then switch on some sort of isolation mode after warm-up. So far
I think most people relied on the timer interrupt etc. to be turned off
after a few secs of just running through a polling loop without any OS
activities.

This is usually done not so much for page preloading as for the cache. There are mlock() and mlockall(), which load and lock pages explicitly. One exception is device registers -- they may remain unmapped until accessed.

I often see a pattern where an application enters isolation, calls a low-level library such as ODP, gets a page fault, leaves and re-enters isolation, and then everything runs perfectly because everything is mapped. However, in those cases mlockall() is done before entering isolation, so regular memory mappings are already there.
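To make the warm-up step concrete, here is a minimal sketch of what such a pre-isolation prefault could look like; the device path and register window size are made up for illustration:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile uint32_t sink;     /* keeps the read from being optimized away */

static int warm_up(void)
{
    volatile uint32_t *regs;
    int fd;

    /* Pin current and future regular mappings in RAM. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;

    /* Device registers may stay unmapped until first access, so
     * touch them explicitly; "/dev/uio0" stands in for the real
     * device. */
    fd = open("/dev/uio0", O_RDWR);
    if (fd >= 0) {
        regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, 0);
        if (regs != MAP_FAILED)
            sink = regs[0];        /* fault the mapping in now */
    }

    /* ...run the critical loop once to warm the caches, then
     * enter isolation. */
    return 0;
}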


I ended up implementing a manager/helper task that talks to tasks over a
socket (when they are not isolated) and over ring buffers in shared memory
(when they are isolated). While the current implementation is rather
limited, the intention is to delegate to it everything that the isolated task
either can't do at all (like writing logs) or that would be cumbersome
to implement (like monitoring the state of the task, or determining the
presence of deferred work after the task has returned to userspace), etc.

Interesting. Are you considering open-sourcing such a library? Seems like a
generic problem.

It's already open source: https://github.com/abelits/libtmc

It still needs some work. At the moment it does more than I would prefer, because it tries to detect possible problems, such as running timers, and at the same time it does not provide some obviously useful things, like an asynchronous interface to arbitrary file I/O.

I also want to allow the use of some generic interface for triggering interrupts from the isolated task to the manager (through, say, sacrificing a single GPIO), so that when this option is available, the manager won't have to do all that polling.
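For the curious, the shared-memory direction can be as simple as a single-producer/single-consumer ring; this is an illustrative layout, not libtmc's actual one:

#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 256             /* power of two */
#define SLOT_SIZE  128

struct ring {
    _Atomic unsigned int head;     /* advanced by the isolated task */
    _Atomic unsigned int tail;     /* advanced by the manager */
    char slot[RING_SLOTS][SLOT_SIZE];
};

/* Isolated side: no syscalls, just stores and a release barrier. */
static int ring_put(struct ring *r, const char *msg, size_t len)
{
    unsigned int h = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (h - atomic_load_explicit(&r->tail, memory_order_acquire)
        == RING_SLOTS)
        return -1;                 /* full: the manager hasn't caught up */
    memcpy(r->slot[h % RING_SLOTS], msg,
           len < SLOT_SIZE ? len : SLOT_SIZE);
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 0;
}

/* Manager side, called from its polling loop. */
static int ring_get(struct ring *r, char *out)
{
    unsigned int t = atomic_load_explicit(&r->tail, memory_order_relaxed);

    if (t == atomic_load_explicit(&r->head, memory_order_acquire))
        return -1;                 /* empty */
    memcpy(out, r->slot[t % RING_SLOTS], SLOT_SIZE);
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 0;
}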


Well, everyone swears by having the right implementation. The people I know
would not do anything with a socket in such situations. They would only
use shared memory and direct access to I/O devices via SPDK and DPDK, or
the RDMA subsystem.


The same applies to me. My library uses sockets to communicate while the task is not isolated, and that will be necessary if we want a dedicated manager process instead of a manager thread in every process. I would prefer initiating a connection with the manager through a socket, and only after that succeeds assuming that I can use a particular part of shared memory (because success means that the manager has allocated it for me, and no one else will race with me trying to touch it).
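A sketch of that handshake, under assumed details (the socket path is hypothetical, and the manager is assumed to pass a file descriptor for the shared area via SCM_RIGHTS):

#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static void *attach_to_manager(size_t len)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    union {                        /* aligned ancillary buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *c;
    int sk, shm_fd = -1;
    void *area;

    strcpy(addr.sun_path, "/run/tmc-manager.sock");  /* hypothetical */
    sk = socket(AF_UNIX, SOCK_SEQPACKET, 0);
    if (sk < 0)
        return NULL;
    if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    /* The manager replies with one byte plus the shared-memory fd. */
    if (recvmsg(sk, &msg, 0) < 1)
        goto fail;
    for (c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(&shm_fd, CMSG_DATA(c), sizeof(int));
    if (shm_fd < 0)
        goto fail;

    /* Only now is it safe to assume the region is ours. */
    area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                shm_fd, 0);
    close(sk);
    return area == MAP_FAILED ? NULL : area;
fail:
    close(sk);
    return NULL;
}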


Blocking? The app should fail if any deferred actions are triggered as a
result of syscalls. It would give a warning with _WARN.

There are many supposedly innocent things, nowhere near the scale of CPU
hotplug, that happen in a system and result in synchronization implemented
as an IPI to every online CPU. We should consider them an ordinary
occurrence, so there is a choice:

1. Ignore them completely and allow them in isolated mode. This will delay
userspace with no indication and no isolation breaking.

2. Allow them, and notify userspace afterwards (through vDSO or through a
userspace helper/manager over shared memory). This may be useful in those
rare situations when the consequences of the delay can be mitigated afterwards.

3. Make them break isolation, with userspace notified normally (ex:
with a signal in the current implementation). I guess this can be used if
most of the causes are somehow eliminated.

4. Prevent them from reaching the target CPU, and make sure that whatever
synchronization they are intended to cause happens when the intended target
CPU enters the kernel later. Since we may have to synchronize things like
code modification, some of this synchronization has to happen very early on
kernel entry.


Or move the actions to a different victim processor, like it is done with
RCU and vmstat, etc.

If possible. For most of those things, everything can be moved to other CPUs when entering isolation, or disallowed on CPUs intended for isolation in the first place (which is mostly how it's done now). The troublesome sources of interruption are the things that legitimately have to be done on all CPUs at once to synchronize some important kind of state, and that we now want to delay on some CPUs until the end of isolation.
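For reference, the existing "move it elsewhere" mechanisms are mostly driven by boot-time CPU lists; an illustrative kernel command line that keeps CPUs 2-7 clear of that kind of work could look like:

    nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1 isolcpus=domain,managed_irq,2-7

That covers what can be steered away up front; it is exactly the all-CPU synchronization IPIs that no such list controls.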


I am most interested in (4), so this is what was implemented in my version
of the patch (and currently I am trying to achieve completeness and, if
possible, elegance of the implementation).

Agree. (3) will be necessary as an intermediate step. The improvement
proposed in reply to Christoph, in this thread, separates notification
from syscall blockage.

I guess the notification mode will take care of the way we handle these
interruptions.


I think development should go in parallel: a "delayed synchronization on entry" mechanism that allows the "no-interruption mode" (4) to work once all interruptions are dealt with (it won't work perfectly at first, because there will still be "unprocessed" sources of interruptions), and a notification mechanism that allows us to find those sources and handle them properly under (3), so that we can eliminate them and enable (4). Since (4) still requires somewhat intrusive architecture-specific changes, there may be a period when (4) is only available on some CPUs while (3) works on everything. A conceptual sketch of the entry-time mechanism follows.
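To illustrate what "delayed synchronization on entry" means, here is conceptual kernel-style pseudocode only (not the actual patch; SYNC_CODE_MOD is a made-up work bit, and sync_core() stands in for whatever serialization the architecture needs):

/* Senders set a per-CPU pending mask instead of IPI-ing an
 * isolated CPU. */
static DEFINE_PER_CPU(unsigned long, isol_sync_pending);

static void defer_sync_to(int cpu, int work_bit)
{
    set_bit(work_bit, per_cpu_ptr(&isol_sync_pending, cpu));
}

/* Called very early on every kernel entry on the isolated CPU,
 * before any kernel code that might depend on the deferred work. */
static void isol_exit_fixups(void)
{
    unsigned long *pending = this_cpu_ptr(&isol_sync_pending);

    if (test_and_clear_bit(SYNC_CODE_MOD, pending))
        sync_core();    /* e.g. serialize after text patching */
    /* ...other deferred TLB/cache/state synchronization... */
}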

--
Alex



