On 2/1/21 02:48, Christoph Lameter wrote:
Notifications:
-------------
Notification mode of isolation breakage can be configured as follows:
- None (default): No notification is performed by the kernel on isolation
breakage.
- Syslog: Isolation breakage is reported to syslog.
Syslog is intended for humans and isn't useful for processing by userspace
software. Since there are at least some cases where isolation breaking is
unavoidable on startup (a benign race between isolation entry and an
isolation-breaking event, or a register-mapping page fault), I would rather
allow completely automated processing of those events. The signal interface
does that now; however, I think it would help to associate
software-handled events with either a software-identifiable "cause type"
(ex: "scheduling timer" or "page fault") or a more verbose human-readable
"cause description" (ex: an IPI was received, and here is the sender CPU's
stack dump that led to this IPI being sent).
The former ("cause") may be important for software (for example, it may
want to have special processing of page faults for device registers),
while the latter ("description") is more useful when it can be
associated with particular event in userspace without manual log timing
comparison and guesswork.
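For illustration, here is roughly what the userspace side of such an
interface could look like. The ISOL_CAUSE_* values and the idea of
carrying them in si_code are my assumptions about a possible interface,
not something the current patches implement:

/*
 * Hedged sketch only: ISOL_CAUSE_* values and their delivery in
 * si_code are hypothetical; the current patches deliver the signal
 * without a machine-readable cause.
 */
#include <signal.h>
#include <string.h>

#define ISOL_CAUSE_TIMER      1   /* scheduling timer tick            */
#define ISOL_CAUSE_PAGE_FAULT 2   /* e.g. register-mapping page fault */
#define ISOL_CAUSE_IPI        3   /* cross-CPU synchronization IPI    */

static void isolation_broken(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)uc;
    switch (si->si_code) {
    case ISOL_CAUSE_PAGE_FAULT:
        /* Benign on startup: prefault and re-enter isolation. */
        break;
    case ISOL_CAUSE_TIMER:
    case ISOL_CAUSE_IPI:
    default:
        /* Unexpected: record the cause for later processing
         * (no syscalls here, only async-signal-safe work). */
        break;
    }
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = isolation_broken;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
    /* ... arm isolation with SIGUSR1 as the breakage signal,
     * then enter the polling loop ... */
    return 0;
}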
- Abort with core dump
I would use the existing signal interface for that, with a user-defined
signal. The user can choose to handle the signal, ignore it, or let it
kill the task with or without a core dump.
Oh, and if the user wants, ptrace() can be used to delegate this signal
to some other process.
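With the interface from the proposed series it would look something like
this (the prctl number and flag values follow the proposed patches and
are not in mainline headers, so treat them as assumptions):

#include <sys/prctl.h>
#include <signal.h>

#ifndef PR_SET_TASK_ISOLATION
/* Values from the proposed series; not allocated in mainline. */
#define PR_SET_TASK_ISOLATION        48
#define PR_TASK_ISOLATION_ENABLE     (1 << 0)
#define PR_TASK_ISOLATION_SET_SIG(s) (((s) & 0x7f) << 8)
#endif

int enter_isolation(void)
{
    /* SIGABRT's default action terminates with a core dump, so
     * leaving it unhandled gives "abort with core dump"; installing
     * a handler instead gives the "notify and continue" behavior. */
    return prctl(PR_SET_TASK_ISOLATION,
                 PR_TASK_ISOLATION_ENABLE |
                 PR_TASK_ISOLATION_SET_SIG(SIGABRT),
                 0, 0, 0);
}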
This is useful for debugging and for hard-core bare-metalers who never
want any interrupts.
One particular issue is page faults. One would have to prefault the
binary's executable functions in order to avoid "interruptions" through
page faults. Are these proper interruptions of the code? Certainly major
faults are, but minor faults may be OK? Dunno.
In practice, what I have often seen in such apps is a "warm-up" mode
where all critical functions are executed, all important variables are
touched, and dummy I/Os are performed in order to populate the caches
and prefault all the data. I guess one would run these without isolation
first and then switch on some sort of isolation mode after warm-up. So
far, I think, most people have relied on the timer interrupt etc. being
turned off after a few seconds of just running through a polling loop
without any OS activities.
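Roughly like this; process_one_packet() here is just a placeholder for
the application's real critical path:

#include <sys/mman.h>

static volatile unsigned long sink;

/* Stand-in for the real hot path. */
static void process_one_packet(void)
{
    sink++;
}

int main(void)
{
    /* Fault in and pin all current and future mappings. */
    mlockall(MCL_CURRENT | MCL_FUTURE);

    /* Warm-up: execute the critical functions with dummy work so
     * their text and data are resident and the caches are warm. */
    for (int i = 0; i < 100000; i++)
        process_one_packet();

    /* ... then switch on isolation and run the real polling loop
     * without any further OS activity ... */
    return 0;
}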
This is usually done not so much for page preloading as for the cache.
There are mlock() and mlockall(), which load and lock pages explicitly.
One exception is device registers -- they may remain unmapped until
accessed. I can often see a pattern where an application enters
isolation, calls a low-level library such as ODP, gets a page fault,
leaves and re-enters isolation, and then everything runs perfectly
because everything is mapped. However, in those cases mlockall() is done
before entering isolation, so the regular memory mapping is already
there.
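For the device register case, the fix is a single dummy access after
mmap() and before entering isolation; the UIO device path below is just
an example, real applications would get the mapping from ODP/DPDK/VFIO
instead:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile uint32_t *map_and_prefault_regs(void)
{
    int fd = open("/dev/uio0", O_RDWR);   /* hypothetical device */

    if (fd < 0)
        return NULL;
    volatile uint32_t *regs = mmap(NULL, 4096,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    close(fd);
    if (regs == MAP_FAILED)
        return NULL;
    (void)regs[0];   /* take the page fault now, not while isolated */
    return regs;
}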
I ended up implementing a manager/helper task that talks to tasks over a
socket (when they are not isolated) and over ring buffers in shared
memory (when they are isolated). While the current implementation is
rather limited, the intention is to delegate to it everything that an
isolated task either can't do at all (like writing logs) or that would
be cumbersome to implement in it (like monitoring the state of the task,
or determining the presence of deferred work after the task has returned
to userspace).
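The isolated-side half of such a ring buffer is trivial; here is a
minimal single-producer/single-consumer sketch (an illustration of the
approach, not libtmc's actual data layout):

#include <stdatomic.h>
#include <string.h>

#define RING_SLOTS 256   /* power of two */
#define MSG_LEN    64

struct ring {
    _Atomic unsigned head;   /* advanced by isolated task (producer) */
    _Atomic unsigned tail;   /* advanced by manager (consumer)       */
    char msg[RING_SLOTS][MSG_LEN];
};

/* Producer side, callable from the isolated task: no syscalls. */
int ring_put(struct ring *r, const char *text)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return -1;   /* full: caller decides to drop or retry */

    strncpy(r->msg[head % RING_SLOTS], text, MSG_LEN - 1);
    r->msg[head % RING_SLOTS][MSG_LEN - 1] = '\0';
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

The manager polls the tail side and performs the actual write() to the
log on behalf of the isolated task.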
Interesting. Are you considering open-sourcing such a library? Seems
like a generic problem.
It's already open source: https://github.com/abelits/libtmc
It still needs some work. At the moment it does more than I would
prefer, because it tries to detect possible problems, such as running
timers, and at the same time it does not provide some obviously useful
things, like an asynchronous interface to arbitrary file I/O.
I also want to allow the use of some generic interface for triggering
interrupts from the isolated task to the manager (through, say, the
sacrifice of a single GPIO), so that if this option is available, the
manager won't have to do all that polling.
Well, everyone swears by having the right implementation. The people I
know would not do anything with a socket in such situations. They would
only use shared memory and direct access to I/O devices via SPDK and
DPDK or the RDMA subsystem.
The same applies to me. My library uses sockets to communicate when the
task is not isolated, and that will be necessary if we want a dedicated
manager process instead of a manager thread in every process. I would
prefer initiating the connection with the manager through a socket, and
only after that succeeds, assume that I can use a particular part of
shared memory (because it means that the manager allocated it for me,
and no one else will race with me trying to touch it).
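A common way to implement that handshake is to pass the shared-memory fd
over the socket with SCM_RIGHTS and mmap() it only after the manager
hands it out; the socket path below is hypothetical, and libtmc's actual
protocol may differ:

#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

void *attach_manager_shm(size_t len)
{
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };

    if (s < 0)
        return NULL;
    strncpy(addr.sun_path, "/run/isol-manager.sock",
            sizeof(addr.sun_path) - 1);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return NULL;
    }

    /* One byte of payload plus the shm fd as ancillary data. */
    char b;
    struct iovec iov = { .iov_base = &b, .iov_len = 1 };
    union { struct cmsghdr c; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *c;

    if (recvmsg(s, &msg, 0) <= 0 || !(c = CMSG_FIRSTHDR(&msg))) {
        close(s);
        return NULL;
    }
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    close(s);

    /* Success means the manager allocated this region for us, so
     * nobody else will race with us touching it. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}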
Blocking? The app should fail if any deferred actions are triggered as a
result of syscalls. It would give a warning with _WARN
There are many supposedly innocent things, nowhere near the scale of CPU
hotplug, that happen in a system and result in synchronization implemented
as an IPI to every online CPU. We should consider them an ordinary
occurrence, so there is a choice:
1. Ignore them completely and allow them in isolated mode. This will delay
userspace with no indication and no isolation breaking.
2. Allow them, and notify userspace afterwards (through vdso or through a
userspace helper/manager over shared memory). This may be useful in those
rare situations when the consequences of the delay can be mitigated
afterwards.
3. Make them break isolation, with userspace notified in the usual way (ex:
with a signal, in the current implementation). I guess this can be used if
most of the causes are somehow eliminated.
4. Prevent them from reaching the target CPU, and make sure that whatever
synchronization they are intended to cause happens when the intended target
CPU enters the kernel later. Since we may have to synchronize things like
code modification, some of this synchronization has to happen very early on
kernel entry.
Or move the actions to a different victim processor, as is done with
rcu, vmstat, etc.
If possible. For most of those things, everything can be moved to other
CPUs when entering isolation, or not allowed on CPUs intended for
isolation in the first place (which is how it's mostly done now). The
troublesome sources of interruption are things that are legitimately
supposed to be done on all CPUs at once to synchronize some important
kind of state, and now we want to delay them on some CPUs until the end
of isolation.
I am most interested in (4), so this is what was implemented in my version
of the patch (and currently I am trying to achieve completeness and, if
possible, elegance of the implementation).
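To illustrate the idea behind (4), here is a rough userspace analogue of
the protocol; the real thing lives in architecture-specific kernel entry
code, so this is only a model of the logic, not the actual patch:

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64

static _Atomic bool need_sync[NR_CPUS];

/* Initiator side: instead of sending an IPI to an isolated CPU,
 * mark it as needing synchronization. */
void defer_sync_to(int cpu)
{
    atomic_store_explicit(&need_sync[cpu], true, memory_order_release);
}

/* Target side: called very early on kernel entry, before anything
 * that could depend on the deferred synchronization (for example,
 * modified kernel code). */
void sync_on_kernel_entry(int cpu)
{
    if (atomic_exchange_explicit(&need_sync[cpu], false,
                                 memory_order_acq_rel)) {
        /* Perform the deferred work here: the equivalent of
         * sync_core(), TLB or icache flush, etc. for this CPU. */
    }
}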
Agreed. (3) will be necessary as an intermediate step. The improvement
to Christoph's reply proposed in this thread separates notification from
syscall blocking.
I guess the notification mode will take care of the way we handle these
interruptions.
I think development should go in parallel: a "delayed synchronization on
entry" mechanism that allows the "no-interruption mode" (4) to work once
all interruptions are dealt with (it won't work perfectly at first,
because there are still "unprocessed" sources of interruptions), and a
notification mechanism that will allow us to find and properly process
those sources as in (3), so we can exclude them and allow (4). Since (4)
still requires somewhat intrusive architecture-specific changes, there
may be some time when (4) is only available on some CPUs, while (3)
works on everything.
--
Alex