Adding Nitesh to CC.

On Thu, Jan 21, 2021 at 12:51:41PM -0300, Marcelo Tosatti wrote:
> Hi Alex,
>
> On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> > On 1/15/21 05:24, Christoph Lameter wrote:
> >
> > > ----------------------------------------------------------------------
> > > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > >
> > > > > How does one do a oneshot flush of OS activities?
> > > >
> > > > ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > > > if (ret == -1) {
> > > >         perror("prctl PR_TASK_ISOLATION_REQUEST");
> > > >         exit(0);
> > > > }
> > > >
> > > > >
> > > > > I.e. I have a polling loop over numerous shared and I/O devices in user
> > > > > space and I want to make sure that the system is quiet before I enter the
> > > > > loop.
> > > >
> > > > You could configure things in two ways: with syscalls allowed or not.
> > >
> > > Well syscalls that do not cause deferred processing like getting the time
> > > or determining the current cpu should be ok to use.
> >
> > Some of those syscalls go through vdso, and don't enter the kernel --
> > nothing specific is necessary to allow them, and it would be pointless and
> > difficult to prevent them.
> >
> > For syscalls that enter the kernel, it's often difficult to predict, if they
> > will or won't cause deferred processing, so I am afraid, it won't be
> > possible to provide a "safe" class of syscalls for this purpose and not end
> > up with something minimal like reading /sys and /proc. Right now isolation
> > only "allows" syscalls that exit isolation.
>
> Christoph wrote:
>
> "> Features that I think may be needed:
> >
> > F_ISOL_QUIESCE -> quiet down now but allow all OS activities. OS
> >     activities reset flag
> >
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >     require such actions in the future.
> >
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >     services require delayed processing etc
> >     but continue while resetting the flag.
> "
>
> It seems the only difference between HARD and WARN (let's call it SOFT)
> would be whether a notification is sent to userspace.
>
> The definition
>
> "F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
>     require such actions in the future."
>
> fails in the static_key_enable case: Alex's idea is to queue the i-cache
> flush if the remote task/cpu is in isolated mode (and perform the flush
> when entering the kernel).
>
> So even if userspace uses syscalls that do not require delayed
> processing, there are events which are out of control of the
> application and might require it.
>
> So let's assume the application performs a number of syscalls on a
> given time critical codepath.
>
> Either the system is configured so that
> the number/frequency of static_key_enable's is limited, or the cost of
> i-cache flushes must be accounted on that critical codepath.
>
> Anyway, trying to improve Christoph's definition:
>
> F_ISOL_QUIESCE   -> flush any pending operations that might cause
>                     the CPU to be interrupted (ex: free's
>                     per-CPU queues, sync MM statistics
>                     counters, etc).
>
> F_ISOL_ISOLATE   -> inform the kernel that userspace is
>                     entering isolated mode (see description
>                     below on "ISOLATION MODES").
>
> F_ISOL_UNISOLATE -> inform the kernel that userspace is
>                     leaving isolated mode.
>
> F_ISOL_NOTIFY    -> notification mode of isolation breakage
>                     modes.
>
>
> Isolation modes:
> ---------------
>
> There are two main types of isolation modes:
>
> - SOFT mode: does not prevent activities which might generate interruptions
>   (such as CPU hotplug).
>
> - HARD mode: prevents all blockable activities that might generate interruptions.
>   Administrators can override this via /sys.
>
> Notifications:
> -------------
>
> Notification mode of isolation breakage can be configured as follows:
>
> - None (default): No notification is performed by the kernel on isolation
>   breakage.
>
> - Syslog: Isolation breakage is reported to syslog.
>
> (new modes can be added, for example signals).
>
> A new feature can be added to disallow syscalls (by default syscalls
> are enabled, with reporting of pending activities that might cause
> an interruption in a VDSO).
>
> How about that?
>
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >     require such actions in the future.
> >
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >     services require delayed processing etc
> >     but continue while resetting the flag.
> >
>
> > It may be possible to set up a filter by the system (allowing few safe
> > things like reading /proc) and let the user expand it by adding combinations
> > of syscall / file descriptor. If some device is known to process operations
> > safely, user can open it and mark file descriptor as allowed, say, for
> > reading.
>
> Makes sense.
>
> > > And I already said that I want the system to quiet down and allow system
> > > calls. Some indication that deferred actions have occurred may be useful
> > > by f.e. resetting the flag.
>
> Do you think reporting activities that add overhead (the i-cache flush
> in mind) to syscalls separately in the VDSO?
>
> > I think, it should be possible to process a syscall, and if any deferred
> > action occurred, exit isolation on return to userspace.
>
> On the interface we are creating:
>
> ret = syscall()...
> if (vdso.pending_activity) {
>         prctl(PR_TASK_ISOLATION_REQUEST, F_ISOL_UNISOLATE, 0, 0);
>         ...
> }
>
> Why would it be necessary to exit isolation on return to userspace
> again?
>
> > Then there is a
> > question, how userspace should be notified about isolation being lost.
> > Normally this happens with a signal, but that is useful if we want syscall
> > to fail with EINTR, not to succeed. Make sure that signal arrives after
> > successful syscall return but before deferred action to happen? Sounds
> > convoluted. Maybe reflecting isolation status in vdso and having the user
> > check it there will be a good solution.
>
> Why can't userspace enable/disable isolation mode (and the kernel only
> reports it) ?
>
> I fail to see why the order of the events "isolated mode disablement"
> and "return to userspace" is critical.
>
> > When I worked on my implementation I have encountered both a problem of
> > interaction with the rest of system from isolated task (at least simple
> > things as logging) and a problem of handling enter/exit from isolation on a
> > system when it's possible for a task to be interrupted early after entering
> > isolation due to various events that were still in progress on other CPUs.
> >
> > I ended up implementing a manager/helper task that talks to tasks over a
> > socket (when they are not isolated) and over ring buffers in shared memory
> > (when they are isolated). While the current implementation is rather
> > limited, the intention is to delegate to it everything that isolated task
> > either can't do at all (like, writing logs) or that it would be cumbersome
> > to implement (like monitoring the state of task, determining presence of
> > deferred work after the task returned to userspace), etc.
>
> Interesting. Are you considering opensourcing such library? Seems like a
> generic problem.
>
> > It would be great if the complexity and amount of functionality of that
> > manager/helper task can be reduced, however I believe that having such a
> > task is a legitimate way of implementing things that otherwise would require
> > additional functionality in kernel.
> >
> > >
> > > > 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> > > > syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> > > > breaking):
> > >
> > > Well come up with a use case for that .... I know mine. What you propose
> > > could be useful for debugging for me but I would prefer the quiet down
> > > approach where I determine when I use some syscalls or not and will deal
> > > with the consequences.
> >
> > For my purposes breaking isolation on syscalls and notifications about
> > isolation breaking is a very useful approach -- this is why I kept it
> > exactly as it was in the original implementation by Chris Metcalf.
> >
> > In applications that I intend to use isolation for, the primary concern is
> > consistent time for running code in userspace, so syscalls should be only
> > issued when the task is specifically not in isolated mode. If the program
> > issues a syscall by mistake (and that may happen when some libraries are
> > used, or thread synchronization primitives are kept from non-isolated
> > version of the program, even though isolated tasks are not supposed to use
> > those), it means not only that deferred work may cause delay in the future,
> > but also that there is an additional time to be spent in kernel. This should
> > be immediately visible to the developer, and the best way to do it is by
> > breaking isolation on syscall immediately.
>
> I guess you can do that by hooking a BPF program to cpu->is_isolated ==
> true (for development) and syscall entry.
>
> > > >
> > > > > Features that I think may be needed:
> > > > >
> > > > > F_ISOL_QUIESCE -> quiet down now but allow all OS activities. OS
> > > > >     activities reset flag
> > > > >
> > > > > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> > > > >     require such actions in the future.
> > > >
> > > > Question: why BAREMETAL ?
> > >
> > > To distinguish it from "Realtime". We want the processor for ourselves
> > > without anything else running on it.
> > >
> > > > Two comments:
> > > >
> > > > 1) HARD mode could also block activities from different CPUs that can
> > > > interrupt this isolated CPU (for example CPU hotplug, or increasing
> > > > per-CPU trace buffer size).
> > >
> > > Blocking? The app should fail if any deferred actions are triggered as a
> > > result of syscalls. It would give a warning with _WARN
> >
> > There are many supposedly innocent things, nowhere at the scale of CPU
> > hotplug, that happen in a system and result in synchronization implemented
> > as an IPI to every online CPU. We should consider them to be an ordinary
> > occurrence, so there is a choice:
> >
> > 1. Ignore them completely and allow them in isolated mode. This will delay
> > userspace with no indication and no isolation breaking.
> >
> > 2. Allow them, and notify userspace afterwards (through vdso or through
> > userspace helper/manager over shared memory). This may be useful in those
> > rare situations when the consequences of delay can be mitigated afterwards.
> >
> > 3. Make them break isolation, with userspace being notified normally (ex:
> > with a signal in the current implementation). I guess, can be used if
> > somehow most of the causes will be eliminated.
> >
> > 4. Prevent them from reaching the target CPU and make sure that whatever
> > synchronization they are intended to cause, will happen when intended target
> > CPU will enter to kernel later. Since we may have to synchronize things like
> > code modification, some of this synchronization has to happen very early on
> > kernel entry.
> >
> > I am most interested in (4), so this is what was implemented in my version
> > of the patch (and currently I am trying to achieve completeness and, if
> > possible, elegance of the implementation).
>
> Agree. (3) will be necessary as intermediate step. The proposed
> improvement to Christoph's reply, in this thread, separates notification
> and syscall blockage.
>
> > I guess, if we want to add more controls, we can allow the user to choose
> > either of those four options, or of a subset of them. In my opinion, if (4)
> > will be available, and the only additional cost will be time for
> > synchronization spent in breaking isolation procedure, there is not much
> > need in the other three. Without (4) I don't think, the goal of providing
> > consistent, interruption-free environment is achieved at all, so not
> > implementing it would be very bad.
>
> Agree.
>
> > > > 2) For a type of application it is the case that certain interruptions
> > > > can be tolerated, as long as they do not cross certain thresholds.
> > > > For example, one loses the flexibility to read/write MSRs
> > > > on the isolated CPUs (including performance counters,
> > > > RDT/MBM type MSRs, frequency/power statistics) by
> > > > forcing a "no interruptions" mode.
> > >
> > > Does reading these really cause deferred actions by the OS? AFAICT you
> > > could map these into memory as well as read them without OS activities.
> >
> > Access to those is hardware/architecture-specific, and in many cases,
> > indeed, there is no need to issue a syscall at all.
> >
> > However for many applications the model with a helper task performing
> > interactions with OS on a different core and exchanging data over shared
> > memory may be sufficient, and it will also provide clear separation between
> > operations that do require consistent timing and those that don't.
>
> I see.
>
> > > "Interruptions that can be tolerated".... Well that is the wild west of
> > > "realtime" where you can define how much of a time slice is "real" and how
> > > much can be used by other processes. I do not think that any of that should
> > > come into this API.
> > >
> > >
> > To be honest, I have no idea, what can and can not be tolerated by
> > applications other than what I am familiar with. Applications that I know,
> > require no interruptions at all, so I want to implement that. I assume,
> > someone already uses existing CPU isolation for the purpose of providing
> > "nearly interrupt-less" environment.
> >
> > I can imagine something like a task of controlling a large slow-updating LED
> > display by bit-banging a strictly timed long serial message representing a
> > frame or frame update. If interrupted, it may, depending on the protocol,
> > corrupt the state of a single LED or fail to update until the end of the
> > screen, but the next start of message will reset the state, and everything
> > will work until the next interrupt. Maybe there are more realistic or useful
> > examples.
>
> Agree that "no interruptions" as a goal makes most sense.
>
> Can "whitelist" certain interruptions if necessary (to handle the MSR
> read case), if user desires.
>
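
To make the proposed request flags easier to discuss, here is a minimal
user-space sketch of the semantics described in the quoted message
(F_ISOL_QUIESCE, F_ISOL_ISOLATE, F_ISOL_UNISOLATE, F_ISOL_NOTIFY, and the
None/Syslog notification modes). It is not kernel code and not the actual
interface: the numeric flag values, struct isol_task, isol_request() and
isol_mock_syscall() are all hypothetical names made up for illustration.
isol_request() stands in for prctl(PR_TASK_ISOLATION_REQUEST, ...), and the
pending_activity field stands in for the vdso.pending_activity idea from the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values: the thread names these flags but assigns no numbers. */
enum isol_flag {
	F_ISOL_QUIESCE   = 1 << 0, /* flush pending deferred work now */
	F_ISOL_ISOLATE   = 1 << 1, /* userspace enters isolated mode */
	F_ISOL_UNISOLATE = 1 << 2, /* userspace leaves isolated mode */
	F_ISOL_NOTIFY    = 1 << 3, /* select breakage notification mode */
};

/* Notification modes from the thread: None (default) and Syslog. */
enum isol_notify { ISOL_NOTIFY_NONE = 0, ISOL_NOTIFY_SYSLOG = 1 };

/* Hypothetical per-task state; a real implementation would live in the kernel. */
struct isol_task {
	bool isolated;
	bool pending_activity;       /* stand-in for vdso.pending_activity */
	unsigned int pending_work;   /* count of queued deferred work items */
	enum isol_notify notify;
	unsigned int syslog_reports; /* how many breakages were "logged" */
};

/* Mock of prctl(PR_TASK_ISOLATION_REQUEST, flag, arg): 0 on success, -1 on error. */
static int isol_request(struct isol_task *t, unsigned int flag, unsigned int arg)
{
	assert(t != NULL);
	switch (flag) {
	case F_ISOL_QUIESCE:
		t->pending_work = 0;
		t->pending_activity = false;
		return 0;
	case F_ISOL_ISOLATE:
		t->isolated = true;
		return 0;
	case F_ISOL_UNISOLATE:
		t->isolated = false;
		return 0;
	case F_ISOL_NOTIFY:
		if (arg != ISOL_NOTIFY_NONE && arg != ISOL_NOTIFY_SYSLOG)
			return -1;
		t->notify = (enum isol_notify)arg;
		return 0;
	default:
		return -1;
	}
}

/* Mock syscall that queues `deferred` work items as a side effect. */
static void isol_mock_syscall(struct isol_task *t, unsigned int deferred)
{
	assert(t != NULL);
	if (deferred == 0)
		return; /* e.g. a vdso-only call: nothing is queued */
	t->pending_work += deferred;
	t->pending_activity = true;
	/* Breakage is reported only for isolated tasks that asked for it. */
	if (t->isolated && t->notify == ISOL_NOTIFY_SYSLOG)
		t->syslog_reports++;
}
```

In this model the application would request F_ISOL_NOTIFY and F_ISOL_QUIESCE,
then F_ISOL_ISOLATE, run its critical loop, and check pending_activity after
any syscall it chooses to make. Isolation is never exited implicitly here,
which matches the suggestion above that userspace enables and disables
isolation while the kernel only reports.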