On Mon, 2020-11-30 at 15:18 -0300, Marcelo Tosatti wrote:
> Two questions:
>
> What is the reason for not allowing _any_ interruption? Because
> it might be the case (and it is in the vRAN use cases our customers
> target), that some amount of interruption is tolerable.
>
> For example, for one use case 20us (of interruptions) every 1ms of
> elapsed time is tolerable.

I agree that for many applications that would be sufficient, and if so,
we can avoid the rest of the mechanism. However, I know three types of
applications where full isolation is absolutely necessary.

1. Physical process control.

A program continuously polls data from sensors, processes it,
determines the state of the controlled device or process, and issues
commands or tunes parameters to control it. An unpredictable delay
caused by an interrupt or other performance-affecting event can break
the timing of this mechanism.

Even taking into account that safety requirements usually demand that
some low-latency reactions be performed without CPU involvement, some
of the control logic may require a CPU, usually within a closed control
loop. For example, tracking the state of an RF connection and adjusting
the signal, or analyzing measurements and controlling a mechanical
system, including elasticity and vibration in industrial and
transportation applications, all require a large amount of calculation
and benefit from reduced buffering and delays. In the case of RF this
makes it possible to maintain more reliable connections. Mechanical
systems can have greater precision and lower stress on the components,
because such a system can easily implement damping and compensation for
deformation and deterioration.

Usually, when such processing is necessary, the "fast" part of it is
performed on a microcontroller, a separate CPU core built into an ASIC,
or even a "soft CPU" implemented on an FPGA, while "slow" calculations
are performed by a large CPU or SoC.
This, however, places severe limitations on the software design and
requires engineers to waste "expensive" resources, such as including
PCIe links in microcontrollers, building ASICs, implementing soft CPU
cores on FPGAs, etc., while the use of the cores of modern CPUs and
SoCs is limited. Sometimes a large processing mechanism is built out of
the DSP blocks of an FPGA and its built-in static RAM, while the more
capable SIMD units of a CPU remain under-used.

2. Audio processing.

A program implements a set of filters and mixers, processing multiple
audio streams in a live performance. Processing should be synchronized
not only between individual microphones, speakers and processors, but,
when used as a part of a sound reinforcement system, also with the
propagation of sound through the environment, which bypasses the audio
system but is subject to the acoustics of the building. The amount of
data is large; however, the calculations are always the same, so it is
possible to tune all delays to match the projected or measured timing.
The program runs at the sample rate of the signals, with the processing
algorithms already creating delays, so additional buffering would place
the total delay out of the available range.

This can be seen as a subset of controlling a mechanical system, except
that the control loop, if present, is usually much slower (such as
adapting to changing acoustical properties). However, processing of the
incoming data includes a very large amount of calculation that should
be performed on a full-featured CPU. The total amount of data that
should be available to a CPU core at a time can usually fit into its
cache, and if sufficient cache allocation or partitioning is available,
all necessary calculations (digital filters, mixing, etc.) can be
reliably performed with minimal latency.
Interrupts break this model, causing unpredictable cache invalidation,
so the design has to take into account not just the time spent
processing the interrupt but also the delays for re-filling the caches
with data that normally remains there.

3. Parallel handling of network packets.

Modern network-oriented SoCs have hardware scheduling mechanisms that
can distribute packet processing, including packet processing within
the same flow, to multiple CPU cores, form processing pipelines, etc.
Flow-related state, if present, can be kept within the scheduling
mechanism and not handled by CPU cores. When CPUs work in a predictable
manner, this allows fast processing of network traffic regardless of
the distribution of the amount of data between flows or other details,
save for situations that warrant some kind of additional per-flow
processing that can't be performed in a stateless manner (and that may
still be offloaded to dedicated CPU cores for the "slow path"). This is
fundamentally different from the model in which each flow is kept on a
single CPU for stateful processing and buffering.

When an interrupt happens, a core is temporarily excluded from this
orchestrated process, and the packets that it handles stay in a buffer,
possibly keeping whole flows waiting for reordering before the output.
Every time this happens, buffers end up holding more data than they
would without the interruption, so the whole system can sustain less
throughput before it has to drop packets.

Handling of network buffers and CPU access to them is usually extremely
optimized in such systems. The CPU usually only accesses a subset of
the header data in network buffers, then makes whatever decisions are
not already made by the complex system of hardware header parsers, CAMs
and queues that precede and follow this CPU-based processing.

In many cases all three types of applications suffer from
insufficiently flexible cache management. However, many recent CPUs and
SoCs are built specifically for the above-mentioned purposes, and more
complex cache management can easily be added to them. When implementing
it, it's much easier to allow optimization for running an application
with predictable limits on access, falling back into an un-optimized
state when the predictable pattern is broken. This means that designers
of both hardware and software can breathe much easier if the OS can
guarantee an interrupt-free environment for timing-critical tasks.
Outside of those tasks it would be completely harmless for the cache to
get out of the optimized state as long as consistency is maintained by
hardware, so there is no reason to expect any incompatibility or
violation of specifications that would require any special software
support. However, as a person working on the software side of a
hardware company, I can't request such a feature when I know that
whatever they build will have to crash down like a Jenga tower every
second, let alone every millisecond.

To support this kind of task, various companies have used either
full-featured RTOSes or "bare metal environments" that are basically
specialized, extremely reduced OSes, designed to run up to one
application per CPU core with some minimal initialization and resource
allocation. While I agree that this is not the place to evaluate the
merit of those approaches in general, I can say that my recent
experience with both developers and users, including but not limited to
myself being in both roles, shows that many would prefer Linux to
provide this environment when running some tasks after initialization
in a normal mode. If the task can go back and forth between isolated
and non-isolated mode, that would greatly simplify re-configuration,
and any mechanism for communicating outside of the isolated environment
can be a bonus; however, isolation itself is the important part.

This was not the case in the early 2000s, when embedded systems were
built on single-core SoCs and microcontrollers. Then either everything
had to be integrated in a single task under "bare metal", or everything
depended on interrupt processing provided by the OS. Now, when network
and embedded-oriented chips are built with tens of cores, sometimes
it's better if the OS prepares everything and leaves the task, along
with its core and the devices that it handles, alone and undisturbed.
And if this can be done entirely within Linux, consistently and
integrated with other related features and the design of its
subsystems, I think this is justified.

>
> > Since I first and foremost care about eliminating all disturbances
> > for a running userspace task,
>
> Why?

Mostly because I work on (3) from the list above. However, experience
with the other mentioned uses contributed to that, too.

I think some very useful work is already done in Linux for limiting
interrupts and disturbances in general and for isolated CPUs in
particular. One additional step that would make it possible to provide
a disturbance-free environment and greatly expand the possible
applications of the Linux kernel looks to me like a useful application
of more effort.

>
> > my approach is to allow disabling everything
> > including "unavoidable" synchronization IPIs, and make kernel entry
> > procedure recognize that some delayed synchronization is necessary
> > while avoiding race conditions. As far as I can tell, not everyone
> > wants to go that far,
>
> Suppose it depends on how long each interruption takes. Upon the
> suggestion from Thomas and Frederic, i've been thinking it should
> be possible, with the tracing framework, to record the length of
> all interruptions to a given CPU and, every second check how many
> have happened (and whether the sum of interruptions exceeds the
> acceptable threshold).
>
> Its not as nice as the task isolation patch (in terms of spotting
> the culprit, since one won't get a backtrace on the originating
> CPU in case of an IPI for example), but it should be possible
> to verify that the latency threshold is not broken (which is
> what the application is interested in, isnt it?).

With the current version of the isolation patch I have completely
removed the diagnostics part of the code. I want to re-introduce a much
cleaner implementation of "interrupt cause diagnostics" on top of it,
mostly for the purpose of development -- at least I think it will give
me a better tool to chase the root causes of interrupts that task
isolation can see.

Now that I have functions through which all kernel entry and exit pass
for isolated tasks, and some easily readable indication of an isolated
CPU, it should be easy to collect both "We are going to send an IPI to
a task in this special state because..." and "We are doing something
specific in the kernel, and it happens that our task was in this
special state".

It may be that this can be done just as well using tracing alone.
However, if we expand the conditions for cause recording beyond just
fully isolated tasks, a separate recording mechanism would be able to
collect useful information and statistics with less impact of its own
on latency and performance. At least, I think it makes sense to try
this separately from the task isolation implementation itself, which no
longer depends on knowing why the kernel was entered.

As for measuring latency, as I have mentioned before, it's not just the
time spent in the kernel in interrupt processing that matters. It's
also all the memory that is accessed in the process of doing so, and
whatever performance-affecting changes, such as TLB flushes or expected
delayed work, made the call necessary in the first place. And for
non-isolated tasks there is also the possibility of preemption, which
can further complicate the picture with timing.
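
As an illustration of the budget check suggested in the quoted text
(for example, 20us of interruptions per 1ms of elapsed time), a minimal
sketch in C might look like the following. The structure and function
names are hypothetical and do not come from the tracing framework or
from the task isolation patchset. It accounts only for time spent in
interruptions, so it does not capture the cache and TLB effects
described above, and it attributes each interruption to the window in
which it started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU accounting state; not part of any kernel API. */
struct intr_budget {
	uint64_t window_ns;    /* length of one accounting window */
	uint64_t limit_ns;     /* tolerated interruption time per window */
	uint64_t window_start; /* timestamp at which the current window began */
	uint64_t used_ns;      /* interruption time accumulated in this window */
};

static void intr_budget_init(struct intr_budget *b, uint64_t window_ns,
			     uint64_t limit_ns, uint64_t now_ns)
{
	b->window_ns = window_ns;
	b->limit_ns = limit_ns;
	b->window_start = now_ns;
	b->used_ns = 0;
}

/*
 * Record one interruption that started at start_ns and lasted dur_ns.
 * Timestamps are assumed to be monotonic. Returns true while the
 * current window is still within its budget.
 */
static bool intr_budget_record(struct intr_budget *b, uint64_t start_ns,
			       uint64_t dur_ns)
{
	uint64_t elapsed = start_ns - b->window_start;

	/* Open a fresh, window-aligned period once the old one has elapsed. */
	if (elapsed >= b->window_ns) {
		b->window_start = start_ns - elapsed % b->window_ns;
		b->used_ns = 0;
	}

	b->used_ns += dur_ns;
	return b->used_ns <= b->limit_ns;
}
```

With window_ns = 1000000 and limit_ns = 20000 this corresponds to the
20us-per-1ms example; a false return value marks the interruption that
pushed the window over its threshold.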
>
> > and it may make sense to allow "almost isolated
> > tasks" that still receive normal interrupts, including IPIs and
> > page faults.
> >
> > While that would be useless for the purposes that task
> > isolation patch was developed for, I recognize that some might
> > prefer that to be one of the options set by the same prctl call.
> > This still remains close enough to the design of task isolation --
> > same idea of something that affects CPU but being tied to a given
> > task (and dying with it), same model of handling attributes, etc.
> >
> > Maybe there can be a mask of what we do and don't want to avoid for
> > the task. Say, some may want to only allow page faults or syscalls.
> > Or re-enter isolation on breaking without notifying the userspace.
>
> OK, i will try to come up with an interface that allows additional
> attributes - please review and let me know if it works for task
> isolation patchset.

I think, in addition to enable/disable and configuring a signal, we
will need some flags that enable and disable features, such as full
isolation and avoidance of specific events and causes, logging (if
supported), etc.

>
> Can you talk a little about the signal handling part? What type
> of applications are expected to perform once isolation is broken?

First and foremost, a task may be killed because a signal was sent to
it by another task, terminal, socket or pipe. Then it's completely
normal signal processing, possibly a userspace cleanup, reloading,
etc., except that isolation breaking happens on kernel entry.

Second, the application may encounter an early exit from isolation
immediately after entering, because some kernel event that was supposed
to cause IPIs to all non-isolated tasks happened while it was entering.
Then all it has to do is re-enter.

Third, and this is what I do in libtmc, isolation breaking is handled
by notifying a separate isolation manager thread or process, so it can
check the state of tasks and timers and decide if and when it should
attempt to re-enter, ask the task to leave and re-enter the CPU, etc.

I would be happier if those actions became unnecessary; however, for
now it's the easiest way to check whether anything, and what in
particular, may prevent the task from entering isolation.

Maybe it will make sense to make libtmc use ptrace() on a task from the
isolation manager, so the isolation manager will handle that signal
when necessary and then notify the task. As an alternative, I have
added a simple readable isolation mask that can be checked by processes
(ones that are not isolated) such as the isolation manager, and if
there is a separate mechanism for reporting isolation breaking, it may
be used instead of the signal (and then the isolation manager will be
able to communicate with the task using shared memory, a signal or
other means).

The original patch used a signal, and I can't predict what else users
may want to do on isolation breaking, so I have kept that interface
unchanged.

-- 
Alex
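
The second case described above (isolation broken immediately after
entering, with re-entry as the only required action) can be sketched in
userspace roughly as follows. This is only an illustration under stated
assumptions: the enter callback stands in for whatever call requests
isolation, the choice of signal is arbitrary, and fake_enter() merely
simulates two early breaks so that the sketch can be exercised without
the patchset. None of these names come from the patch or from libtmc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler when the isolation-breaking signal arrives. */
static volatile sig_atomic_t isolation_broken;

static void on_isolation_break(int sig)
{
	(void)sig;
	isolation_broken = 1;
}

/* Install the handler for the signal chosen to report isolation breaking. */
static int install_break_handler(int signo)
{
	struct sigaction sa;

	sa.sa_handler = on_isolation_break;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	return sigaction(signo, &sa, NULL);
}

/* Hypothetical callback that requests isolation; returns 0 on success. */
typedef int (*enter_fn)(void);

/*
 * Request isolation and retry if it was broken right after entry.
 * Returns the number of attempts used, or -1 if all attempts failed.
 */
static int enter_with_retry(enter_fn enter, int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		isolation_broken = 0;
		if (enter() != 0)
			continue; /* request rejected, try again */
		if (!isolation_broken)
			return attempt; /* still isolated after entry */
	}
	return -1;
}

/* Simulation only: pretend isolation breaks on the first two entries. */
static int fake_enter(void)
{
	static int calls;

	if (++calls <= 2)
		raise(SIGUSR1);
	return 0;
}
```

A caller would install the handler once and then call
enter_with_retry() with its own entry function; an isolation manager,
as in the third case, could apply the same bounded-retry policy from
outside the task.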