> On Apr 9, 2020, at 8:21 AM, Alex Belits <abelits@xxxxxxxxxxx> wrote:
>
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
>
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
>
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
>
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
>
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.

I could easily imagine myself using task isolation, but not with the
SIGKILL semantics. SIGKILL causes data loss. Please at least let users
choose what signal to send.
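
For context, here is a rough sketch of how an application might opt in
and retry, following the quoted description. It is not taken from the
patch. The numeric values of PR_TASK_ISOLATION and
PR_TASK_ISOLATION_ENABLE are placeholders (they are not in mainline
headers), the use of EAGAIN as the "could not quiesce, try again" result
is an assumption, and enter_task_isolation(), real_isolation_prctl() and
stub_isolation_prctl() are hypothetical helper names. The stub exists
only so the retry loop can be exercised on an unpatched kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

/* ASSUMPTION: placeholder values. Use the definitions from the patched
 * kernel's <linux/prctl.h>; mainline headers do not provide them. */
#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION 48
#endif
#ifndef PR_TASK_ISOLATION_ENABLE
#define PR_TASK_ISOLATION_ENABLE (1UL << 0)
#endif

/* Signature shared by the real call and the stub below. */
typedef int (*isolation_prctl_fn)(int option, unsigned long arg);

/* Thin wrapper around the real prctl(2) syscall. */
int real_isolation_prctl(int option, unsigned long arg)
{
    return prctl(option, arg, 0UL, 0UL, 0UL);
}

/* Hypothetical stand-in for a patched kernel: fails with EAGAIN
 * stub_failures_left times, then reports success. */
int stub_failures_left = 2;

int stub_isolation_prctl(int option, unsigned long arg)
{
    (void)option;
    (void)arg;
    if (stub_failures_left > 0) {
        stub_failures_left--;
        errno = EAGAIN; /* ASSUMPTION: the "retry" indication */
        return -1;
    }
    return 0;
}

/* Try to enter isolation, retrying while the call reports EAGAIN.
 * Returns 0 on success, -1 (errno left as set by the call) otherwise. */
int enter_task_isolation(isolation_prctl_fn do_prctl, int max_attempts)
{
    assert(do_prctl != NULL && max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (do_prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0)
            return 0;
        if (errno != EAGAIN)
            return -1; /* not a transient failure; give up */
    }
    return -1; /* still busy after max_attempts tries */
}
```

On a kernel carrying the patch, the caller would pass
real_isolation_prctl instead of the stub. Under the quoted semantics,
any system call made after a successful return (even a log write) would
get the task killed, so all setup, allocation and page touching has to
happen before the call.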