On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote: > The existing nohz_full mode is designed as a "soft" isolation mode > that makes tradeoffs to minimize userspace interruptions while > still attempting to avoid overheads in the kernel entry/exit path, > to provide 100% kernel semantics, etc. > > However, some applications require a "hard" commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications > to elect to have the "hard" semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. We are doing this at the process level but the isolation works on the CPU scope... Now I wonder if prctl is the right interface. That said the user is rather interested in isolating a task. The CPU being the backend eventually. For example if the task is migrated by accident, we want it to be warned about that. And if the isolation is done on the CPU level instead of the task level, this won't happen. I'm also afraid that the naming clashes with cpu_isolated_map, although it could be a subset of it. So probably in this case we should consider talking about task rather than CPU isolation and change naming accordingly (sorry, I know I suggested cpu_isolation.c, I guess I had to see the result to realize). We must sort that out first. Either we consider isolation on the task level (and thus the underlying CPU by backend effect) and we use prctl(). Or we do this on the CPU level and we use a specific syscall or sysfs which takes effect on any task in the relevant isolated CPUs. What do you think? It would be nice to hear others opinions as well. > The kernel must be built with the new CPU_ISOLATED Kconfig flag > to enable this mode, and the kernel booted with an appropriate > nohz_full=CPULIST boot argument. The "cpu_isolated" state is then > indicated by setting a new task struct field, cpu_isolated_flags, > to the value passed by prctl(). When the _ENABLE bit is set for a > task, and it is returning to userspace on a nohz_full core, it calls > the new cpu_isolated_enter() routine to take additional actions > to help the task avoid being interrupted in the future. > > Initially, there are only three actions taken. First, the > task calls lru_add_drain() to prevent being interrupted by a > subsequent lru_add_drain_all() call on another core. Then, it calls > quiet_vmstat() to quieten the vmstat worker to avoid a follow-on > interrupt. Finally, the code checks for pending timer interrupts > and quiesces until they are no longer pending. As a result, sys > calls (and page faults, etc.) can be inordinately slow. However, > this quiescing guarantees that no unexpected interrupts will occur, > even if the application intentionally calls into the kernel. > > Signed-off-by: Chris Metcalf <cmetcalf@xxxxxxxxxx> > --- > arch/tile/kernel/process.c | 9 ++++++ > include/linux/cpu_isolated.h | 24 +++++++++++++++ > include/linux/sched.h | 3 ++ > include/uapi/linux/prctl.h | 5 ++++ > kernel/context_tracking.c | 3 ++ > kernel/sys.c | 8 +++++ > kernel/time/Kconfig | 20 +++++++++++++ > kernel/time/Makefile | 1 + > kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ It's not about time :-) The timer is only a part of the isolation. Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would better fit. kernel/task_isolation.c maybe or just kernel/isolation.c I think I prefer the latter because I'm not only interested in that task hard isolation feature, I would like to also drive all the general isolation operations from there (workqueue affinity, rcu nocb, ...). > 9 files changed, 144 insertions(+) > create mode 100644 include/linux/cpu_isolated.h > create mode 100644 kernel/time/cpu_isolated.c > > diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c > index e036c0aa9792..7db6f8386417 100644 > --- a/arch/tile/kernel/process.c > +++ b/arch/tile/kernel/process.c > @@ -70,6 +70,15 @@ void arch_cpu_idle(void) > _cpu_idle(); > } > > +#ifdef CONFIG_CPU_ISOLATED > +void cpu_isolated_wait(void) > +{ > + set_current_state(TASK_INTERRUPTIBLE); > + _cpu_idle(); > + set_current_state(TASK_RUNNING); > +} I'm still uncomfortable with that. A wake up model could work? > +#endif > + > /* > * Release a thread_info structure > */ > diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h > new file mode 100644 > index 000000000000..a3d17360f7ae > --- /dev/null > +++ b/include/linux/cpu_isolated.h > @@ -0,0 +1,24 @@ > +/* > + * CPU isolation related global functions > + */ > +#ifndef _LINUX_CPU_ISOLATED_H > +#define _LINUX_CPU_ISOLATED_H > + > +#include <linux/tick.h> > +#include <linux/prctl.h> > + > +#ifdef CONFIG_CPU_ISOLATED > +static inline bool is_cpu_isolated(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); > +} > + > +extern void cpu_isolated_enter(void); > +extern void cpu_isolated_wait(void); > +#else > +static inline bool is_cpu_isolated(void) { return false; } > +static inline void cpu_isolated_enter(void) { } > +#endif And all the naming should be about task as well, if we take that task direction. > + > +#endif > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 04b5ada460b4..0bb248385d88 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1776,6 +1776,9 @@ struct task_struct { > unsigned long task_state_change; > #endif > int pagefault_disabled; > +#ifdef CONFIG_CPU_ISOLATED > + unsigned int cpu_isolated_flags; > +#endif Can't we add a new flag to tsk->flags? There seem to be some values remaining. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html