On Sun, Mar 08, 2020 at 03:47:08AM +0000, Alex Belits wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
>
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
>
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flags, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
>
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
>
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
>
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
>
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting. Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
>
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
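
As an aside for anyone following the API description above: the intended
userspace flow would look roughly like the sketch below. This is untested
and illustrative only -- the fallback #defines mirror the uapi header in
this patch, while the CPU number, signal choice, and retry policy are my
assumptions, not part of the patch.

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sched.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <sys/prctl.h>
  #include <unistd.h>

  #ifndef PR_TASK_ISOLATION
  #define PR_TASK_ISOLATION              48
  #define PR_TASK_ISOLATION_ENABLE       (1 << 0)
  #define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
  #endif

  static void lost_isolation(int sig)
  {
          /* E.g. log to the application's logging mechanism, then exit
           * (or reset the PC back into the isolated code segment). */
          static const char msg[] = "isolation lost\n";

          (void)sig;
          write(STDERR_FILENO, msg, sizeof(msg) - 1);
          _exit(1);
  }

  int main(void)
  {
          cpu_set_t set;

          /* Must be affinitized to a single task-isolation CPU;
           * CPU 2 is just an example. */
          CPU_ZERO(&set);
          CPU_SET(2, &set);
          if (sched_setaffinity(0, sizeof(set), &set) != 0)
                  exit(1);

          /* Catch the notification instead of the default SIGKILL. */
          signal(SIGUSR1, lost_isolation);

          /* Retry while the kernel fails to quiesce the core; bail out
           * on hard errors such as -EINVAL. */
          while (prctl(PR_TASK_ISOLATION,
                       PR_TASK_ISOLATION_ENABLE |
                       PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0) != 0) {
                  if (errno != EAGAIN)
                          exit(1);
          }

          /* Isolated from here on: any syscall, fault, or IRQ now
           * delivers SIGUSR1. */
          for (;;)
                  ;       /* poll device rings, etc. */
  }
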
>
> Separate patches that follow provide these changes for x86, arm,
> and arm64.
>
> Signed-off-by: Alex Belits <abelits@xxxxxxxxxxx>
> ---
>  .../admin-guide/kernel-parameters.txt |   6 +
>  include/linux/hrtimer.h               |   4 +
>  include/linux/isolation.h             | 229 ++++++
>  include/linux/sched.h                 |   4 +
>  include/linux/tick.h                  |   3 +
>  include/uapi/linux/prctl.h            |   6 +
>  init/Kconfig                          |  28 +
>  kernel/Makefile                       |   2 +
>  kernel/context_tracking.c             |   2 +
>  kernel/isolation.c                    | 774 ++++++++++++++++++
>  kernel/signal.c                       |   2 +
>  kernel/sys.c                          |   6 +
>  kernel/time/hrtimer.c                 |  27 +
>  kernel/time/tick-sched.c              |  18 +
>  14 files changed, 1111 insertions(+)
>  create mode 100644 include/linux/isolation.h
>  create mode 100644 kernel/isolation.c
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c07815d230bc..e4a2d6e37645 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4808,6 +4808,12 @@
> neutralize any effect of /proc/sys/kernel/sysrq.
> Useful for debugging.
>
> + task_isolation_debug [KNL]
> + In kernels built with CONFIG_TASK_ISOLATION, this
> + setting will generate console backtraces to
> + accompany the diagnostics generated about
> + interrupting tasks running with task isolation.
> +
> tcpmhash_entries= [KNL,NET]
> Set the number of tcp_metrics_hash slots.
> Default value is 8192 or 16384 depending on total
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 15c8ac313678..e81252eb4f92 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
> /* Show pending timers: */
> extern void sysrq_timer_list_show(void);
>
> +#ifdef CONFIG_TASK_ISOLATION
> +extern void kick_hrtimer(void);
> +#endif
> +
> int hrtimers_prepare_cpu(unsigned int cpu);
> #ifdef CONFIG_HOTPLUG_CPU
> int hrtimers_dead_cpu(unsigned int cpu);
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> new file mode 100644
> index 000000000000..6bd71c67f10f
> --- /dev/null
> +++ b/include/linux/isolation.h
> @@ -0,0 +1,229 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Task isolation support
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@xxxxxxxxxxxx>
> + *   Alex Belits <abelits@xxxxxxxxxxx>
> + *   Yuri Norov <ynorov@xxxxxxxxxxx>
> + */
> +#ifndef _LINUX_ISOLATION_H
> +#define _LINUX_ISOLATION_H
> +
> +#include <stdarg.h>
> +#include <linux/errno.h>
> +#include <linux/cpumask.h>
> +#include <linux/prctl.h>
> +#include <linux/types.h>
> +
> +struct task_struct;
> +
> +#ifdef CONFIG_TASK_ISOLATION
> +
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
> +
> +#define pr_task_isol_emerg(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
> +
> +#define pr_task_isol_emerg_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug_supp(cpu, fmt, ...) \
> + task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
> +DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);

gcc output:

  In file included from ./arch/x86/include/asm/apic.h:6,
                   from arch/x86/kernel/apic/apic_noop.c:14:
  ./include/linux/isolation.h:58:32: error: unknown type name 'tsk_thread_flags_copy'
   DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
                                  ^~~~~~~~~~~~~~~~~~~~~

This looks like include-order fallout: isolation.h uses DECLARE_PER_CPU()
but never includes <linux/percpu.h>, so in translation units that pull in
isolation.h first the macro is undefined and the line parses as a function
declaration. Note also that the header declares tsk_thread_flags_copy
while kernel/isolation.c defines tsk_thread_flags_cache, so this
declaration appears to be stale anyway. My fix (a workaround; adding the
include and/or renaming the declaration would be the real fix):

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 6bd71c67f10f..a392abed304b 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -55,7 +55,7 @@ int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
 	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
 #define pr_task_isol_debug_supp(cpu, fmt, ...) \
 	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
-DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
+//DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
 extern cpumask_var_t task_isolation_map;
 
 /**

> +extern cpumask_var_t task_isolation_map;
> +
> +/**
> + * task_isolation_request() - prctl hook to request task isolation
> + * @flags: Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
> + *
> + * This is called from the generic prctl() code for PR_TASK_ISOLATION.
> + *
> + * Return: Returns 0 when task isolation is enabled, otherwise a negative
> + * errno.
> + */
> +extern int task_isolation_request(unsigned int flags);
> +extern void task_isolation_cpu_cleanup(void);
> +/**
> + * task_isolation_start() - attempt to actually start task isolation
> + *
> + * This function should be invoked as the last thing prior to returning to
> + * user space if TIF_TASK_ISOLATION is set in the thread_info flags. It
> + * will attempt to quiesce the core and enter task-isolation mode. If it
> + * fails, it will reset the system call return value to an error code that
> + * indicates the failure mode.
> + */
> +extern void task_isolation_start(void);
> +
> +/**
> + * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
> + * @cpu: CPU to check.
> + */
> +static inline bool is_isolation_cpu(int cpu)
> +{
> + return task_isolation_map != NULL &&
> + cpumask_test_cpu(cpu, task_isolation_map);
> +}
> +
> +/**
> + * task_isolation_on_cpu() - check if the cpu is running an isolated task
> + * @cpu: CPU to check.
> + */
> +extern int task_isolation_on_cpu(int cpu);
> +extern void task_isolation_check_run_cleanup(void);
> +
> +/**
> + * task_isolation_cpumask() - set CPUs currently running isolated tasks
> + * @mask: Mask to modify.
> + */
> +extern void task_isolation_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
> + * @mask: Mask to modify.
> + */
> +extern void task_isolation_clear_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_syscall() - report a syscall from an isolated task
> + * @nr: The syscall number.
> + *
> + * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
> + * set in the thread_info flags. It checks for valid syscalls,
> + * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
> + * For any other syscall it will raise a signal and return failure.
> + *
> + * Return: 0 for acceptable syscalls, -1 for all others.
> + */
> +extern int task_isolation_syscall(int nr);
> +
> +/**
> + * _task_isolation_interrupt() - report an interrupt of an isolated task
> + * @fmt: A format string describing the interrupt
> + * @...: Format arguments, if any.
> + *
> + * This routine should be invoked at any exception or IRQ if
> + * TIF_TASK_ISOLATION is set in the thread_info flags. It is not necessary
> + * to invoke it if the exception will generate a signal anyway (e.g. a bad
> + * page fault), and in that case it is preferable not to invoke it but just
> + * rely on the standard Linux signal. The macro task_isolation_interrupt()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_interrupt(const char *fmt, ...);
> +#define task_isolation_interrupt(fmt, ...) \
> + do { \
> + if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
> + _task_isolation_interrupt(fmt, ## __VA_ARGS__); \
> + } while (0)
> +
> +/**
> + * task_isolation_remote() - report a remote interrupt of an isolated task
> + * @cpu: The remote cpu that is about to be interrupted.
> + * @fmt: A format string describing the interrupt
> + * @...: Format arguments, if any.
> + *
> + * This routine should be invoked any time a remote IPI or other type of
> + * interrupt is being delivered to another cpu. The function will check to
> + * see if the target core is running a task-isolation task, and generate a
> + * diagnostic on the console if so; in addition, we tag the task so it
> + * doesn't generate another diagnostic when the interrupt actually arrives.
> + * Generating a diagnostic remotely yields a clearer indication of what
> + * happened than just reporting only when the remote core is interrupted.
> + */
> +extern void task_isolation_remote(int cpu, const char *fmt, ...);
> +
> +/**
> + * task_isolation_remote_cpumask() - report interruption of multiple cpus
> + * @mask: The set of remote cpus that are about to be interrupted.
> + * @fmt: A format string describing the interrupt
> + * @...: Format arguments, if any.
> + *
> + * This is the cpumask variant of task_isolation_remote(). We
> + * generate a single-line diagnostic message even if multiple remote
> + * task-isolation cpus are being interrupted.
> + */
> +extern void task_isolation_remote_cpumask(const struct cpumask *mask,
> + const char *fmt, ...);
> +
> +/**
> + * _task_isolation_signal() - disable task isolation when signal is pending
> + * @task: The task for which to disable isolation.
> + *
> + * This function generates a diagnostic and disables task isolation; it
> + * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
> + * pending signal. The task_isolation_interrupt() function normally
> + * generates a diagnostic for events that just interrupt a task without
> + * generating a signal; here we need to hook the paths that correspond to
> + * interrupts that do generate a signal. The macro task_isolation_signal()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_signal(struct task_struct *task);
> +#define task_isolation_signal(task) \
> + do { \
> + if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
> + _task_isolation_signal(task); \
> + } while (0)
> +
> +/**
> + * task_isolation_user_exit() - debug all user_exit calls
> + *
> + * By default, we don't generate an exception in the low-level user_exit()
> + * code, because programs would lose the ability to disable task isolation:
> + * the user_exit() hook would cause a signal prior to task_isolation_syscall()
> + * disabling task isolation. In addition, we would lose all the
> + * diagnostic info otherwise available from task_isolation_interrupt() hooks
> + * later in the interrupt-handling process. But you may enable it here for
> + * a special kernel build if you are having undiagnosed userspace jitter.
> + */
> +static inline void task_isolation_user_exit(void)
> +{
> +#ifdef DEBUG_TASK_ISOLATION
> + task_isolation_interrupt("user_exit");
> +#endif
> +}
> +
> +#else /* !CONFIG_TASK_ISOLATION */
> +static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
> +static inline void task_isolation_start(void) { }
> +static inline bool is_isolation_cpu(int cpu) { return false; }
> +static inline int task_isolation_on_cpu(int cpu) { return 0; }
> +static inline void task_isolation_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_cpu_cleanup(void) { }
> +static inline void task_isolation_check_run_cleanup(void) { }
> +static inline int task_isolation_syscall(int nr) { return 0; }
> +static inline void task_isolation_interrupt(const char *fmt, ...) { }
> +static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
> +static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
> + const char *fmt, ...) { }
> +static inline void task_isolation_signal(struct task_struct *task) { }
> +static inline void task_isolation_user_exit(void) { }
> +#endif
> +
> +#endif /* _LINUX_ISOLATION_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..52fdb32aa3b9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1280,6 +1280,10 @@ struct task_struct {
> unsigned long lowest_stack;
> unsigned long prev_lowest_stack;
> #endif
> +#ifdef CONFIG_TASK_ISOLATION
> + unsigned short task_isolation_flags; /* prctl */
> + unsigned short task_isolation_state;
> +#endif
>
> /*
> * New fields for task_struct should be added above here, so that
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7340613c7eff..27c7c033d5a8 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
> extern void tick_nohz_full_kick_cpu(int cpu);
> extern void __tick_nohz_task_switch(void);
> extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
> +#ifdef CONFIG_TASK_ISOLATION
> +extern int try_stop_full_tick(void);
> +#endif
> #else
> static inline bool tick_nohz_full_enabled(void) { return false; }
> static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..f4848ed2a069 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,10 @@ struct prctl_mm_map {
> #define PR_SET_IO_FLUSHER 57
> #define PR_GET_IO_FLUSHER 58
>
> +/* Enable task_isolation mode for TASK_ISOLATION kernels. */
> +#define PR_TASK_ISOLATION 48
> +# define PR_TASK_ISOLATION_ENABLE (1 << 0)
> +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 20a6ac33761c..ecdf567f6bd4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -576,6 +576,34 @@ config CPU_ISOLATION
>
> source "kernel/rcu/Kconfig"
>
> +config HAVE_ARCH_TASK_ISOLATION
> + bool
> +
> +config TASK_ISOLATION
> + bool "Provide hard CPU isolation from the kernel on demand"
> + depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
> + help
> + Allow userspace processes that place themselves on cores with
> + nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
> + to "isolate" themselves from the kernel. Prior to returning to
> + userspace, isolated tasks will arrange that no future kernel
> + activity will interrupt the task while the task is running in
> + userspace. Attempting to re-enter the kernel while in this mode
> + will cause the task to be terminated with a signal; you must
> + explicitly use prctl() to disable task isolation before resuming
> + normal use of the kernel.
> +
> + This "hard" isolation from the kernel is required for tasks
> + running hard real-time applications in userspace, such as
> + a high-speed network driver. Without this option, but
> + with NO_HZ_FULL enabled, the kernel will make a good-faith, "soft"
> + effort to shield a single userspace process from interrupts, but
> + makes no guarantees.
> +
> + You should say "N" unless you are intending to run a
> + high-performance userspace driver or similar task.
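
As an aside: to actually exercise this once an architecture selects
HAVE_ARCH_TASK_ISOLATION (the follow-on patches), the setup would
presumably be

  CONFIG_NO_HZ_FULL=y
  CONFIG_TASK_ISOLATION=y

plus the boot-argument format from the commit message, e.g. for CPUs 2-3:

  isolcpus=nohz,domain,2-3 task_isolation_debug

(the CPU list is an example; task_isolation_debug is optional and just
adds the backtraces described above).
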
> +
> config BUILD_BIN2C
> bool
> default n
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb4130ced32..2f2ae91f90d5 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
> KASAN_SANITIZE_stackleak.o := n
> KCOV_INSTRUMENT_stackleak.o := n
>
> +obj-$(CONFIG_TASK_ISOLATION) += isolation.o
> +
> $(obj)/configs.o: $(obj)/config_data.gz
>
> targets += config_data.gz
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0296b4bda8f1..e9206736f219 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -21,6 +21,7 @@
> #include <linux/hardirq.h>
> #include <linux/export.h>
> #include <linux/kprobes.h>
> +#include <linux/isolation.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/context_tracking.h>
> @@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
> if (state == CONTEXT_USER) {
> vtime_user_exit(current);
> trace_user_exit(0);
> + task_isolation_user_exit();
> }
> }
> __this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..ae29732c376c
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,774 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * linux/kernel/isolation.c
> + *
> + * Implementation of task isolation.
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@xxxxxxxxxxxx>
> + *   Alex Belits <abelits@xxxxxxxxxxx>
> + *   Yuri Norov <ynorov@xxxxxxxxxxx>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/sched.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include <linux/smp.h>
> +#include <linux/tick.h>
> +#include <asm/unistd.h>
> +#include <asm/syscall.h>
> +#include <linux/hrtimer.h>
> +
> +/*
> + * These values are stored in task_isolation_state.
> + * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
> + * returning from sys_prctl() to userspace.
> + */
> +enum {
> + STATE_NORMAL = 0, /* Not isolated */
> + STATE_ISOLATED = 1 /* In userspace, isolated */
> +};
> +
> +/*
> + * This variable contains thread flags copied at the moment
> + * when schedule() switched to the task on a given CPU,
> + * or 0 if no task is running.
> + */
> +DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
> +
> +/*
> + * Counter for isolation state on a given CPU, increments when entering
> + * isolation and decrements when exiting isolation (before or after the
> + * cleanup). Multiple simultaneously running procedures entering or
> + * exiting isolation are prevented by checking the result of
> + * incrementing or decrementing this variable. This variable is both
> + * incremented and decremented by the CPU that caused the isolation entry
> + * or exit.
> + *
> + * This is necessary because multiple isolation-breaking events may happen
> + * at once (or one as the result of the other); however, isolation exit
> + * may only happen once to transition from the isolated to the non-isolated
> + * state. Therefore, if decrementing this counter results in a value less
> + * than 0, the isolation exit procedure can't be started -- it already
> + * happened, or is in progress, or isolation has not been entered yet.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_counter);
> +
> +/*
> + * Description of the last two tasks that ran isolated on a given CPU.
> + * This is intended only for messages about isolation breaking. We
> + * don't want any references to the actual task while accessing this from
> + * the CPU that caused the isolation breaking -- we know nothing about
> + * timing and don't want to use locking or RCU.
> + */
> +struct isol_task_desc {
> + atomic_t curr_index;
> + atomic_t curr_index_wr;
> + bool warned[2];
> + pid_t pid[2];
> + pid_t tgid[2];
> + char comm[2][TASK_COMM_LEN];
> +};
> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> +
> +/*
> + * Counter for isolation exiting procedures (from request to the start of
> + * cleanup) being attempted at once on a CPU. Normally, incrementing of
> + * this counter is performed from the CPU that caused the isolation
> + * breaking; however, decrementing is done from the cleanup procedure,
> + * delegated to the CPU that is exiting isolation, not from the CPU that
> + * caused the isolation breaking.
> + *
> + * If incrementing this counter while starting the isolation exit procedure
> + * results in a value greater than 1, isolation exiting is already in
> + * progress, and cleanup did not start yet. This means the counter should be
> + * decremented back, and the isolation exit that is already in progress
> + * should be allowed to complete. Otherwise, a new isolation exit procedure
> + * should be started.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_exit_counter);
> +
> +/*
> + * Descriptor for isolation-breaking SMP calls
> + */
> +DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
> +
> +cpumask_var_t task_isolation_map;
> +cpumask_var_t task_isolation_cleanup_map;
> +static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
> +
> +/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
> +static int __init task_isolation_init(void)
> +{
> + alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
> + if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
> + /*
> + * At this point task isolation should match
> + * nohz_full. This may change in the future.
> + */
> + cpumask_copy(task_isolation_map, tick_nohz_full_mask);
> + return 0;
> +}
> +core_initcall(task_isolation_init)
> +
> +/* Enable stack backtraces of any interrupts of task_isolation cores. */
> +static bool task_isolation_debug;
> +static int __init task_isolation_debug_func(char *str)
> +{
> + task_isolation_debug = true;
> + return 1;
> +}
> +__setup("task_isolation_debug", task_isolation_debug_func);
> +
> +/*
> + * Record name, pid and group pid of the task entering isolation on
> + * the current CPU.
> + */
> +static void record_curr_isolated_task(void)
> +{
> + int ind;
> + int cpu = smp_processor_id();
> + struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
> + struct task_struct *task = current;
> +
> + /* Finish everything before recording current task */
> + smp_mb();
> + ind = atomic_inc_return(&desc->curr_index_wr) & 1;
> + desc->comm[ind][sizeof(task->comm) - 1] = '\0';
> + memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
> + desc->pid[ind] = task->pid;
> + desc->tgid[ind] = task->tgid;
> + desc->warned[ind] = false;
> + /* Write everything, to be seen by other CPUs */
> + smp_mb();
> + atomic_inc(&desc->curr_index);
> + /* Everyone will see the new record from this point */
> + smp_mb();
> +}
> +
> +/*
> + * Print message prefixed with the description of the current (or
> + * last) isolated task on a given CPU. Intended for isolation-breaking
> + * messages that include the target task for the user's convenience.
> + *
> + * Messages produced with this function may have obsolete task
> + * information if isolated tasks managed to exit, start and enter
> + * isolation multiple times, or multiple tasks tried to enter
> + * isolation on the same CPU at once. For those unusual cases the message
> + * would still contain a valid description of the cause of isolation
> + * breaking and the target CPU number, just not the correct description
> + * of which task ended up losing isolation.
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
> +{
> + struct isol_task_desc *desc;
> + struct task_struct *task;
> + va_list args;
> + char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
> + char buf[200];
> + int curr_cpu, ind_counter, ind_counter_old, ind;
> +
> + curr_cpu = get_cpu();
> + desc = &per_cpu(isol_task_descs, cpu);
> + ind_counter = atomic_read(&desc->curr_index);
> +
> + if (curr_cpu == cpu) {
> + /*
> + * Message is for the current CPU so the current
> + * task_struct should be used instead of cached
> + * information.
> + *
> + * Like in other diagnostic messages, if issued from
> + * interrupt context, current will be the interrupted
> + * task. Unlike other diagnostic messages, this is
> + * always relevant because the message is about
> + * interrupting a task.
> + */
> + ind = ind_counter & 1;
> + if (supp && desc->warned[ind]) {
> + /*
> + * If supp is true, skip the message if the
> + * same task was mentioned in a message that
> + * originated on a remote CPU, and it did not
> + * re-enter isolated state since then (warned
> + * is true). Only local messages following
> + * remote messages, likely about the same
> + * isolation breaking event, are skipped to
> + * avoid duplication. If a remote cause is
> + * immediately followed by a local one before
> + * isolation is broken, the local cause is
> + * omitted from the messages.
> + */
> + put_cpu();
> + return 0;
> + }
> + task = current;
> + snprintf(buf_prefix, sizeof(buf_prefix),
> + "isolation %s/%d/%d (cpu %d)",
> + task->comm, task->tgid, task->pid, cpu);
> + put_cpu();
> + } else {
> + /*
> + * Message is for a remote CPU, use cached information.
> + */
> + put_cpu();
> + /*
> + * Make sure the index remained unchanged while the data
> + * was copied. If it changed, the data that was copied may
> + * be inconsistent because two updates in a sequence could
> + * overwrite the data while it was being read.
> + */
> + do {
> + /* Make sure we are reading up to date values */
> + smp_mb();
> + ind = ind_counter & 1;
> + snprintf(buf_prefix, sizeof(buf_prefix),
> + "isolation %s/%d/%d (cpu %d)",
> + desc->comm[ind], desc->tgid[ind],
> + desc->pid[ind], cpu);
> + desc->warned[ind] = true;
> + ind_counter_old = ind_counter;
> + /* Record the warned flag, then re-read descriptor */
> + smp_mb();
> + ind_counter = atomic_read(&desc->curr_index);
> + /*
> + * If the counter changed, something was updated, so
> + * repeat everything to get the current data
> + */
> + } while (ind_counter != ind_counter_old);
> + }
> +
> + va_start(args, fmt);
> + vsnprintf(buf, sizeof(buf), fmt, args);
> + va_end(args);
> +
> + switch (level) {
> + case LOGLEVEL_EMERG:
> + pr_emerg("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_ALERT:
> + pr_alert("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_CRIT:
> + pr_crit("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_ERR:
> + pr_err("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_WARNING:
> + pr_warn("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_NOTICE:
> + pr_notice("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_INFO:
> + pr_info("%s: %s", buf_prefix, buf);
> + break;
> + case LOGLEVEL_DEBUG:
> + pr_debug("%s: %s", buf_prefix, buf);
> + break;
> + default:
> + /* No message without a valid level */
> + return 0;
> + }
> + return 1;
> +}
> +
> +/*
> + * Dump stack if need be. This can be helpful even from the final exit
> + * to usermode code since stack traces sometimes carry information about
> + * what put you into the kernel, e.g. an interrupt number encoded in
> + * the initial entry stack frame that is still visible at exit time.
> + */
> +static void debug_dump_stack(void)
> +{
> + if (task_isolation_debug)
> + dump_stack();
> +}
> +
> +/*
> + * Set the flags word but don't try to actually start task isolation yet.
> + * We will start it when entering user space in task_isolation_start().
> + */
> +int task_isolation_request(unsigned int flags)
> +{
> + struct task_struct *task = current;
> +
> + /*
> + * The task isolation flags should always be cleared just by
> + * virtue of having entered the kernel.
> + */
> + WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
> + WARN_ON_ONCE(task->task_isolation_flags != 0);
> + WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
> +
> + task->task_isolation_flags = flags;
> + if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
> + return 0;
> +
> + /* We are trying to enable task isolation. */
> + set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
> +
> + /*
> + * Shut down the vmstat worker so we're not interrupted later.
> + * We have to try to do this here (with interrupts enabled) since
> + * we are canceling delayed work and will call flush_work()
> + * (which enables interrupts) and possibly schedule().
> + */
> + quiet_vmstat_sync();
> +
> + /* We return 0 here but we may change that in task_isolation_start(). */
> + return 0;
> +}
> +
> +/*
> + * Perform actions that should be done immediately on exit from isolation.
> + */
> +static void fast_task_isolation_cpu_cleanup(void *info)
> +{
> + atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
> + /* At this point breaking isolation from other CPUs is possible again */
> +
> + /*
> + * This task is no longer isolated (and if by any chance this
> + * is the wrong task, it's already not isolated)
> + */
> + current->task_isolation_flags = 0;
> + clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
> +
> + /* Run the rest of cleanup later */
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +
> + /* Copy flags with task isolation disabled */
> + this_cpu_write(tsk_thread_flags_cache,
> + READ_ONCE(task_thread_info(current)->flags));
> +}
> +
> +/* Disable task isolation for the specified task. */
> +static void stop_isolation(struct task_struct *p)
> +{
> + int cpu, this_cpu;
> + unsigned long flags;
> +
> + this_cpu = get_cpu();
> + cpu = task_cpu(p);
> + if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
> + /* Already exiting isolation */
> + atomic_dec(&per_cpu(isol_exit_counter, cpu));
> + put_cpu();
> + return;
> + }
> +
> + if (p == current) {
> + p->task_isolation_state = STATE_NORMAL;
> + fast_task_isolation_cpu_cleanup(NULL);
> + task_isolation_cpu_cleanup();
> + if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> + /* Was not isolated to begin with */
> + atomic_inc(&per_cpu(isol_counter, cpu));
> + }
> + put_cpu();
> + } else {
> + if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> + /* Was not isolated to begin with */
> + atomic_inc(&per_cpu(isol_counter, cpu));
> + atomic_dec(&per_cpu(isol_exit_counter, cpu));
> + put_cpu();
> + return;
> + }
> + /*
> + * Schedule "slow" cleanup. This relies on
> + * TIF_NOTIFY_RESUME being set.
> + */
> + spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
> + cpumask_set_cpu(cpu, task_isolation_cleanup_map);
> + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> + /*
> + * Setting flags is delegated to the CPU where the
> + * isolated task is running; isol_exit_counter will be
> + * decremented from there as well.
> + */
> + per_cpu(isol_break_csd, cpu).func =
> + fast_task_isolation_cpu_cleanup;
> + per_cpu(isol_break_csd, cpu).info = NULL;
> + per_cpu(isol_break_csd, cpu).flags = 0;
> + smp_call_function_single_async(cpu,
> + &per_cpu(isol_break_csd, cpu));
> + put_cpu();
> + }
> +}
> +
> +/*
> + * This code runs with interrupts disabled just before the return to
> + * userspace, after a prctl() has requested enabling task isolation.
> + * We take whatever steps are needed to avoid being interrupted later:
> + * drain the lru pages, stop the scheduler tick, etc. More
> + * functionality may be added here later to avoid other types of
> + * interrupts from other kernel subsystems.
> + *
> + * If we can't enable task isolation, we update the syscall return
> + * value with an appropriate error.
> + */
> +void task_isolation_start(void)
> +{
> + int error;
> +
> + /*
> + * We should only be called in STATE_NORMAL (isolation disabled),
> + * on our way out of the kernel from the prctl() that turned it on.
> + * If we are exiting from the kernel in another state, it means we
> + * made it back into the kernel without disabling task isolation,
> + * and we should investigate how (and in any case disable task
> + * isolation at this point). We are clearly not on the path back
> + * from the prctl() so we don't touch the syscall return value.
> + */
> + if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
> + /* Increment counter, this will allow isolation breaking */
> + if (atomic_inc_return(&per_cpu(isol_counter,
> + smp_processor_id())) > 1) {
> + atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> + }
> + atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
> + stop_isolation(current);
> + return;
> + }
> +
> + /*
> + * Must be affinitized to a single core with task isolation possible.
> + * In principle this could be remotely modified between the prctl()
> + * and the return to userspace, so we have to check it here.
> + */
> + if (current->nr_cpus_allowed != 1 ||
> + !is_isolation_cpu(smp_processor_id())) {
> + error = -EINVAL;
> + goto error;
> + }
> +
> + /* If the vmstat delayed work is not canceled, we have to try again. */
> + if (!vmstat_idle()) {
> + error = -EAGAIN;
> + goto error;
> + }
> +
> + /* Try to stop the dynamic tick. */
> + error = try_stop_full_tick();
> + if (error)
> + goto error;
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + /* Increment counter, this will allow isolation breaking */
> + if (atomic_inc_return(&per_cpu(isol_counter,
> + smp_processor_id())) > 1) {
> + atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> + }
> +
> + /* Record isolated task IDs and name */
> + record_curr_isolated_task();
> +
> + /* Copy flags with task isolation enabled */
> + this_cpu_write(tsk_thread_flags_cache,
> + READ_ONCE(task_thread_info(current)->flags));
> +
> + current->task_isolation_state = STATE_ISOLATED;
> + return;
> +
> +error:
> + /* Increment counter, this will allow isolation breaking */
> + if (atomic_inc_return(&per_cpu(isol_counter,
> + smp_processor_id())) > 1) {
> + atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> + }
> + stop_isolation(current);
> + syscall_set_return_value(current, current_pt_regs(), error, 0);
> +}
> +
> +/* Stop task isolation on the remote task and send it a signal. */
> +static void send_isolation_signal(struct task_struct *task)
> +{
> + int flags = task->task_isolation_flags;
> + kernel_siginfo_t info = {
> + .si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
> + };
> +
> + stop_isolation(task);
> + send_sig_info(info.si_signo, &info, task);
> +}
> +
> +/* Only a few syscalls are valid once we are in task isolation mode. */
> +static bool is_acceptable_syscall(int syscall)
> +{
> + /* No need to incur an isolation signal if we are just exiting. */
> + if (syscall == __NR_exit || syscall == __NR_exit_group)
> + return true;
> +
> + /* Check to see if it's the prctl for isolation. */
> + if (syscall == __NR_prctl) {
> + unsigned long arg[SYSCALL_MAX_ARGS];
> +
> + syscall_get_arguments(current, current_pt_regs(), arg);
> + if (arg[0] == PR_TASK_ISOLATION)
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * This routine is called from syscall entry, prevents most syscalls
> + * from executing, and if needed raises a signal to notify the process.
> + *
> + * Note that we have to stop isolation before we even print a message
> + * here, since otherwise we might end up reporting an interrupt due to
> + * kicking the printk handling code, rather than reporting the true
> + * cause of interrupt here.
> + *
> + * The message is not suppressed by previous remotely triggered
> + * messages.
> + */
> +int task_isolation_syscall(int syscall)
> +{
> + struct task_struct *task = current;
> +
> + if (is_acceptable_syscall(syscall)) {
> + stop_isolation(task);
> + return 0;
> + }
> +
> + send_isolation_signal(task);
> +
> + pr_task_isol_warn(smp_processor_id(),
> + "task_isolation lost due to syscall %d\n",
> + syscall);
> + debug_dump_stack();
> +
> + syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
> + return -1;
> +}
> +
> +/*
> + * This routine is called from any exception or irq that doesn't
> + * otherwise trigger a signal to the user process (e.g. page fault).
> + *
> + * Messages will be suppressed if there is already a reported remote
> + * cause for isolation breaking, so we don't generate multiple
> + * confusingly similar messages about the same event.
> + */
> +void _task_isolation_interrupt(const char *fmt, ...)
> +{
> + struct task_struct *task = current;
> + va_list args;
> + char buf[100];
> +
> + /* RCU should have been enabled prior to this point. */
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
> +
> + /* Are we exiting isolation already? */
> + if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
> + task->task_isolation_state = STATE_NORMAL;
> + return;
> + }
> + /*
> + * Avoid reporting interrupts that happen after we have prctl'ed
> + * to enable isolation, but before we have returned to userspace.
> + */
> + if (task->task_isolation_state == STATE_NORMAL)
> + return;
> +
> + va_start(args, fmt);
> + vsnprintf(buf, sizeof(buf), fmt, args);
> + va_end(args);
> +
> + /* Handle NMIs minimally, since we can't send a signal. */
> + if (in_nmi()) {
> + pr_task_isol_err(smp_processor_id(),
> + "isolation: in NMI; not delivering signal\n");
> + } else {
> + send_isolation_signal(task);
> + }
> +
> + if (pr_task_isol_warn_supp(smp_processor_id(),
> + "task_isolation lost due to %s\n", buf))
> + debug_dump_stack();
> +}
> +
> +/*
> + * Called before we wake up a task that has a signal to process.
> + * Needs to be done to handle interrupts that trigger signals, which
> + * we don't catch with task_isolation_interrupt() hooks.
> + *
> + * This message is also suppressed if there was already a remotely
> + * caused message about the same isolation breaking event.
> + */
> +void _task_isolation_signal(struct task_struct *task)
> +{
> + struct isol_task_desc *desc;
> + int ind, cpu;
> + bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
> +
> + cpu = task_cpu(task);
> + desc = &per_cpu(isol_task_descs, cpu);
> + ind = atomic_read(&desc->curr_index) & 1;
> + if (desc->warned[ind])
> + do_warn = false;
> +
> + stop_isolation(task);
> +
> + if (do_warn) {
> + pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
> + task->comm, task->tgid, task->pid, cpu);
> + debug_dump_stack();
> + }
> +}
> +
> +/*
> + * Generate a stack backtrace if we are going to interrupt another task
> + * isolation process.
> + */
> +void task_isolation_remote(int cpu, const char *fmt, ...)
> +{
> + struct task_struct *curr_task;
> + va_list args;
> + char buf[200];
> +
> + if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
> + return;
> +
> + curr_task = current;
> +
> + va_start(args, fmt);
> + vsnprintf(buf, sizeof(buf), fmt, args);
> + va_end(args);
> + if (pr_task_isol_warn(cpu,
> + "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> + buf,
> + curr_task->comm, curr_task->tgid,
> + curr_task->pid, smp_processor_id()))
> + debug_dump_stack();
> +}
> +
> +/*
> + * Generate a stack backtrace if any of the cpus in "mask" are running
> + * task isolation processes.
> + */
> +void task_isolation_remote_cpumask(const struct cpumask *mask,
> + const char *fmt, ...)
> +{
> + struct task_struct *curr_task;
> + cpumask_var_t warn_mask;
> + va_list args;
> + char buf[200];
> + int cpu, first_cpu;
> +
> + if (task_isolation_map == NULL ||
> + !zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
> + return;
> +
> + first_cpu = -1;
> + for_each_cpu_and(cpu, mask, task_isolation_map) {
> + if (task_isolation_on_cpu(cpu)) {
> + if (first_cpu < 0)
> + first_cpu = cpu;
> + else
> + cpumask_set_cpu(cpu, warn_mask);
> + }
> + }
> +
> + if (first_cpu < 0)
> + goto done;
> +
> + curr_task = current;
> +
> + va_start(args, fmt);
> + vsnprintf(buf, sizeof(buf), fmt, args);
> + va_end(args);
> +
> + if (cpumask_weight(warn_mask) == 0)
> + pr_task_isol_warn(first_cpu,
> + "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> + buf, curr_task->comm, curr_task->tgid,
> + curr_task->pid, smp_processor_id());
> + else
> + pr_task_isol_warn(first_cpu,
> + " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> + cpumask_pr_args(warn_mask),
> + buf, curr_task->comm, curr_task->tgid,
> + curr_task->pid, smp_processor_id());
> + debug_dump_stack();
> +
> +done:
> + free_cpumask_var(warn_mask);
> +}
> +
> +/*
> + * Check if a given CPU is running an isolated task.
> + */
> +int task_isolation_on_cpu(int cpu)
> +{
> + return test_bit(TIF_TASK_ISOLATION,
> + &per_cpu(tsk_thread_flags_cache, cpu));
> +}
> +
> +/*
> + * Set CPUs currently running isolated tasks in CPU mask.
> + */
> +void task_isolation_cpumask(struct cpumask *mask)
> +{
> + int cpu;
> +
> + if (task_isolation_map == NULL)
> + return;
> +
> + for_each_cpu(cpu, task_isolation_map)
> + if (task_isolation_on_cpu(cpu))
> + cpumask_set_cpu(cpu, mask);
> +}
> +
> +/*
> + * Clear CPUs currently running isolated tasks in CPU mask.
> + */
> +void task_isolation_clear_cpumask(struct cpumask *mask)
> +{
> + int cpu;
> +
> + if (task_isolation_map == NULL)
> + return;
> +
> + for_each_cpu(cpu, task_isolation_map)
> + if (task_isolation_on_cpu(cpu))
> + cpumask_clear_cpu(cpu, mask);
> +}
> +
> +/*
> + * Cleanup procedure. The call to this procedure may be delayed.
> + */
> +void task_isolation_cpu_cleanup(void)
> +{
> + kick_hrtimer();
> +}
> +
> +/*
> + * Check if cleanup is scheduled on the current CPU, and if so, run it.
> + * Intended to be called from notify_resume() or another such callback
> + * on the target CPU.
> + */
> +void task_isolation_check_run_cleanup(void)
> +{
> + int cpu;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
> +
> + cpu = smp_processor_id();
> +
> + if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
> + cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
> + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> + task_isolation_cpu_cleanup();
> + } else
> + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> +}
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5b2396350dd1..1df57e38c361 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -46,6 +46,7 @@
> #include <linux/livepatch.h>
> #include <linux/cgroup.h>
> #include <linux/audit.h>
> +#include <linux/isolation.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/signal.h>
> @@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
> */
> void signal_wake_up_state(struct task_struct *t, unsigned int state)
> {
> + task_isolation_signal(t);
> set_tsk_thread_flag(t, TIF_SIGPENDING);
> /*
> * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> diff --git a/kernel/sys.c b/kernel/sys.c
> index f9bc5c303e3f..0a4059a8c4f9 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/syscore_ops.h>
> #include <linux/version.h>
> #include <linux/ctype.h>
> +#include <linux/isolation.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2513,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> break;
> + case PR_TASK_ISOLATION:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = task_isolation_request(arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3a609e7344f3..5bb98f39bde6 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -30,6 +30,7 @@
> #include <linux/syscalls.h>
> #include <linux/interrupt.h>
> #include <linux/tick.h>
> +#include <linux/isolation.h>
> #include <linux/err.h>
> #include <linux/debugobjects.h>
> #include <linux/sched/signal.h>
> @@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
> raw_spin_unlock(&base->lock);
> }
>
> +#ifdef CONFIG_TASK_ISOLATION
> +void kick_hrtimer(void)
> +{
> + unsigned long flags;
> +
> + preempt_disable();
> + local_irq_save(flags);
> + retrigger_next_event(NULL);
> + local_irq_restore(flags);
> + preempt_enable();
> +}
> +#endif
> +
> /*
> * Switch to high resolution mode
> */
> @@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
> void clock_was_set(void)
> {
> #ifdef CONFIG_HIGH_RES_TIMERS
> +#ifdef CONFIG_TASK_ISOLATION
> + struct cpumask mask;
> +
> + cpumask_clear(&mask);
> + task_isolation_cpumask(&mask);
> + cpumask_complement(&mask, &mask);
> + /*
> + * Retrigger the CPU local events everywhere except CPUs
> + * running isolated tasks.
> + */
> + on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
> +#else
> /* Retrigger the CPU local events everywhere */
> on_each_cpu(retrigger_next_event, NULL, 1);
> +#endif
> #endif
> timerfd_clock_was_set();
> }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a792d21cac64..1d4dec9d3ee7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -882,6 +882,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
> #endif
> }
>
> +#ifdef CONFIG_TASK_ISOLATION
> +int try_stop_full_tick(void)
> +{
> + int cpu = smp_processor_id();
> + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> + /* For an unstable clock, we should return a permanent error code. */
> + if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
> + return -EINVAL;
> +
> + if (!can_stop_full_tick(cpu, ts))
> + return -EAGAIN;
> +
> + tick_nohz_stop_sched_tick(ts, cpu);
> + return 0;
> +}
> +#endif
> +
> static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> {
> /*
> --
> 2.20.1
>
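
One closing note for readers waiting on the arch patches: based purely on
the kernel-doc comments in isolation.h above, the per-arch glue is expected
to look roughly like the fragments below. This is illustrative only --
names such as ti_flags, syscall_nr, and irq are placeholders, and none of
this is taken from the actual x86/arm/arm64 patches.

  /* 1. Syscall entry, with _TIF_TASK_ISOLATION set: allow only
   *    prctl(PR_TASK_ISOLATION), exit() and exit_group(); anything else
   *    raises the isolation signal and suppresses the syscall. */
  if (ti_flags & _TIF_TASK_ISOLATION) {
          if (task_isolation_syscall(syscall_nr) == -1)
                  return; /* return value already set to -ERESTARTNOINTR */
  }

  /* 2. Last thing before returning to userspace after the prctl():
   *    quiesce the core, or rewrite the syscall return value with
   *    -EINVAL/-EAGAIN so that userspace can retry. */
  if (ti_flags & _TIF_TASK_ISOLATION)
          task_isolation_start();

  /* 3. In any exception or IRQ that does not already raise a signal: */
  task_isolation_interrupt("IRQ %d", irq);

  /* 4. Just before IPI'ing or TLB-flushing another cpu: */
  task_isolation_remote(cpu, "remote function call");

With that wiring in place, the generic code in this patch handles the
diagnostics, the signal delivery, and the deferred cleanup.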