On Mon, Jul 28, 2014 at 08:58:03PM +0200, Oleg Nesterov wrote:
> Off-topic, but...
>
> On 07/28, Oleg Nesterov wrote:
> >
> > But we should always call user_exit() unconditionally?
>
> Frederic, don't we need the patch below? In fact clear_() can be moved
> under "if ()" too. and probably copy_process() should clear this flag...
>
> Or. __context_tracking_task_switch() can simply do
>
> 	if (context_tracking_cpu_is_enabled())
> 		set_tsk_thread_flag(next, TIF_NOHZ);
> 	else
> 		clear_tsk_thread_flag(next, TIF_NOHZ);
>
> and then we can forget about copy_process(). Or I am totally confused?
>
>
> I am also wondering if we can extend user_return_notifier to handle
> enter/exit and kill TIF_NOHZ.
>
> Oleg.
>
> --- x/kernel/context_tracking.c
> +++ x/kernel/context_tracking.c
> @@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
>  				    struct task_struct *next)
>  {
>  	clear_tsk_thread_flag(prev, TIF_NOHZ);
> -	set_tsk_thread_flag(next, TIF_NOHZ);
> +	if (context_tracking_cpu_is_enabled())
> +		set_tsk_thread_flag(next, TIF_NOHZ);
>  }
>
>  #ifdef CONFIG_CONTEXT_TRACKING_FORCE

Unfortunately, as long as tasks can migrate in and out of a context
tracked CPU, we need to track all CPUs.

This is because there is always a small shift between hard and soft
kernelspace boundaries. Hard boundaries are the real strict boundaries:
"int", "iret" or faulting instructions, for example. Soft boundaries are
the places where we put our context tracking probes. They are just
function calls, and some distance between them and the hard boundaries
is inevitable.

So here is a scenario where this is a problem: a task runs on CPU 0,
passes the context tracking call before returning from a syscall to
userspace, and gets an interrupt. The interrupt preempts the task and
it moves to CPU 1. So it returns from preempt_schedule_irq(), after
which it is going to resume to userspace.
In this scenario, if context tracking is only enabled on CPU 1, we have
no way to know that the task is resuming to userspace, because we
already passed through the context tracking probe, and it was ignored
on CPU 0.

This might be hackable by ensuring that irqs stay disabled between the
context tracking call and the actual return to userspace. But that's a
nightmare to audit on all archs, it makes the context tracking callers
less flexible, and it only solves the issue for irqs. Exceptions have a
similar problem, and we can't mask them.