On Wed, Aug 10, 2016 at 05:33:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 9, 2016, at 12:13 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
>
> <snip>
>
> >
> > However, I'm thinking maybe we can use some tricks to avoid unnecessary
> > aborts-on-preemption.
> >
> > First of all, I notice we haven't made any constraints on what kind of
> > memory objects could be "protected" by rseq critical sections yet. And I
> > think this is something we should decide before adding this feature into
> > the kernel.
> >
> > We can do some optimization if we have some constraints. For example, if
> > the memory objects inside the rseq critical sections could only be
> > modified by userspace programs, we therefore don't need to abort
> > immediately on a userspace task -> kernel task context switch.
>
> The rseq_owner per-cpu variable and rseq_cpu field in task_struct you
> propose below would indeed take care of this scenario.
>
> >
> > Furthermore, if the memory objects inside the rseq critical sections
> > could only be modified by userspace programs that have registered their
> > rseq structures, we don't need to abort immediately on context
> > switches between two rseq-unregistered tasks, or between one
> > rseq-registered task and one rseq-unregistered task.
> >
> > Instead, we do tricks as follows:
> >
> > 	define a percpu pointer in the kernel:
> >
> > 	DEFINE_PER_CPU(struct task_struct *, rseq_owner);
> >
> > 	and a cpu field in struct task_struct:
> >
> > 	struct task_struct {
> > 	...
> > 	#ifdef CONFIG_RSEQ
> > 		struct rseq __user *rseq;
> > 		uint32_t rseq_event_counter;
> > 		int rseq_cpu;
> > 	#endif
> > 	...
> > 	};
> >
> > 	(task_struct::rseq_cpu should be initialized to -1.)
> >
> > 	Each time at sched out (in rseq_sched_out()), we do something like:
> >
> > 	if (prev->rseq) {
> > 		raw_cpu_write(rseq_owner, prev);
> > 		prev->rseq_cpu = smp_processor_id();
> > 	}
> >
> > 	Each time at sched in (in rseq_handle_notify_resume()), we do
> > 	something like:
> >
> > 	if (current->rseq &&
> > 	    (this_cpu_read(rseq_owner) != current ||
> > 	     current->rseq_cpu != smp_processor_id()))
> > 		__rseq_handle_notify_resume(regs);
> >
> > 	(We also need to modify rseq_signal_deliver() to call
> > 	__rseq_handle_notify_resume() directly.)
> >
> >
> > I think this could save some unnecessary aborts-on-preemption; however,
> > TBH, I'm too sleepy to verify every corner case. Will recheck this
> > tomorrow.
>
> This adds extra fields to the task struct, per-cpu rseq_owner pointers,
> and hooks into sched_in which are not needed otherwise, all this to
> eliminate unneeded abort-on-preemption.
>
> If we look at the single-stepping use-case, this means that gdb would
> only be able to single-step applications as long as neither itself, nor
> any of its libraries, use rseq. This seems to be quite fragile. I prefer
> requiring rseq users to implement a fallback to locking which progresses
> in every situation rather than adding complexity and overhead trying to
> lessen the odds of triggering the restart.
>
> Simply lessening the odds of triggering the restart without a design that
> ensures progress even in restart cases seems to make the lack-of-progress
> problem just harder to debug when it surfaces in real life.
>

Fair enough. I did my own research on the mechanism I proposed; the patch
is attached at the end of this email. Unfortunately, there is no noticeable
performance gain for the current benchmark. One possible reason may be that
the rseq critical sections in the current benchmark are quite small, which
makes retrying not that expensive.
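
To give a sense of the scale involved, below is a rough user-space sketch of
the kind of small critical section I have in mind. It is purely illustrative:
the my_rseq_*() helpers are made-up stubs standing in for the real struct rseq
TLS access and the architecture-specific commit sequence, and are not the
actual selftest API.

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Made-up stand-ins so the sketch builds on its own.  A real version
 * would read cpu_id/event_counter from the kernel-registered struct
 * rseq area and commit with the architecture-specific sequence the
 * kernel can abort.
 */
static __thread uint32_t fake_event_counter;

static int my_rseq_cpu(void)
{
	return sched_getcpu();
}

static uint32_t my_rseq_counter(void)
{
	return fake_event_counter;
}

static bool my_rseq_commit(intptr_t *p, intptr_t v, uint32_t snap)
{
	if (my_rseq_counter() != snap)
		return false;	/* preempted or signalled: abort */
	*p = v;			/* single-store commit */
	return true;
}

/* Increment a per-cpu slot; retry whenever the section was aborted. */
static void percpu_counter_inc(intptr_t *counters)
{
	for (;;) {
		uint32_t snap = my_rseq_counter();
		int cpu = my_rseq_cpu();

		if (my_rseq_commit(&counters[cpu], counters[cpu] + 1, snap))
			return;
		/*
		 * A retry only redoes a couple of loads, so an
		 * occasional abort-on-preemption is cheap.
		 */
	}
}

int main(void)
{
	static intptr_t counters[256];	/* assume at most 256 cpus here */

	percpu_counter_inc(counters);
	return 0;
}

With the real commit sequence each attempt is still only a handful of
instructions, so an extra abort mostly just costs one more trip around the
loop.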
From another angle, the small size of the critical sections may imply that, in
the current scenarios, abort-on-preemption doesn't hurt performance much. But
these are only my two cents.

> Thanks,
>
> Mathieu
>

---
 include/linux/sched.h | 18 +++++++++++++++---
 kernel/rseq.c         |  2 ++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5875fdd6edc8..c23e5dee9c60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1922,6 +1922,7 @@ struct task_struct {
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_event_counter;
+	int rseq_cpu;
 #endif
 	/* CPU-specific state of this task */
 	struct thread_struct thread;
@@ -3393,6 +3394,8 @@ void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
 #ifdef CONFIG_RSEQ
+DECLARE_PER_CPU(struct task_struct *, rseq_owner);
+
 static inline void rseq_set_notify_resume(struct task_struct *t)
 {
 	if (t->rseq)
@@ -3401,7 +3404,9 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
 void __rseq_handle_notify_resume(struct pt_regs *regs);
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (current->rseq &&
+	    (current != raw_cpu_read(rseq_owner) ||
+	     current->rseq_cpu != smp_processor_id()))
 		__rseq_handle_notify_resume(regs);
 }
 /*
@@ -3415,9 +3420,11 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 	if (clone_flags & CLONE_THREAD) {
 		t->rseq = NULL;
 		t->rseq_event_counter = 0;
+		t->rseq_cpu = -1;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_event_counter = current->rseq_event_counter;
+		t->rseq_cpu = -1;
 		rseq_set_notify_resume(t);
 	}
 }
@@ -3428,11 +3435,16 @@ static inline void rseq_execve(struct task_struct *t)
 }
 static inline void rseq_sched_out(struct task_struct *t)
 {
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		raw_cpu_write(rseq_owner, t);
+		t->rseq_cpu = smp_processor_id();
+		rseq_set_notify_resume(t);
+	}
 }
 static inline void rseq_signal_deliver(struct pt_regs *regs)
 {
-	rseq_handle_notify_resume(regs);
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
 }
 #else
 static inline void rseq_set_notify_resume(struct task_struct *t)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 7e4d1d0e9520..0390a57ef0e5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -100,6 +100,8 @@
  * F2. Return false.
  */
 
+DEFINE_PER_CPU(struct task_struct *, rseq_owner);
+
 /*
  * The rseq_event_counter allow user-space to detect preemption and
  * signal delivery. It increments at least once before returning to
-- 
2.9.0