On Wed, Aug 10, 2016 at 05:33:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 9, 2016, at 12:13 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
>
> <snip>
>
> >
> > However, I'm thinking maybe we can use some tricks to avoid unnecessary
> > aborts-on-preemption.
> >
> > First of all, I notice we haven't made any constraints on what kind of
> > memory objects could be "protected" by rseq critical sections yet. And I
> > think this is something we should decide before adding this feature into
> > the kernel.
> >
> > We can do some optimization if we have some constraints. For example, if
> > the memory objects inside the rseq critical sections could only be
> > modified by userspace programs, we therefore don't need to abort
> > immediately on a userspace task -> kernel task context switch.
>
> The rseq_owner per-cpu variable and rseq_cpu field in task_struct you
> propose below would indeed take care of this scenario.
>
> >
> > Furthermore, if the memory objects inside the rseq critical sections
> > could only be modified by userspace programs that have registered their
> > rseq structures, we don't need to abort immediately on context
> > switches between two rseq-unregistered tasks, or between one
> > rseq-registered task and one rseq-unregistered task.
> >
> > Instead, we do tricks as follows:
> >
> > 	define a percpu pointer in the kernel:
> >
> > 	DEFINE_PER_CPU(struct task_struct *, rseq_owner);
> >
> > 	and a cpu field in struct task_struct:
> >
> > 	struct task_struct {
> > 	...
> > 	#ifdef CONFIG_RSEQ
> > 		struct rseq __user *rseq;
> > 		uint32_t rseq_event_counter;
> > 		int rseq_cpu;
> > 	#endif
> > 	...
> > 	};
> >
> > 	(task_struct::rseq_cpu should be initialized to -1.)
> >
> > 	Each time at sched out (in rseq_sched_out()), we do something like:
> >
> > 	if (prev->rseq) {
> > 		raw_cpu_write(rseq_owner, prev);
> > 		prev->rseq_cpu = smp_processor_id();
> > 	}
> >
> > 	Each time at sched in (in rseq_handle_notify_resume()), we do
> > 	something like:
> >
> > 	if (current->rseq &&
> > 	    (this_cpu_read(rseq_owner) != current ||
> > 	     current->rseq_cpu != smp_processor_id()))
> > 		__rseq_handle_notify_resume(regs);
> >
> > 	(We also need to modify rseq_signal_deliver() to call
> > 	__rseq_handle_notify_resume() directly.)
> >
> >
> > I think this could save some unnecessary aborts-on-preemption; however,
> > TBH, I'm too sleepy to verify every corner case. Will recheck this
> > tomorrow.
>
> This adds extra fields to the task struct, per-cpu rseq_owner pointers,
> and hooks into sched_in which are not needed otherwise, all this to
> eliminate unneeded abort-on-preemption.
>
> If we look at the single-stepping use-case, this means that gdb would
> only be able to single-step applications as long as neither itself, nor
> any of its libraries, use rseq. This seems to be quite fragile. I prefer
> requiring rseq users to implement a fallback to locking which progresses
> in every situation rather than adding complexity and overhead trying to
> lessen the odds of triggering the restart.
>
> Simply lessening the odds of triggering the restart without a design that
> ensures progress even in restart cases seems to make the lack-of-progress
> problem just harder to debug when it surfaces in real life.
>

Fair enough. I did my own research on the mechanism I proposed; the patch
is attached at the end of this email. Unfortunately, there is no noticeable
performance gain for the current benchmark. One possible reason may be that
the rseq critical sections in the current benchmark are quite small, which
makes retrying not that expensive.
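
To give a sense of the scale involved, below is a rough user-space sketch of
the kind of small critical section I have in mind. It is purely illustrative:
the my_rseq_*() helpers are made-up stubs standing in for the real struct rseq
TLS access and the architecture-specific commit sequence, and are not the
actual selftest API.

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Made-up stand-ins so the sketch builds on its own.  A real version
 * would read cpu_id/event_counter from the kernel-registered struct
 * rseq area and commit with the architecture-specific sequence the
 * kernel can abort.
 */
static __thread uint32_t fake_event_counter;

static int my_rseq_cpu(void)
{
	return sched_getcpu();
}

static uint32_t my_rseq_counter(void)
{
	return fake_event_counter;
}

static bool my_rseq_commit(intptr_t *p, intptr_t v, uint32_t snap)
{
	if (my_rseq_counter() != snap)
		return false;	/* preempted or signalled: abort */
	*p = v;			/* single-store commit */
	return true;
}

/* Increment a per-cpu slot; retry whenever the section was aborted. */
static void percpu_counter_inc(intptr_t *counters)
{
	for (;;) {
		uint32_t snap = my_rseq_counter();
		int cpu = my_rseq_cpu();

		if (my_rseq_commit(&counters[cpu], counters[cpu] + 1, snap))
			return;
		/*
		 * A retry only redoes a couple of loads, so an
		 * occasional abort-on-preemption is cheap.
		 */
	}
}

int main(void)
{
	static intptr_t counters[256];	/* assume at most 256 cpus here */

	percpu_counter_inc(counters);
	return 0;
}

With the real commit sequence each attempt is still only a handful of
instructions, so an extra abort mostly just costs one more trip around the
loop.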
From another angle, the small size of the critical sections may imply that, in
the current scenarios, abort-on-preemption doesn't hurt performance much. But
these are only my two cents.

> Thanks,
>
> Mathieu
>

---
 include/linux/sched.h | 18 +++++++++++++++---
 kernel/rseq.c         |  2 ++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5875fdd6edc8..c23e5dee9c60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1922,6 +1922,7 @@ struct task_struct {
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_event_counter;
+	int rseq_cpu;
 #endif
 	/* CPU-specific state of this task */
 	struct thread_struct thread;
@@ -3393,6 +3394,8 @@ void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
 #ifdef CONFIG_RSEQ
+DECLARE_PER_CPU(struct task_struct *, rseq_owner);
+
 static inline void rseq_set_notify_resume(struct task_struct *t)
 {
 	if (t->rseq)
@@ -3401,7 +3404,9 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
 void __rseq_handle_notify_resume(struct pt_regs *regs);
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (current->rseq &&
+	    (current != raw_cpu_read(rseq_owner) ||
+	     current->rseq_cpu != smp_processor_id()))
 		__rseq_handle_notify_resume(regs);
 }
 /*
@@ -3415,9 +3420,11 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 	if (clone_flags & CLONE_THREAD) {
 		t->rseq = NULL;
 		t->rseq_event_counter = 0;
+		t->rseq_cpu = -1;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_event_counter = current->rseq_event_counter;
+		t->rseq_cpu = -1;
 		rseq_set_notify_resume(t);
 	}
 }
@@ -3428,11 +3435,16 @@ static inline void rseq_execve(struct task_struct *t)
 }
 static inline void rseq_sched_out(struct task_struct *t)
 {
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		raw_cpu_write(rseq_owner, t);
+		t->rseq_cpu = smp_processor_id();
+		rseq_set_notify_resume(t);
+	}
 }
 static inline void rseq_signal_deliver(struct pt_regs *regs)
 {
-	rseq_handle_notify_resume(regs);
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
 }
 #else
 static inline void rseq_set_notify_resume(struct task_struct *t)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 7e4d1d0e9520..0390a57ef0e5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -100,6 +100,8 @@
  * F2. Return false.
  */
 
+DEFINE_PER_CPU(struct task_struct *, rseq_owner);
+
 /*
  * The rseq_event_counter allow user-space to detect preemption and
  * signal delivery. It increments at least once before returning to
-- 
2.9.0