On Tue, Aug 9, 2016 at 9:13 AM, Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> On Wed, Aug 03, 2016 at 10:03:32PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
>> > On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>> >> >> We really care about preemption here. Every migration implies a
>> >> >> preemption from a user-space perspective. If we would only care
>> >> >> about keeping the CPU id up-to-date, hooking into migration would be
>> >> >> enough. But since we want atomicity guarantees for restartable
>> >> >> sequences, we need to hook into preemption.
>> >> >
>> >> >> It allows user-space to perform update operations on per-cpu data without
>> >> >> requiring heavy-weight atomic operations.
>> >> >
>> >> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>> >> >
>> >> > It is however on PPC and possibly other architectures, so in name of
>> >> > simplicity supporting only the one variant makes sense.
>> >> >
>> >>
>> >> I wouldn't want to depend on CMPXCHG. But imagine we had primitives
>> >> that were narrower than the full abort-on-preemption primitive.
>> >> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
>> >> *aptr != aval). We could do things like:
>> >>
>> >> expected_cpu = cpu;
>> >> aval = NULL; // disarm for now
>> >> begin();
>> >> aval = event_count[cpu] + 1;
>> >> event_count[cpu] = aval;
>> >> event_count[cpu]++;
>> >
>> > This line is redundant, right? Because it will guarantee a failure even
>> > in no-contention cases.
>> >
>> >>
>> >> ... compute something ...
>> >>
>> >> // arm the rest of it
>> >> aptr = &event_count[cpu];
>> >> if (*aptr != aval)
>> >>     goto fail;
>> >>
>> >> *thing_im_writing = value_i_computed;
>> >> end();
>> >>
>> >> The idea here is that we don't rely on the scheduler to increment the
>> >> event count at all, which means that we get to determine the scope of
>> >> what kinds of access conflicts we care about ourselves.
>> >>
>> >
>> > If we increase the event count in userspace, how could we prevent two
>> > userspace threads from racing on the event_count[cpu] field? For
>> > example:
>> >
>> > CPU 0
>> > ================
>> > {event_count[0] is initially 0}
>> >
>> > [Thread 1]
>> > begin();
>> > aval = event_count[cpu] + 1; // 1
>> >
>> > (preempted)
>> > [Thread 2]
>> > begin();
>> > aval = event_count[cpu] + 1; // 1, too
>> > event_count[cpu] = aval; // event_count[0] is 1
>> >
>>
>> You're right :( This would work with an xadd instruction, but that's
>> very slow and doesn't exist on most architectures. It could also work
>> if we did:
>>
>> aval = some_tls_value++;
>>
>> where some_tls_value is set up such that no two threads could ever end
>> up with the same values (using high bits as thread ids, perhaps), but
>> that's messy. Maybe my idea is no good.
>
> This is a little more complex, plus I failed to find a way to do an
> atomic "if (*aptr == aval) *b = c" in userspace ;-(
> But the kernel might be able to help using something similar to this patchset.
>
> However, I'm thinking maybe we can use some tricks to avoid unnecessary
> aborts-on-preemption.
>
> First of all, I notice we haven't made any constraints on what kinds of
> memory objects could be "protected" by rseq critical sections yet. And I
> think this is something we should decide before adding this feature into
> the kernel.
>
> We can do some optimization if we have some constraints.
> For example, if
> the memory objects inside the rseq critical sections could only be
> modified by userspace programs, we therefore wouldn't need to abort
> immediately on a userspace task -> kernel task context switch.

True, although trying to do a syscall in an rseq critical section seems
like a bad idea in general.

>
> Furthermore, if the memory objects inside the rseq critical sections
> could only be modified by userspace programs that have registered their
> rseq structures, we wouldn't need to abort immediately on context
> switches between two rseq-unregistered tasks, or between one
> rseq-registered task and one rseq-unregistered task.
>
> Instead, we do tricks as follows:
>
> defining a percpu pointer in the kernel:
>
> DEFINE_PER_CPU(struct task_struct *, rseq_owner);
>
> and a cpu field in struct task_struct:
>
> struct task_struct {
>     ...
> #ifdef CONFIG_RSEQ
>     struct rseq __user *rseq;
>     uint32_t rseq_event_counter;
>     int rseq_cpu;
> #endif
>     ...
> };
>
> (task_struct::rseq_cpu should be initialized to -1.)
>
> Each time at sched out (in rseq_sched_out()), we do something like:
>
> if (prev->rseq) {
>     raw_cpu_write(rseq_owner, prev);
>     prev->rseq_cpu = smp_processor_id();
> }
>
> Each time at sched in (in rseq_handle_notify_resume()), we do something
> like:
>
> if (current->rseq &&
>     (this_cpu_read(rseq_owner) != current ||
>      current->rseq_cpu != smp_processor_id()))
>     __rseq_handle_notify_resume(regs);
>
> (We would also need to modify rseq_signal_deliver() to call
> __rseq_handle_notify_resume() directly.)
>
> I think this could save some unnecessary aborts-on-preemption; however,
> TBH, I'm too sleepy to verify every corner case. Will recheck this
> tomorrow.

Interesting. That could help a bit, although it would help less if
everyone started using rseq.

But do we need to protect MAP_SHARED objects? If not, maybe we could
only track context switches between different tasks sharing the same mm.
--Andy