----- Original Message -----
> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages@xxxxxxxxx> wrote:
> > [CC += linux-api@]
> >
> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> Expose a new system call allowing userspace threads to register
> >> a TLS area used as an ABI between the kernel and userspace to
> >> share information required to create efficient per-cpu critical
> >> sections in user-space.
> >>
> >> This ABI consists of a thread-local structure containing:
> >>
> >> - a nesting count surrounding the critical section,
> >> - a signal number to be sent to the thread when it is preempted
> >>   with a non-zero nesting count,
> >> - a flag indicating whether the signal has been sent within the
> >>   critical section,
> >> - an integer in which to store the current CPU number, updated
> >>   whenever the thread is preempted. This CPU number cache is not
> >>   strictly needed, but performs better than the getcpu vdso.
> >>
> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> >> on percpu atomics, which lets the kernel handle restart of critical
> >> sections, ref.
> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >>
> >> What is done differently here compared to percpu atomics: we track
> >> a single nesting counter per thread rather than many ranges of
> >> instruction pointer values. We deliver a signal to user-space and
> >> let the restart logic be handled in user-space, thus moving the
> >> complexity out of the kernel. The nesting counter approach allows
> >> us to skip the complexity of interacting with signals that would
> >> otherwise be needed with the percpu atomics approach, which needs
> >> to know which instruction pointers are preempted, including when
> >> preemption occurs in a signal handler nested over an instruction
> >> pointer of interest.
>
> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> unable to convince myself that the kernel needs to help at all. To do
> this without kernel help, I want to relax the requirements slightly.
> With true per-cpu atomic sections, you have a guarantee that you are
> either really running on the same CPU for the entire duration of the
> atomic section or you abort. I propose a weaker primitive: you
> acquire one of an array of locks (probably one per cpu), and you are
> guaranteed that, if you don't abort, no one else acquires the same
> lock while you hold it.

In my proof of concept (https://github.com/compudj/percpu-dev) I
actually implement an array of per-cpu locks. The issue here boils
down to grabbing this per-cpu lock efficiently. Once the lock is
taken, the thread has exclusive access to that per-cpu lock, even if
it migrates.

> Here's how:
>
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
>
> This has a few benefits over the in-kernel approach:
>
> 1. No kernel patch.
>
> 2. No unnecessary abort if you are preempted in favor of a thread that
> doesn't contend for your lock.
>
> 3. Greatly improved debuggability.
>
> 4. With long critical sections and heavy load, you can improve
> performance by having several locks per cpu and choosing one at
> random.
>
> Is there a reason that a scheme like this doesn't work?

What do you mean exactly by "atomically check that the lock is not
held and, if so, mark it held"? Do you imply using a lock-prefixed
atomic operation? The goal of this whole restart-section approach is
to allow grabbing a lock (or doing other sequences of operations
ending with a single store) on per-cpu data without having to use slow
lock-prefixed atomic operations.
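So we are sure we are discussing the same thing, here is how I read
your acquire path, as rough C. The packed tid/count word, the choice
of SIGUSR1, and all the names below are my own guesses to make the
question concrete, not your design:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#define NR_LOCKS	64	/* assuming one lock per possible cpu */

/* Low 32 bits: holder tid (0 == unheld). High 32 bits: acquisition count. */
static uint64_t lock_state[NR_LOCKS];

static bool try_acquire(uint32_t my_tid, uint32_t my_count)
{
	int cpu = sched_getcpu();
	uint64_t old, new;

	old = __atomic_load_n(&lock_state[cpu], __ATOMIC_RELAXED);
	if ((uint32_t)old != 0) {
		/*
		 * Lock appears held: signal the holder, telling it (via
		 * a side channel not shown) which acquisition count is
		 * being aborted, then try to steal the lock below. The
		 * cmpxchg only succeeds if the state word is unchanged.
		 */
		kill((pid_t)(uint32_t)old, SIGUSR1);
	}
	new = ((uint64_t)my_count << 32) | my_tid;
	/*
	 * Even on the uncontended path, this compare-and-exchange is a
	 * lock-prefixed instruction on x86.
	 */
	return __atomic_compare_exchange_n(&lock_state[cpu], &old, new,
					   false, __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}

If that reading is right, the cmpxchg sits on the fast path of every
acquisition, which is exactly the cost this patch tries to eliminate.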
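For contrast, here is the shape of a critical section under the
proposed TLS ABI. Again only a sketch: the field names are
illustrative rather than the exact ABI, and a real implementation must
also close the race window between the flag check and the final store,
e.g. by having the user-space signal handler move the interrupted
instruction pointer back to the restart point:

#include <stdint.h>

#define NR_CPUS		64	/* illustrative bound */

struct percpu_user_tls {
	volatile uint32_t nesting;	/* non-zero inside a critical section */
	volatile uint32_t signal_sent;	/* set by the signal handler */
	uint32_t signo;			/* signal the kernel sends on preemption */
	volatile uint32_t current_cpu;	/* kept up to date by the kernel */
};

/* Registered with the kernel through the new system call. */
static __thread struct percpu_user_tls tls_area;

static unsigned long per_cpu_count[NR_CPUS];

static void percpu_count_inc(void)
{
	uint32_t cpu;
	unsigned long v;

restart:
	tls_area.nesting++;		/* enter critical section */
	cpu = tls_area.current_cpu;	/* plain TLS load, no vdso call */
	v = per_cpu_count[cpu];
	/* ... arbitrary computation on this cpu's data ... */
	if (tls_area.signal_sent) {	/* preempted mid-section: retry */
		tls_area.signal_sent = 0;
		tls_area.nesting--;
		goto restart;
	}
	per_cpu_count[cpu] = v + 1;	/* commit: a single plain store */
	tls_area.nesting--;		/* leave critical section */
}

The point is that the whole sequence, including the commit, uses only
plain loads and stores; the kernel's signal on preemption is what
makes discarding and redoing the section safe.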
> >> Benchmarking sched_getcpu() vs the TLS cache approach. Getting the
> >> current CPU number:
> >>
> >> - With Linux vdso: 12.7 ns
> >> - With TLS-cached cpu number: 0.3 ns
>
> Slightly off-topic: try this again on a newer kernel. The vdso should
> have gotten a bit faster in 3.19 or 4.0 IIRC.

Those benchmarks were done on Linux 4.0.

Thanks!

Mathieu

> --Andy

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com