----- Original Message -----
> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages@xxxxxxxxx> wrote:
> > [CC += linux-api@]
> >
> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> Expose a new system call allowing userspace threads to register
> >> a TLS area used as an ABI between the kernel and userspace to
> >> share information required to create efficient per-cpu critical
> >> sections in user-space.
> >>
> >> This ABI consists of a thread-local structure containing:
> >>
> >> - a nesting count surrounding the critical section,
> >> - a signal number to be sent to the thread when it is preempted
> >>   with a non-zero nesting count,
> >> - a flag indicating whether the signal has been sent within the
> >>   critical section,
> >> - an integer in which to store the current CPU number, updated
> >>   whenever the thread is preempted. This CPU number cache is not
> >>   strictly needed, but performs better than the getcpu vdso.
> >>
> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> >> on percpu atomics, which lets the kernel handle restart of critical
> >> sections, ref.
> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >>
> >> What is done differently here compared to percpu atomics: we track
> >> a single nesting counter per thread rather than many ranges of
> >> instruction pointer values. We deliver a signal to user-space and
> >> let the restart logic be handled in user-space, thus moving the
> >> complexity out of the kernel. The nesting counter approach allows
> >> us to skip the complexity of interacting with signals that would
> >> otherwise be needed with the percpu atomics approach, which needs
> >> to know which instruction pointers are preempted, including when
> >> preemption occurs in a signal handler nested over an instruction
> >> pointer of interest.
>
> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> unable to convince myself that the kernel needs to help at all. To do
> this without kernel help, I want to relax the requirements slightly.
> With true per-cpu atomic sections, you have a guarantee that you are
> either really running on the same CPU for the entire duration of the
> atomic section or you abort. I propose a weaker primitive: you
> acquire one of an array of locks (probably one per cpu), and you are
> guaranteed that, if you don't abort, no one else acquires the same
> lock while you hold it.

In my proof of concept (https://github.com/compudj/percpu-dev) I
actually implement an array of per-cpu locks. The issue here boils
down to grabbing this per-cpu lock efficiently. Once the lock is
taken, the thread has exclusive access to that per-cpu lock, even if
it migrates.

> Here's how:
>
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
>
> This has a few benefits over the in-kernel approach:
>
> 1. No kernel patch.
>
> 2. No unnecessary abort if you are preempted in favor of a thread that
> doesn't contend for your lock.
>
> 3. Greatly improved debuggability.
>
> 4. With long critical sections and heavy load, you can improve
> performance by having several locks per cpu and choosing one at
> random.
>
> Is there a reason that a scheme like this doesn't work?

What do you mean exactly by "atomically check that the lock is not
held and, if so, mark it held"? Do you imply using a lock-prefixed
atomic operation? The goal of this whole restart-section approach is
to allow grabbing a lock (or doing other sequences of operations
ending with a single store) on per-cpu data without having to use slow
lock-prefixed atomic operations.
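So we are sure we are discussing the same thing, here is how I read
your acquire path, as rough C. The packed tid/count word, the choice
of SIGUSR1, and all the names below are my own guesses to make the
question concrete, not your design:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#define NR_LOCKS	64	/* assuming one lock per possible cpu */

/* Low 32 bits: holder tid (0 == unheld). High 32 bits: acquisition count. */
static uint64_t lock_state[NR_LOCKS];

static bool try_acquire(uint32_t my_tid, uint32_t my_count)
{
	int cpu = sched_getcpu();
	uint64_t old, new;

	old = __atomic_load_n(&lock_state[cpu], __ATOMIC_RELAXED);
	if ((uint32_t)old != 0) {
		/*
		 * Lock appears held: signal the holder, telling it (via
		 * a side channel not shown) which acquisition count is
		 * being aborted, then try to steal the lock below. The
		 * cmpxchg only succeeds if the state word is unchanged.
		 */
		kill((pid_t)(uint32_t)old, SIGUSR1);
	}
	new = ((uint64_t)my_count << 32) | my_tid;
	/*
	 * Even on the uncontended path, this compare-and-exchange is a
	 * lock-prefixed instruction on x86.
	 */
	return __atomic_compare_exchange_n(&lock_state[cpu], &old, new,
					   false, __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}

If that reading is right, the cmpxchg sits on the fast path of every
acquisition, which is exactly the cost this patch tries to eliminate.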
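For contrast, here is the shape of a critical section under the
proposed TLS ABI. Again only a sketch: the field names are
illustrative rather than the exact ABI, and a real implementation must
also close the race window between the flag check and the final store,
e.g. by having the user-space signal handler move the interrupted
instruction pointer back to the restart point:

#include <stdint.h>

#define NR_CPUS		64	/* illustrative bound */

struct percpu_user_tls {
	volatile uint32_t nesting;	/* non-zero inside a critical section */
	volatile uint32_t signal_sent;	/* set by the signal handler */
	uint32_t signo;			/* signal the kernel sends on preemption */
	volatile uint32_t current_cpu;	/* kept up to date by the kernel */
};

/* Registered with the kernel through the new system call. */
static __thread struct percpu_user_tls tls_area;

static unsigned long per_cpu_count[NR_CPUS];

static void percpu_count_inc(void)
{
	uint32_t cpu;
	unsigned long v;

restart:
	tls_area.nesting++;		/* enter critical section */
	cpu = tls_area.current_cpu;	/* plain TLS load, no vdso call */
	v = per_cpu_count[cpu];
	/* ... arbitrary computation on this cpu's data ... */
	if (tls_area.signal_sent) {	/* preempted mid-section: retry */
		tls_area.signal_sent = 0;
		tls_area.nesting--;
		goto restart;
	}
	per_cpu_count[cpu] = v + 1;	/* commit: a single plain store */
	tls_area.nesting--;		/* leave critical section */
}

The point is that the whole sequence, including the commit, uses only
plain loads and stores; the kernel's signal on preemption is what
makes discarding and redoing the section safe.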
> >> Benchmarking sched_getcpu() vs the TLS cache approach. Getting the
> >> current CPU number:
> >>
> >> - With Linux vdso: 12.7 ns
> >> - With TLS-cached cpu number: 0.3 ns
>
> Slightly off-topic: try this again on a newer kernel. The vdso should
> have gotten a bit faster in 3.19 or 4.0 IIRC.

Those benchmarks were done on Linux 4.0.

Thanks!

Mathieu

> --Andy

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com