Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex

On 07/21/2014 09:01 PM, Thomas Gleixner wrote:
On Mon, 21 Jul 2014, Darren Hart wrote:
On 7/21/14, 14:47, "Thomas Gleixner" <tglx@xxxxxxxxxxxxx> wrote:

On Mon, 21 Jul 2014, Andy Lutomirski wrote:
On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
All this is predicated on the fact that syscalls are 'expensive'.
Weren't syscalls only 100s of cycles? All this bitmap mucking is far
more expensive due to cacheline misses, which due to the size of the
things is almost guaranteed.
120 - 300 cycles for me, unless tracing happens, and I'm working on
reducing the incidence of tracing.
So it's a non-issue indeed, and definitely not worth the trouble of
that extra storage, the scheduler overhead, etc.

Summary: Once you can't take it atomically in user space, you've lost
	 anyway. And we are better off doing the magic spinning in the
	 kernel, where we have all the information accessible already.
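
[ For illustration only: the point about the atomic user space fast
  path can be sketched with the classic futex lock idiom. None of this
  is from the patch set; the three-state protocol and the helper names
  are just the textbook pattern. The uncontended acquire is a single
  atomic operation with no syscall, and only on contention does the
  thread enter the kernel via FUTEX_WAIT. ]

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static void sys_futex_wait(atomic_int *uaddr, int expected)
{
	/* Sleep in the kernel until *uaddr != expected or a wakeup. */
	syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

/* futex word: 0 = unlocked, 1 = locked, 2 = locked with waiters */
static void futex_lock(atomic_int *f)
{
	int zero = 0;

	/* Fast path: uncontended, taken atomically in user space. */
	if (atomic_compare_exchange_strong(f, &zero, 1))
		return;

	/* Slow path: mark contended and block in the kernel. */
	while (atomic_exchange(f, 2) != 0)
		sys_futex_wait(f, 2);
}

static void futex_unlock(atomic_int *f)
{
	/* Only enter the kernel if someone may be sleeping (state 2). */
	if (atomic_exchange(f, 0) == 2)
		syscall(SYS_futex, f, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}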
And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
your and others' comments:

http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock/v7


I can work on forward porting this series to current mainline (post recent
security fixes) and cleaning up the commentary and such if people are
interested in seeing this implementation (based on Peter Z's spinning
mutex work iirc) resurrected...
No. We really want to avoid more magic hackery in the futex code.

Let's sit down and think about what we need:

  1) Support for all existing features

  2) More fine grained locking

  3) Optimistic spinning

@1: No discussion about that. Period.

     We are not going to introduce some new futex opcode which is
     optimized for a microbenchmark while ignoring all of the pain we
     went through in the last 10 years. No way, especially after the
     recent security disaster.

@2 and @3: Yes, we want that.

     But again, we don't add fine grained locking just for some half
     baked new opcode. No, we are adding it where it makes the most
     sense and lets us reuse most of the code.

     I can predict your question: how is that supposed to work? :)

     If we want to have fine grained locking beyond the futex hash
     buckets (and that's something we discussed before), you need a
     state representation of the user space futex in the kernel.
     That's what Waiman added as well, just in a way that's beyond
     repair.

     The charm of the futex hash buckets is that they do not create
     kernel state because the futex_q which is enqueued into the bucket
     is on the task stack of the task which is blocked on the futex.

     So much for the theory, because that stopped being true when we
     added support for PI futexes. They already create kernel internal
     state which is not magically removed when the thread exits the
     syscall. It's called pi_state. (A simplified sketch of the
     difference follows below.)

     Now, is this related to the non-PI magic spinning optimization?
     Very much so.

     Simply because the handling of pi_state already covers ALL the
     corner cases which come with the extra kernel internal state and
     they provide the full function set of futexes required by the
     various (ab)use cases including the requeue functionality which is
     required for condvars. It even supports caching of the pi_state
     struct in a way which makes sense, rather than blindly slapping a
     cached entry onto each hash bucket.

     So it's not a really mind boggling thought experiment to think
     about this magic new feature as a subset of the already existing
     PI futex code. It is a subset, just minus the PI portion plus the
     more fine grained locking.
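
[ For illustration, a heavily simplified sketch of that difference.
  These are not the real definitions from kernel/futex.c (the real
  futex_q uses a priority list and carries keys, bitsets etc.), but
  the ownership model is the point: the non-PI waiter's futex_q lives
  on its own kernel stack and vanishes when the syscall returns, while
  pi_state is allocated, refcounted and outlives the syscall. ]

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/sched.h>

struct futex_hash_bucket_sketch {
	spinlock_t lock;
	struct list_head chain;		/* waiters hang here */
};

struct futex_q_sketch {
	struct list_head list;		/* linked into hb->chain */
	struct task_struct *task;	/* the blocked waiter */
};

static void futex_wait_sketch(struct futex_hash_bucket_sketch *hb)
{
	struct futex_q_sketch q = { .task = current };

	spin_lock(&hb->lock);
	list_add_tail(&q.list, &hb->chain);	/* no allocation: q is on
						   this task's stack */
	spin_unlock(&hb->lock);

	/* ... recheck the user space value, schedule() until woken,
	 * then unqueue.  Nothing is left behind after the syscall. */
}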

So the right solution is to rework the existing code.

1) Make pi_state the entity which gets enqueued into the hash bucket
    and manage the waiters in a pi_state internal queue.

    Straightforward problem, as it is still protected by the hash
    bucket lock.

2) Add private locking to the pi_state to reduce the lock contention
    on the hash bucket lock.

    A straightforward conversion as well. (A rough structural sketch
    of 1) and 2) follows after point 4 below.)

3) Make mutex a subset of rt_mutex

    That makes a lot of sense, simply because a mutex is the same as an
    rtmutex, just without the PI part.

    Just for the record, before anyone cries murder: the original mutex
    implementation was modeled after the rtmutex one in the first
    place, and the optimistic spinning was first introduced on
    rtmutexes by Gregory Haskins et al. in the context of the RT tree.
    Back then it was too early, or we were simply too stupid, to see
    the similarities, and we opted for a completely separate
    implementation.

    Yes, I'm aware that this is a non-trivial problem, but if you look
    at the details, then the non-trivial part is in the slow path where
    stuff actually goes to sleep. The fast path of mutex and rt_mutex
    is simply the same. So there is no reason to keep them separate. An
    extra conditional in the slow path is not going to hurt at all.

    Surely, you might say, I'm an egoistic bastard who just wants to
    have the benefit of MCS for rtmutex for free. Right you are, but
    there is a whole bunch of reasons why this makes tons of sense:

    - As I said above, it's almost the same, except for the slow path
      details, where a few extra conditionals do not matter

    - Sharing the code has the obvious benefits. And it's not only
      sharing between rtmutex and mutex, it's also sharing the
      underlying optimizations of MCS for the kernel internal users and
      the user space exposed ones, i.e. futexes.

    - Once unified, the desired futex functionality just works with a
      minimal impact on the futex code itself.

      The futex code does not care whether the underlying primitive has
      PI semantics or not, it just cares that all the required features
      like proxy locking etc. are in place. And that everything like
      timeouts, requeue etc. is just supported.

    - The semantics of the new futex functionality do not have to be
      defined again. We can simply reuse the rather well defined and
      well tested, straightforward semantics of the existing PI
      opcodes.

    - That minimizes the risk massively, because all state and error
      handling is common code.

    - Any update to the underlying primitives will benefit ALL usage
      sites and reduce the copy-and-paste-induced hell of MCS code
      sprinkled all over the kernel.

4) Enable user space to make use of it.

    Whether this will be a separate opcode or just a new flag to the
    existing PI opcodes is going to be just a bikeshedding question at
    that point. :)
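
[ Purely as a strawman to visualize 1) and 2) above; the names and
  layout here are invented and are not a proposal for the actual
  structures. The per-futex state object becomes the thing that is
  enqueued into the hash bucket, it carries its own lock and waiter
  queue, and the hash bucket lock is then only needed to look the
  object up or to insert and remove it. ]

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>

struct futex_state_sketch {
	struct list_head hb_list;	/* linked into the hash bucket  */
	spinlock_t lock;		/* private per-futex lock, 2)   */
	struct list_head waiters;	/* per-futex wait queue, 1)     */
	atomic_t refcount;		/* cached/reused across waiters */
	/* the PI bits (rt_mutex, owner) would hang off here as well */
};

struct futex_hash_bucket_sketch {
	spinlock_t lock;		/* now only guards .states      */
	struct list_head states;	/* futex_state_sketch objects   */
};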

The important point is that we do NOT grow any new unmaintainable
warts on the futex code. And the 650+ new lines of hackery which come
with the proposed patch set are in no way acceptable, especially as
they only cover the minimalistic functionality set of futexes. Not to
mention the other issues I've observed at first glance.

So before anyone comes up with a "solution" for all of this tomorrow
afternoon in the form of another half-baked patch, please sit back,
mull it over, and let's have a proper discussion about the approach
first.

Thanks,

	tglx

Thanks for your thorough analysis and suggestions on how to support
spinning futexes. You certainly know more about the internal workings
of futexes than most of us, and I can live with what you have
suggested. My patch set is just a proof of concept to demonstrate that
optimistic spinning on futexes is worthwhile, and I think I have
achieved my goal of stirring up interest in this area. My next step
will be to look in the direction you have suggested and figure out
what actual code changes will be needed.

Thanks again for your help.

-Longman




