On 07/21/2014 09:01 PM, Thomas Gleixner wrote:
On Mon, 21 Jul 2014, Darren Hart wrote:
On 7/21/14, 14:47, "Thomas Gleixner"<tglx@xxxxxxxxxxxxx> wrote:
On Mon, 21 Jul 2014, Andy Lutomirski wrote:
On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra<peterz@xxxxxxxxxxxxx>
wrote:
All this is predicated on the fact that syscalls are 'expensive'.
Weren't syscalls only 100s of cycles? All this bitmap mucking is far
more expensive due to cacheline misses, which due to the size of the
things is almost guaranteed.
120 - 300 cycles for me, unless tracing happens, and I'm working on
reducing the incidence of tracing.
So it's a non issue indeed and definitely not worth the trouble of
that extra storage, the scheduler overhead, etc.
Summary: Once you can't take it atomically in user space, you've lost
anyway. And we are better off to do the magic spinning in
kernel where we have all the information accessible already.
And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
your and other's comments:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/f
utex-lock/v7
I can work on forward porting this series to current mainline (post recent
security fixes) and cleaning up the commentary and such if people are
interested in seeing this implementation (based on Peter Z's spinning
mutex work iirc) resurrected...
No. We really want to avoid more magic hackery in the futex code.
Lets sit down and think about what we need:
1) Support for all existing features
2) More fine grained locking
3) Optimistic spinning
@1: No discussion about that. Period.
We are not going to introduce some new futex opcode, which is
optimized for a microbenchmark and ignoring all of the pain we
went through in the last 10 years. No way, especially after the
recent security disaster.
@2 and @3: Yes, we want that.
But again, we don't add fine grained locking just for some half
baken new opcode. No, we are adding it where it makes the most
sense and lets us reuse most of the code.
I can predict your question, how that should work :)
If we want to have fine grained locking beyond the futex hash
buckets and that's something we discussed before, you need a state
representation of the user space futex in the kernel. That's what
Waiman added as well, just in a way that's beyond repair.
The charm of the futex hash buckets is that they do not create
kernel state because the futex_q which is enqueued into the bucket
is on the task stack of the task which is blocked on the futex.
So much for the theory, because that stopped to be true when we
added support for PI futexes. They already create kernel internal
state which is not magically removed when the thread exits the
syscall. It's called pi_state.
Now the question is how is this related to the non PI magic
spinning optimization? Very much so.
Simply because the handling of pi_state already covers ALL the
corner cases which come with the extra kernel internal state and
they provide the full function set of futexes required by the
various (ab)use cases including the requeue functionality which is
required for condvars. It even supports caching of the pi_state
struct in a way which makes sense, rather than blindly slapping a
cached entry onto each hash bucket.
So it's not a really mind boggling thought experiment to think
about this magic new feature as a subset of the already existing
PI futex code. It is a subset, just minus the PI portion plus the
more fine grained locking.
So the right solution is to rework the existing code.
1) Make pi_state the entity which gets enqueued into the hash bucket
and manage the waiters in a pi_state internal queue.
Straight forward problem as it is still protected by the hash
bucket lock.
2) Add private locking to the pi_state to reduce the lock contention
on the hash bucket lock.
Straight forward conversion as well
3) Make mutex a subset of rt_mutex
That makes a lot of sense, simply because a mutex is the same as a
rtmutex just without the PI part.
Just for the record before anyone cries murder: The original mutex
implemention was modeled after the rtmutex one in the first place
and the optimistic spinning was first introduced on rtmutexes by
Gregory Haskins et. al. in the context of the RT tree. Back then it
was too early or we were simply too stupid to see the similarities
and opted for a complete separate implementation.
Yes, I'm aware that this is a non-trivial problem, but if you look
at the details, then the non-trivial part is in the slow path were
stuff actually goes to sleep. The fast path of mutex and rt_mutex
is simply the same. So there is no reason to keep them separate. An
extra conditional in the slow path is not going to hurt at all.
Surely, you might say, that I'm an egoistic bastard, who just wants
to have the benefit of MCS for rtmutex for free. Right you are, but
there is a whole bunch of reasons why this makes tons of sense:
- As I said above, it's almost the same, except for the slow path
details, where a few extra conditionals do not matter
- Sharing the code has the obvious benefits. And it's not only
sharing between rtmutex and mutex, it's also sharing the
underlying optimizations of MCS for the kernel internal users and
the user space exposed ones, i.e. futexes.
- Once unified the desired futex functionality just works with a
minimal impact on the futex code itself.
The futex code does not care whether the underlying primitive has
PI semantics or not, it just cares that all the required features
like proxy locking etc. are in place. And that everything like
timeouts, requeue etc. is just supported.
- The semantics of the new futex functionality has not to be
defined again. We can simply reuse the rather well defined and
well tested straight forward semantics of the existing PI
opcodes.
- That minimizes the risk massively, because all state and error
handling is common code.
- Any update to the underlying primitives will benefit ALL usage
sites and reduces the copy and paste induced hell of MCS code
sprinkled all over the kernel.
4) Enable user space to make use of it.
Whether this will be a separate opcode or just a new flag to the
existing PI opcodes is going to be just a bikeshedding question at
that point. :)
The important point is, that we do NOT grow any new unmaintainable
warts to the futex code. And the 650+ new lines of hackery which come
with the proposed patch set are in no way acceptable, especially as
they only cover the minimalistic functionality set of futexes. Not to
talk about the other issues I've observed on the first glance.
So before anyone comes up with a "solution" for all of this tomorrow
afternoon in form of another half baken patch, please sit back mull it
in your head and lets have a proper discussion about the approach
first.
Thanks,
tglx
Thank for your thorough analysis and suggestions on what to do to
support spinning futexes. You certainly know more about the internal
working of futex than most of us. I can live with what you have
suggested. My patch is just a proof of concept piece to demonstrate
optimistic spinning on futex is something worthwhile to do. I think I
have achieved my goal of stirring interest in this area. My next step
will be to look into the direction of what you have suggested and figure
out what actual code changes will be needed.
Thank again for your help.
-Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html