Ok a few more issues. This never stops. Here's the basic spinloop: On Wed, 7 Jan 2009, Peter Zijlstra wrote: > > + for (;;) { > + struct task_struct *l_owner; > + > + /* Stop spinning when there's a pending signal. */ > + if (signal_pending_state(task->state, task)) > + break; This just can't be right. If we have signal latency issues, we have way way _way_ bigger issues. So the correct test is to just break out if we have any work at all, whether it's signals or rescheduling. Yes, you probably never saw that in RT, but that was because you have preemption on etc. And obviously usually there isn't enough constant contention to trigger anything like this anyway, so usually the wait is short either because the owner ends up sleeping, or because the owner releases the semaphore. But you could have starvation issues where everybody is all on CPU, and other processes constantly get the semaphore, and latency is absolutely HORRIBLE because of this guy just busy-looping all the time. So at a minimum, add a couple of "if (need_resched())" calls. However, the other thing that really strikes me is that we've done all that insane work to add ourselves to the waiter lists etc, and _then_ we start spinning. Again, that can't be right: if there are other people on the waiter list, we should not spin - because it's going to be really really unfair to take the lock from them! So we should do all this spinning ONLY IF there are no other waiters, ie everybody else is spinning too! Anything else sounds just horribly broken due to the extreme unfairness of it all. Those things are supposed to be fair. What does that mean? It means that the spinning loop should also check that the wait-queue was empty. But it can't do that, because the way this whole adaptive spinning was done is _inside_ the slow path that already added itself to the waiting list - so you cannot tell if there are other spinners. So I think that the spinning is actually done in the wrong place. It _should_ be done at the very top of __mutex_lock_common, _before_ we've done that lock->wait_list thing etc. That also makes us only spin at the beginning, and if we start sleeping, we don't suddenly go back to spinning again. That in turn would mean that we should try to do this spinning without any locks. I think it's doable. I think it's also doable to spin _without_ dirtying the cacheline further and bouncing it due to the spinlock. We can really do the spinning by just reading that lock, and the only time we want to write is when we see the lock releasing. Something like this.. NOTE NOTE NOTE! This does _not_ implement "spin_on_owner(lock, owner);". That's the thing that the scheduler needs to do, with the extra interesting part of needing to be able to access the thread_info struct without knowing whether it might possibly have exited already. But we can do that with __get_user(thread_info->cpu) (very unlikely page fault protection due to the possibility of CONFIG_DEBUG_PAGEALLOC) and then validating the cpu. It it's in range, we can use it and verify whether cpu_rq(cpu)->curr has that thread_info. So we can do all that locklessly and optimistically, just going back and verifying the results later. This is why "thread_info" is actually a better thing to use than "task_struct" - we can look up the cpu in it with a simple dereference. We knew the pointer _used_ to be valid, so in any normal situation, it will never page fault (and if you have CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance isn't your concern anyway: we just need to make the page fault be non-lethal ;) Linus --- kernel/mutex.c | 30 ++++++++++++++++++++++++++++-- 1 files changed, 28 insertions(+), 2 deletions(-) diff --git a/kernel/mutex.c b/kernel/mutex.c index 4f45d4b..65525d0 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -120,6 +120,8 @@ void __sched mutex_unlock(struct mutex *lock) EXPORT_SYMBOL(mutex_unlock); +#define MUTEX_SLEEPERS (-1000) + /* * Lock a mutex (possibly interruptible), slowpath: */ @@ -132,6 +134,30 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, unsigned int old_val; unsigned long flags; +#ifdef CONFIG_SMP + /* Optimistic spinning.. */ + for (;;) { + struct thread_struct *owner; + int oldval = atomic_read(&lock->count); + + if (oldval <= MUTEX_SLEEPERS) + break; + if (oldval == 1) { + oldval = atomic_cmpxchg(&lock->count, oldval, 0); + if (oldval == 1) { + lock->owner = task_thread_info(task); + return 0; + } + } else { + /* See who owns it, and spin on him if anybody */ + owner = lock->owner; + if (owner) + spin_on_owner(lock, owner); + } + cpu_relax(); + } +#endif + spin_lock_mutex(&lock->wait_lock, flags); debug_mutex_lock_common(lock, &waiter); @@ -142,7 +168,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, list_add_tail(&waiter.list, &lock->wait_list); waiter.task = task; - old_val = atomic_xchg(&lock->count, -1); + old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS); if (old_val == 1) goto done; @@ -158,7 +184,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * that when we release the lock, we properly wake up the * other waiters: */ - old_val = atomic_xchg(&lock->count, -1); + old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS); if (old_val == 1) break; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html