Hi,

On Wednesday, 4 March 2015 at 02:13 +0200, Matthias Bonne wrote:
> I am trying to understand how mutexes work in the kernel, and I think
> there might be a race between mutex_trylock() and mutex_unlock(). More
> specifically, the race is between the functions
> __mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
> defined in kernel/locking/mutex.c).
>
> Consider the following sequence of events:
>
> 0. Suppose a mutex is locked by task A and has no waiters.
>
> 1. Task B calls mutex_trylock().
>
> 2. mutex_trylock() calls the architecture-specific
>    __mutex_fastpath_trylock(), with __mutex_trylock_slowpath() as
>    fail_fn.
>
> 3. According to the description of __mutex_fastpath_trylock() (for
>    example in include/asm-generic/mutex-dec.h), "if the architecture
>    has no effective trylock variant, it should call the fail_fn
>    spinlock-based trylock variant unconditionally". So
>    __mutex_fastpath_trylock() may now call __mutex_trylock_slowpath().
>
> 4. Task A releases the mutex.
>
> 5. Task B, in __mutex_trylock_slowpath, executes:
>
>         /* No need to trylock if the mutex is locked. */
>         if (mutex_is_locked(lock))
>                 return 0;
>
>    Since the mutex is no longer locked, the function continues.
>
> 6. Task C, which runs on a different cpu than task B, locks the mutex
>    again.
>
> 7. Task B, in __mutex_trylock_slowpath(), continues:
>
>         spin_lock_mutex(&lock->wait_lock, flags);
>
>         prev = atomic_xchg(&lock->count, -1);
>         if (likely(prev == 1)) {
>                 mutex_set_owner(lock);
>                 mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
>         }
>
>    At this point task B holds mutex->wait_lock, prev is 0 (because there
>    are no waiters other than task B, so the count was 0) and the mutex
>    count is set to -1.
>
> 8. Task C calls mutex_unlock() to unlock the mutex.
>
> 9. mutex_unlock() calls the architecture-specific function
>    __mutex_fastpath_unlock(), which fails (because the mutex count is
>    -1), so it now calls __mutex_unlock_slowpath(), which calls
>    __mutex_unlock_common_slowpath().
>
> 10. __mutex_unlock_common_slowpath() sets the mutex count to 1
>     unconditionally, before spinning on mutex->wait_lock.
>
> 11. Task B, in __mutex_trylock_slowpath, continues:
>
>         /* Set it back to 0 if there are no waiters: */
>         if (likely(list_empty(&lock->wait_list)))
>                 atomic_set(&lock->count, 0);
>
>         spin_unlock_mutex(&lock->wait_lock, flags);
>
>         return prev == 1;
>
>     mutex->wait_list is still empty, so the code sets the mutex count to
>     zero (which means the mutex is locked), releases mutex->wait_lock,
>     and returns 0 (which means that the mutex is locked by someone else,
>     and cannot be acquired).
>
> 12. Task C, in __mutex_unlock_common_slowpath, acquires
>     mutex->wait_lock, unlocks it immediately (because there are no
>     waiters to wake up) and returns.
>
> The end result is that the mutex count is 0 (locked), although the
> owner has just released it, and nobody else is holding the mutex. So it
> can no longer be acquired by anyone.
>
> Am I missing something that prevents the above scenario from happening?
> If not, should I post a patch that fixes it to LKML? Or is it
> considered too "theoretical" and cannot happen in practice?
>

I haven't looked at your explanation. You should have provided a
reproducible test case to demonstrate the issue (perhaps one that
involves slowing down one CPU?).

Anyway, this kind of in-depth knowledge of the mutex implementation is
to be found on LKML.

Regards.

--
Yann Droneaud
OPTEYA

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@xxxxxxxxxxxxxxxxx
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
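
As a rough starting point for the test case asked for above, the quoted interleaving can be replayed deterministically in user space. The sketch below is not kernel code: struct toy_mutex, struct replay_result and replay_interleaving() are hypothetical names, the wait_lock spinlock is left out because a single thread executes the steps in the order given in the quoted message, and the fast paths of tasks A and C are simplified to plain stores. It only confirms the arithmetic of the scenario and says nothing about whether the kernel can actually reach this ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model (hypothetical, not the kernel's struct mutex). */
struct toy_mutex {
    atomic_int count;      /* 1: unlocked, 0: locked, -1: locked, possible waiters */
    bool wait_list_empty;  /* stands in for list_empty(&lock->wait_list) */
};

struct replay_result {
    int trylock_ret;  /* what task B's trylock returns */
    int final_count;  /* mutex count once every task is done */
};

/* Replays the quoted sequence of events, one step after another. */
static struct replay_result replay_interleaving(void)
{
    struct toy_mutex m;
    struct replay_result r;
    int prev;

    /* Initial state: task A holds the mutex, no waiters. */
    atomic_init(&m.count, 0);
    m.wait_list_empty = true;

    /* Task A releases the mutex (simplified to a plain store). */
    atomic_store(&m.count, 1);

    /* Task B, trylock slowpath: return early if the mutex is locked. */
    if (atomic_load(&m.count) != 1) {
        r.trylock_ret = 0;
        r.final_count = atomic_load(&m.count);
        return r;
    }

    /* Task C locks the mutex again (simplified to a plain store). */
    atomic_store(&m.count, 0);

    /* Task B, holding wait_lock: prev = atomic_xchg(&lock->count, -1). */
    prev = atomic_exchange(&m.count, -1);
    assert(prev == 0);  /* C owns the mutex, so B did not acquire it */

    /* Task C, unlock slowpath: sets the count to 1 unconditionally,
     * then would spin on wait_lock, which B still holds. */
    atomic_store(&m.count, 1);

    /* Task B: no waiters, so it sets the count back to 0, drops
     * wait_lock and reports failure. */
    if (m.wait_list_empty)
        atomic_store(&m.count, 0);
    r.trylock_ret = (prev == 1);

    /* Task C: takes wait_lock, finds no waiter to wake, returns. */

    r.final_count = atomic_load(&m.count);
    return r;
}
```

Calling replay_interleaving() from a small main() and inspecting the result shows trylock_ret == 0 and final_count == 0: task B did not get the mutex, task C has released it, and the count still reads "locked". A real reproducer would need separate threads or CPUs and some way to widen the window between B's mutex_is_locked() check and its atomic_xchg().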