On Tue, Nov 30, 2021 at 5:26 PM Li Zhijian <zhijianx.li@xxxxxxxxx> wrote:
>
> LKP/0Day found that ww_mutex.sh cannot complete since v5.16-rc1, but
> unfortunately we failed to bisect to the first bad commit (FBC); the
> bisection finally pointed to the merge commit below (91e1c99e17).
>
> Due to this hang, other tests in the same group are also blocked in
> 0Day, so we hope this hang can be fixed ASAP.
>
> If you have any ideas about this, or need more debug information,
> feel free to let me know :)
>
> BTW, ww_mutex.sh also failed in v5.15, but without hanging, and it
> does not seem to reproduce on a VM.

So, as part of the proxy-execution work, I've recently been trying to
understand why that patch series was causing apparent hangs in the
ww_mutex test with large (64) cpu counts. I was assuming my changes
were causing a lost wakeup somehow, but as I dug in it looked like the
stress_inorder_work() function was live-locking. I noticed that adding
printks to the logic would change the behavior, and finally realized I
could reproduce a livelock against mainline just by adding a printk
before the "return -EDEADLK;" in __ww_mutex_kill() (a rough snippet is
in the postscript below), which made it clear the logic is timing
sensitive. Searching around, I then found this old and unresolved
thread.

Part of the issue is that we may not hit the timeout check at the end
of the loop, since the EDEADLK case short-cuts back to retry, which
allows the test to effectively get stuck (a trimmed paraphrase of the
loop is at the end of this mail).

I know that with ww_mutexes there is supposed to be a forward-progress
guarantee, since the older context wins, but it's not clear to me that
it holds here. The EDEADLK case results in releasing and reacquiring
the locks (with the contended lock now taken first), and if a second
EDEADLK occurs, it starts over again from scratch (though with the new
contended lock chosen first instead), which seems to lose any progress
made. So maybe the test has broken that guarantee in how it restarts,
or maybe, with 128 threads each trying to acquire 16 locks in a random
order without hitting contention (the order shifting slightly each
time a thread does see contention), it may simply be a very large
space to resolve unless we luck into good timing.

Anyway, I wanted to get some feedback from folks who have a better
theoretical understanding of ww_mutexes. With large cpu counts are we
just asking for trouble here? Is the test doing something wrong? Or is
there possibly a ww_mutex bug underneath this?

thanks
-john
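
P.S. In case it helps with reproducing: the mainline change I used was
just a printk ahead of the existing kill path, roughly along these
lines (paraphrased, not an exact diff; the message text doesn't matter,
it seems to be the timing perturbation from the printk itself that
triggers it):

	/* in __ww_mutex_kill(), just ahead of the existing return: */
	printk("__ww_mutex_kill: returning -EDEADLK\n");
	return -EDEADLK;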
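
P.P.S. For reference, here is a trimmed paraphrase (from memory, not
the exact source) of the retry loop in stress_inorder_work() from
kernel/locking/test-ww_mutex.c, to show what I mean about the -EDEADLK
path short-cutting past the timeout check:

	do {
		int contended = -1;
		int n, err;

		ww_acquire_init(&ctx, &ww_class);
retry:
		err = 0;
		for (n = 0; n < nlocks; n++) {
			if (n == contended)	/* already held via lock_slow below */
				continue;
			err = ww_mutex_lock(&locks[order[n]], &ctx);
			if (err < 0)
				break;
		}

		/* drop everything we hold before deciding what to do */
		if (contended > n)
			ww_mutex_unlock(&locks[order[contended]]);
		contended = n;
		while (n--)
			ww_mutex_unlock(&locks[order[n]]);

		if (err == -EDEADLK) {
			/* sleep on the contended lock, then start over... */
			ww_mutex_lock_slow(&locks[order[contended]], &ctx);
			goto retry;	/* ...without reaching time_after() below */
		}

		ww_acquire_fini(&ctx);
	} while (!time_after(jiffies, stress->timeout));

As far as I can tell, the only way out of that inner retry cycle is to
eventually take all nlocks locks in a single pass, so under enough
contention the time_after() check may simply never be evaluated.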