On 20.09.22 21:23, Venkat Ch wrote:
Hi Felix,
Following is the background of the problem, how I traced to
mutex_lock and why I propose rcu locks.
Issue:
On a 10Mbps upload / 50 Mbps download connection, the following issue reported.
Video periodically freezes and/or appears delayed when on Zoom or Teams.
1. Video will freeze for 10 or 15 seconds periodically when on a call,
but audio continues and the session doesn't drop.
2. The video still works but it appears delay (I move, but the video
of my movement is noticeably delay by a second or so)
Tracing to mutex_lock(sta_mutex):
When I investigated, I found that the ucentral agent in openwifi
fetches the station list periodically. Without the station list
fetch, the video quality is just fine. I investigated the station list
path and found this mutex_lock. I also see that earlier it was
rcu_lock which protected the station list in this path. In this
commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9,
rcu lock was changed to mutex lock without providing any reason.
The reason seems clear to me, even though it was not explicitly stated
in the commit message: in sta_set_sinfo it introduces a call to a driver
op that is allowed to sleep.
I also saw this comment just above the sta_mutex declaration.
/* Station data */
/*
* The mutex only protects the list, hash table and
* counter, reads are done with RCU.
*/
struct mutex sta_mtx;
So I reverted back the mutex_lock with rcu_lock and it just worked
fine. We tested for more than 2 weeks before concluding this analysis.
I think the usage of mutex_lock is impacting the tx / rx path
somewhere and hence the issue. It is a challenge to trace the exact
function though.
I don't see any critical part in the tx/rx path which depends on the
sta_mtx lock. My guess is that for some reason your change is simply
accidentally papering over the real bug, which may be somewhere else
entirely, maybe even in the driver. A freeze for 10-15 seconds
definitely does not sound like simple lock contention on the mutex,
since a single station dump will be much faster than that.
- Felix