On 9/3/21 18:09, Pavel Skripkin wrote:
On 9/2/21 12:32, Greg Kroah-Hartman wrote:
On Sat, Aug 28, 2021 at 01:36:56PM +0200, Fabio M. De Francesco wrote:
Remove _enter_critical_mutex() and _exit_critical_mutex(). They are
unnecessary wrappers around mutex_lock_interruptible() and
mutex_unlock(), respectively. They also have an odd interface that takes
an unused argument named pirqL of type unsigned long.
The original code enters the critical section even if the mutex API is
interrupted while waiting to acquire the lock; therefore it could lead
to a race condition. Use mutex_lock() because it is uninterruptible and
so avoids the above-mentioned potential race condition.
Tested-by: Pavel Skripkin <paskripkin@xxxxxxxxx>
Reviewed-by: Pavel Skripkin <paskripkin@xxxxxxxxx>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@xxxxxxxxx>
---
v5: Fix a typo in the subject line. Reported by Aakash Hemadri.
v4: Tested and reviewed by Pavel Skripkin. No changes to the code.
v3: Assume that the original authors didn't expect that
mutex_lock_interruptible() can really be interrupted and thus lead to
a potential race condition. Furthermore, Greg Kroah-Hartman pointed out
that "[] one almost never needs interruptable locks in a driver".
Therefore, replace the calls to mutex_lock_interruptible() with calls to
mutex_lock(), since the latter is uninterruptible and avoids race
conditions without the need to handle -EINTR errors.
Based on a recent conversation on the linux-usb mailing list, perhaps I
was wrong:
https://lore.kernel.org/r/20210829015825.GA297712@xxxxxxxxxxxxxxxxxxx
Can you check what happens with your change when you disconnect the
device and these code paths are being called? That is when you do want
the lock interrupted.
Yes, the logic still seems wrong, but I don't want to see the code now
just lock up entirely with this change as it is a change in how things
work from today.
Hi, Greg!
I've retested this patch with lockdep enabled and I actually hit a
deadlock. It was my fault that I forgot about lockdep while testing v4;
I am sorry about that.
Actually, the disconnect is not the problem here; the problem was in the
original code. Changing mutex_lock_interruptible() to mutex_lock() just
helped to expose it.
The log:
[ 252.063305] WARNING: possible recursive locking detected
[ 252.063642] 5.14.0+ #9 Tainted: G C
[ 252.063946] --------------------------------------------
[ 252.064282] ip/335 is trying to acquire lock:
[ 252.064560] ffff888009ebad28 (pmutex){+.+.}-{4:4}, at: usbctrl_vendorreq+0xc5/0x4a0 [r8188eu]
[ 252.065168]
[ 252.065168] but task is already holding lock:
[ 252.065536] ffffffffc021b3b8 (pmutex){+.+.}-{4:4}, at: netdev_open+0x3a/0x5f [r8188eu]
[ 252.066085]
[ 252.066085] other info that might help us debug this:
[ 252.066494] Possible unsafe locking scenario:
[ 252.066494]
[ 252.066866] CPU0
[ 252.067025] ----
[ 252.067184] lock(pmutex);
[ 252.067367] lock(pmutex);
[ 252.067548]
[ 252.067548] *** DEADLOCK ***
[ 252.067548]
[ 252.067920] May be due to missing lock nesting notation
[ 252.067920]
[ 252.068346] 2 locks held by ip/335:
[ 252.068570] #0: ffffffffbda94628 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x1e0/0x660
[ 252.069115] #1: ffffffffc021b3b8 (pmutex){+.+.}-{4:4}, at: netdev_open+0x3a/0x5f [r8188eu]
[ 252.069690]
[ 252.069690] stack backtrace:
[ 252.069968] CPU: 1 PID: 335 Comm: ip Tainted: G C 5.14.0+ #9
[ 252.071111] Call Trace:
[ 252.071273] dump_stack_lvl+0x45/0x59
[ 252.071513] __lock_acquire.cold+0x1fe/0x31b
[ 252.072709] lock_acquire+0x157/0x3c0
[ 252.074445] __mutex_lock+0xf6/0xc90
[ 252.076294] usbctrl_vendorreq+0xc5/0x4a0 [r8188eu]
[ 252.076651] usb_read8+0x68/0x8f [r8188eu]
[ 252.076962] ? usb_read16+0x8e/0x8e [r8188eu]
[ 252.077287] _rtw_read8+0x2d/0x32 [r8188eu]
[ 252.077601] HalPwrSeqCmdParsing+0x143/0x1de [r8188eu]
[ 252.077979] rtl8188eu_InitPowerOn+0x5a/0xe0 [r8188eu]
[ 252.078352] rtl8188eu_hal_init+0xe7/0x1008 [r8188eu]
[ 252.078989] rtw_hal_init+0x38/0xb5 [r8188eu]
[ 252.079317] _netdev_open+0x282/0x4db [r8188eu]
[ 252.079653] netdev_open+0x42/0x5f [r8188eu]
OK, sorry for the noise. It's a 100% false positive. Why?
There is no pmutex in this driver. But! *All* mutexes are initialized via
the private _rtw_mutex_init() API, which has a struct mutex *pmutex argument.
So the driver registers all mutexes in the lockdep map under the same name.
Of course, lockdep will complain about _any_ nested locking...
I will prepare a patch to fix this *completely wrong* approach...
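To illustrate why a wrapper produces this report, here is a rough userspace
sketch, not the kernel or driver code. It mimics the shape of the kernel's
mutex_init() macro, which declares one static lock_class_key per expansion
site and uses the stringified argument as the lock name. All demo_* names are
hypothetical stand-ins invented for this example, and demo_wrapper_init() only
assumes that _rtw_mutex_init() calls mutex_init() on its pmutex argument.

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-ins; these are NOT the kernel definitions. */
struct lock_class_key { char unused; };

struct demo_mutex {
	const char *name;                 /* name a lockdep-like checker would print */
	const struct lock_class_key *key; /* identifies the lock class */
};

static void demo_mutex_init_keyed(struct demo_mutex *m, const char *name,
				  const struct lock_class_key *key)
{
	assert(m && name && key);
	m->name = name;
	m->key = key;
}

/*
 * Same shape as the kernel's mutex_init(): one static key per place the
 * macro is expanded, and the name is the stringified macro argument.
 */
#define demo_mutex_init(m)					\
	do {							\
		static struct lock_class_key key_;		\
		demo_mutex_init_keyed((m), #m, &key_);		\
	} while (0)

/*
 * Hypothetical wrapper in the style of _rtw_mutex_init(): the macro is
 * expanded exactly once, here, so every caller shares one key and the
 * name "pmutex".
 */
static void demo_wrapper_init(struct demo_mutex *pmutex)
{
	demo_mutex_init(pmutex);
}

/* Two locks belong to the same class when they share a key. */
static int same_class(const struct demo_mutex *a, const struct demo_mutex *b)
{
	return a->key == b->key;
}

static int has_name(const struct demo_mutex *m, const char *name)
{
	return strcmp(m->name, name) == 0;
}
```

With this sketch, two unrelated mutexes initialized through demo_wrapper_init()
end up in the same class and are both named "pmutex", so nesting them looks
like recursive locking. Calling demo_mutex_init() directly at each site gives
each mutex its own key and a name taken from the expression passed in.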
With regards,
Pavel Skripkin