Hi,
W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:
On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
Hi Daniel,
<snip>
I did reproduce:
v5.8-rc3 + series => imx6 hang at boot time
v5.8-rc3 => imx6 boots correctly
So finally I succeeded to reproduce it on my imx7 locally. The sensor
was failing to initialize for another reason related to the legacy
cooling device, this is why it is not appearing on the imx7.
I can now git-bisect :)
That would be very kind of you, thank you!
With the lock correctness option enabled:
[ 4.179223] imx_thermal tempmon: Extended Commercial CPU temperature
grade - max:105C critical:100C passive:95C
[ 4.189557]
[ 4.191060] ============================================
[ 4.196378] WARNING: possible recursive locking detected
[ 4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
[ 4.207102] --------------------------------------------
[ 4.212421] kworker/0:3/54 is trying to acquire lock:
[ 4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_device_is_enabled+0x18/0x34
[ 4.225777]
[ 4.225777] but task is already holding lock:
[ 4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_get_temp+0x38/0x6c
[ 4.239121]
[ 4.239121] other info that might help us debug this:
[ 4.245655] Possible unsafe locking scenario:
[ 4.245655]
[ 4.251579] CPU0
[ 4.254031] ----
[ 4.256481] lock(&tz->lock);
[ 4.259544] lock(&tz->lock);
[ 4.262608]
[ 4.262608] *** DEADLOCK ***
[ 4.262608]
[ 4.268533] May be due to missing lock nesting notation
[ 4.268533]
[ 4.275329] 4 locks held by kworker/0:3/54:
[ 4.279517] #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:
process_one_work+0x224/0x808
[ 4.288241] #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:
process_one_work+0x224/0x808
[ 4.296787] #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
__device_attach+0x30/0x140
[ 4.304468] #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_get_temp+0x38/0x6c
[ 4.312408]
[ 4.312408] stack backtrace:
[ 4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
5.8.0-rc3-00011-gf5e50bf4d3ef #42
[ 4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
[ 4.330809] Workqueue: events deferred_probe_work_func
[ 4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
(show_stack+0x10/0x14)
[ 4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
(dump_stack+0xe8/0x114)
[ 4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
(__lock_acquire+0xbfc/0x2cb4)
[ 4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
(lock_acquire+0xf4/0x4e4)
[ 4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
(__mutex_lock+0xb0/0xaa8)
[ 4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
(mutex_lock_nested+0x1c/0x24)
[ 4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
(thermal_zone_device_is_enabled+0x18/0x34)
[ 4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
[<c0d9da90>] (imx_get_temp+0x30/0x208)
[ 4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
(thermal_zone_get_temp+0x4c/0x6c)
[ 4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]
(thermal_zone_device_update+0x8c/0x258)
[ 4.419310] [<c0d93df0>] (thermal_zone_device_update) from
[<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
[ 4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
[<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
[ 4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
(platform_drv_probe+0x48/0x98)
[ 4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]
(really_probe+0x218/0x348)
[ 4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
(driver_probe_device+0x5c/0xb4)
[ 4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]
(bus_for_each_drv+0x58/0xb8)
[ 4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
(__device_attach+0xd4/0x140)
[ 4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
(bus_probe_device+0x88/0x90)
[ 4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
(deferred_probe_work_func+0x68/0x98)
[ 4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]
(process_one_work+0x2e0/0x808)
[ 4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
(worker_thread+0x2a0/0x59c)
[ 4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
(kthread+0x16c/0x178)
[ 4.522843] [<c0372208>] (kthread) from [<c0300174>]
(ret_from_fork+0x14/0x20)
[ 4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
[ 4.535138] 5fa0: 00000000
00000000 00000000 00000000
[ 4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[ 4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Thanks!
That confirms your suspicions.
So the reason is that ->get_temp() is called while the mutex is held and
thermal_zone_device_is_enabled() wants to take the same mutex.
Is adding a comment to thermal_zone_device_is_enabled() to never call
it while the mutex is held and adding another version of it which does
not take the mutex ok?
Andrzej