On 21/02/2022 15:43, Guenter Roeck wrote:
...
We observed a random null pointer dereference crash somewhere in the
thermal core (crash log below is not very helpful) when calling
mutex_lock(). It looks like we get an interrupt when this crash
happens.
Looking at the lm90 driver, per the above, I now see we are calling
hwmon_notify_event() from the lm90 interrupt handler. Looking at
hwmon_notify_event() I see that ...
  hwmon_notify_event()
    --> hwmon_thermal_notify()
      --> thermal_zone_device_update()
        --> update_temperature()
          --> mutex_lock()
So although I don't completely understand the crash, it does seem
that we should not be calling hwmon_notify_event() from the
interrupt handler.
As mentioned separately, this is not the problem.
Yes I can see that now.
I think the problem may be that this is not a devicetree system
(or the lm90 device does not have a devicetree node), but thermal
notification currently only works in such systems because the hwmon
subsystem uses the devicetree registration method. At the same time,
CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
does not bail out in that situation due to another bug.
The platform I see this on does use device-tree and it does have a node
for the ti,tmp451 device which uses the lm90 driver. This platform uses
the device-tree source
arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node
is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
Cheers
Jon