On 2/21/22 07:49, Jon Hunter wrote:
On 21/02/2022 15:43, Guenter Roeck wrote:
...
We observed a random null pointer dereference crash somewhere in the
thermal core (crash log below is not very helpful) when calling
mutex_lock(). It looks like we get an interrupt when this crash
happens.
Looking at the lm90 driver, per the above, I now see we are calling
hwmon_notify_event() from the lm90 interrupt handler. Looking at
hwmon_notify_event() I see that ...
  hwmon_notify_event()
    --> hwmon_thermal_notify()
      --> thermal_zone_device_update()
        --> update_temperature()
          --> mutex_lock()
So although I don't completely understand the crash, it does seem
that we should not be calling hwmon_notify_event() from the
interrupt handler.
As mentioned separately, this is not the problem.
Yes I can see that now.
I think the problem may be that this is not a devicetree system
(or the lm90 device does not have a devicetree node), but thermal
notification currently only works in such systems because the hwmon
subsystem uses the devicetree registration method. At the same time,
CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
does not bail out in that situation due to another bug.
The platform I see this on does use device-tree and it does have a
node for the ti,tmp451 device, which uses the lm90 driver. This
platform uses the device-tree source
arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451
node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
Interesting. It appears that the call to devm_thermal_zone_of_sensor_register()
in the hwmon core nevertheless returns -ENODEV which is not handled properly
in the hwmon core. I can see a number of reasons for this to happen:
- there is no devicetree node for the lm90 device
- there is no thermal-zones devicetree node
- there is no thermal zone entry in the thermal-zones node which matches
the sensor
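
To illustrate what handling -ENODEV could look like, here is a minimal
userspace sketch. It is not the hwmon core code and not a proposed
patch: every name in it (fake_tz, fake_hwmon, fake_register_zone,
fake_hwmon_add_sensor, fake_hwmon_notify) is hypothetical, and the
registration call is assumed to return a plain negative errno rather
than an error pointer. The idea is only that "no matching thermal zone"
is treated as "nothing to notify" instead of leaving a bad pointer
behind.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real kernel structures. */
struct fake_tz {
	int id;
};

struct fake_hwmon {
	struct fake_tz *tzd;	/* NULL when no thermal zone is attached */
};

/*
 * Assumed registration helper: reports failure as a negative errno.
 * have_matching_zone models whether a thermal zone entry matching the
 * sensor exists.
 */
static int fake_register_zone(int have_matching_zone, struct fake_tz **out)
{
	static struct fake_tz zone = { .id = 1 };

	if (!have_matching_zone) {
		*out = NULL;
		return -ENODEV;
	}
	*out = &zone;
	return 0;
}

/* Treat -ENODEV as "no thermal zone for this sensor", not as an error. */
static int fake_hwmon_add_sensor(struct fake_hwmon *hw, int have_matching_zone)
{
	struct fake_tz *tz;
	int err = fake_register_zone(have_matching_zone, &tz);

	if (err == -ENODEV) {
		hw->tzd = NULL;
		return 0;
	}
	if (err)
		return err;

	hw->tzd = tz;
	return 0;
}

/* Returns 1 if a zone would be notified, 0 if there is none to notify. */
static int fake_hwmon_notify(const struct fake_hwmon *hw)
{
	if (!hw->tzd)
		return 0;
	return 1;
}
```

With no matching zone, fake_hwmon_add_sensor() succeeds and a later
fake_hwmon_notify() does nothing; with a matching zone, the notify path
runs as usual.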
We'll have to revert the lm90 changes until this is sorted out.
Guenter