On Thu, Jun 6, 2024 at 4:50 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
>
> On 06/06/2024 16:18, Rafael J. Wysocki wrote:
> > On Thu, Jun 6, 2024 at 3:42 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> >>
> >> On Thu, Jun 6, 2024 at 3:07 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
> >>>
> >>> On 05/06/2024 21:17, Rafael J. Wysocki wrote:
> >>>> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >>>>
> >>>> It is reported that commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling
> >>>> device state to thermal_debug_cdev_add()") causes the ACPI fan driver
> >>>> to fail probing on some systems which turns out to be due to the _FST
> >>>> control method returning an invalid value until _FSL is first evaluated
> >>>> for the given fan.  If this happens, the .get_cur_state() cooling device
> >>>> callback returns an error and __thermal_cooling_device_register() fails
> >>>> as it uses that callback after commit 31a0fa0019b0.
> >>>>
> >>>> Arguably, _FST should not return an invalid value even if it is
> >>>> evaluated before _FSL, so this may be regarded as a platform firmware
> >>>> issue, but at the same time it is not a good enough reason for failing
> >>>> the cooling device registration where the initial cooling device state
> >>>> is only needed to initialize a thermal debug facility.
> >>>>
> >>>> Accordingly, modify __thermal_cooling_device_register() to pass a
> >>>> negative state value to thermal_debug_cdev_add() instead of failing
> >>>> if the initial .get_cur_state() callback invocation fails and adjust
> >>>> the thermal debug code to ignore negative cooling device state values.
> >>>>
> >>>> Fixes: 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()")
> >>>> Closes: https://lore.kernel.org/linux-acpi/20240530153727.843378-1-laura.nao@xxxxxxxxxxxxx
> >>>> Reported-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
> >>>> Tested-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
> >>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >>>
> >>> As it is a driver issue, it should be fixed in the driver, not in the
> >>> core code. The resulting code logic in the core is trying to deal with
> >>> bad driver behavior, it does not really seem appropriate.
> >
> > Besides, I don't quite agree with dismissing it as a driver issue.  If
> > a driver cannot determine the cooling device state, it should not be
> > required to make it up.
> >
> > Because .get_cur_state() is specifically designed to be able to return
> > an error, the core should be prepared to deal with errors returned by
> > it and propagating the error is not always the best choice, like in
> > this particular case.
> >
> >>> The core code has been cleaned up from the high friction it had with the
> >>> legacy ACPI code. It would be nice to continue in this direction.
> >
> > This isn't really ACPI specific.  Any driver can return an error from
> > .get_cur_state() if it has a good enough reason.
>
> We are talking about registration time, right? If the driver is
> registering too soon, e.g. the firmware is not ready, should it fix the
> moment it is registering the cooling device when it is sure the firmware
> completed its initialization?

OK, so arguably the driver could set the initial state of the cooling
device to 0.  That may or may not be the right thing to do depending
on the thermal state of the system at the moment.  Then it would need
to wait for the governor to pick up a more suitable state for it or
leave it at 0.  This could address the particular case at hand.

However, should the core fail the cooling device registration if it
gets an error from .get_cur_state() to start with?  It didn't do that
before.  Indeed, it didn't even call .get_cur_state() then in the
first place.

Moreover, the current state of the cooling device is not even needed
to register it except for the initialization of the debug code for
that cooling device, so why fail the registration of it?
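
To make the point concrete, here is a minimal user-space sketch of the
behavior the changelog describes.  It is not the actual patch and it is
not kernel code: every name in it (struct cooling_device, struct
cooling_ops, cooling_device_register(), debug_cdev_add() and the two
example callbacks) is a hypothetical, simplified stand-in.  It only
shows the idea that a failing .get_cur_state() makes registration pass
a negative state to the debug code, which then ignores it, rather than
failing the registration.

```c
#include <errno.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct cooling_device;

struct cooling_ops {
	/* May fail, just like the real .get_cur_state() callback. */
	int (*get_cur_state)(struct cooling_device *cdev, unsigned long *state);
};

struct cooling_device {
	const struct cooling_ops *ops;
	int registered;
	/* Stand-in for the debug facility's per-device data. */
	int debug_has_initial_state;
	unsigned long debug_initial_state;
};

/* Models a fan whose firmware cannot report a valid state yet. */
static int broken_get_cur_state(struct cooling_device *cdev,
				unsigned long *state)
{
	(void)cdev;
	(void)state;
	return -EIO;
}

/* Models a device that can report its state right away. */
static int working_get_cur_state(struct cooling_device *cdev,
				 unsigned long *state)
{
	(void)cdev;
	*state = 3;
	return 0;
}

/* Debug setup: a negative state means "unknown" and is ignored. */
static void debug_cdev_add(struct cooling_device *cdev, int state)
{
	if (state < 0) {
		cdev->debug_has_initial_state = 0;
		return;
	}
	cdev->debug_has_initial_state = 1;
	cdev->debug_initial_state = (unsigned long)state;
}

/* Registration does not fail just because the initial state is unknown. */
static int cooling_device_register(struct cooling_device *cdev,
				   const struct cooling_ops *ops)
{
	unsigned long current_state;
	int state;

	cdev->ops = ops;

	if (ops->get_cur_state(cdev, &current_state) == 0)
		state = (int)current_state;
	else
		state = -1;	/* unknown: let the debug code skip it */

	cdev->registered = 1;
	debug_cdev_add(cdev, state);
	return 0;
}
```

With broken_get_cur_state() the device still ends up registered and the
debug data simply has no initial state recorded, whereas with
working_get_cur_state() the initial state (3 in this sketch) is
recorded as usual.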