Hi, On Mon, Jun 22, 2020 at 10:22:48AM +0200, Krzysztof Kozlowski wrote: > On Fri, Jun 19, 2020 at 07:55:21PM +0200, Sebastian Reichel wrote: > > On Wed, May 27, 2020 at 09:42:54AM +0200, Pali Rohár wrote: > > > On Tuesday 26 May 2020 21:16:28 Andrew F. Davis wrote: > > > > On 5/25/20 7:32 AM, Krzysztof Kozlowski wrote: > > > > > This reverts commit 8cfaaa811894a3ae2d7360a15a6cfccff3ebc7db. > > > > > > > > > > If device was unbound and bound, the polling interval would be set to 0. > > > > > This is both unexpected and messes up with other bq27xxx devices (if > > > > > more than one battery device is used). > > > > > > > > > > This reset of polling interval was added in commit 8cfaaa811894 > > > > > ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver") > > > > > stating that power_supply_unregister() calls get_property(). However in > > > > > Linux kernel v3.1 and newer, such call trace does not exist. > > > > > Unregistering power supply does not call get_property() on unregistered > > > > > power supply. > > > > > > > > > > Fixes: 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 driver") > > > > > Cc: <stable@xxxxxxxxxxxxxxx> > > > > > Signed-off-by: Krzysztof Kozlowski <krzk@xxxxxxxxxx> > > > > > > > > > > --- > > > > > > > > > > I really could not identify the issue being fixed in offending commit > > > > > 8cfaaa811894 ("bq27x00_battery: Fix OOPS caused by unregistring bq27x00 > > > > > driver"), therefore maybe I missed here something important. > > > > > > > > > > Please share your thoughts on this. > > > > > > > > I'm having a hard time finding the OOPS also. Maybe there is a window > > > > where the poll function is running or about to run where > > > > cancel_delayed_work_sync() is called and cancels the work, only to have > > > > an interrupt or late get_property call in to the poll function and > > > > re-schedule it. > > > > > > > > What we really need is to do is look at how we are handling the polling > > > > function. It gets called from the workqueue, from a threaded interrupt > > > > context, and from a power supply framework callback, possibly all at the > > > > same time. Sometimes its protected by a lock, sometimes not. Updating > > > > the device's cached data should always be locked. > > > > > > > > What's more is the poll function is self-arming, so if we call > > > > cancel_delayed_work_sync() (remove it from the work queue then then wait > > > > for it to finish if running), are we sure it wont have just re-arm itself? > > > > > > > > We should make the only way we call the poll function be through the > > > > work queue, (plus make sure all accesses to the cache are locked). > > > > > > > > Andrew > > > > > > I do not remember details too. It is long time ago. > > > > > > CCing Ivaylo Dimitrov as he may remember something... > > > > Applying this revert introduces at least a race condition when > > userspace reads sysfs files while kernel removes the driver. > > > > So looking at the entrypoints for schedules: > > > > bq27xxx_battery_i2c_probe: > > Not relevant, probe is done when the battery is being removed. > > > > poll_interval_param_set: > > Can be avoided by unregistering from the list earlier. This > > is the right thing to do considering the battery is added to > > the list as last step in the probe routine, it should be removed > > first during teardown. > > Yes, good point. > > > bq27xxx_external_power_changed: > > This can happen at any time while the power-supply device is > > registered, because of the code in get_property. > > > > bq27xxx_battery_poll: > > This can happen at any time while the power-supply device is > > registered. > > > > As far as I can see the only thing in the delayed work needing > > the power-supply device is power_supply_changed(). If we add a > > check, that di->bat is not NULL, we should be able to reorder > > teardown like this: > > Except power_supply structure there is the device state struct > bq27xxx_device_info 'di'. If bq27xxx_battery_poll() is called > during the unbind, it will access the 'di' which is being freed by > devm-framework. And just checking for di->bat is also not thread > safe (can be reordered). > > I think there is no easy few-line fix for this. Instead, the > workqueue scheduling should be guarded everywhere by device-instance > mutex (bq27xxx_device_info.lock). Freeing of bq27xxx_device_info 'di' is not a problem, since the managed resource is attached to the i2c device, not the power-supply device. The I2C device is still available after step A2 from the list below, only the power-supply device will be unregistered/free'd. As a result after step A2 external power change can no longer happen and get_property can no longer be called, so no new work can be scheduled (apart from requeuing). After step A3 work incl. requeuing is stopped (cancel_delayed_work_sync handles work requeing itself). The alternative would be to introduce a flag in di, that the device is about to teardown and avoid rescheduling work from external_power_changed + get_property when that is set. Then it would be possible to teardown like this: B1. get mutex, set teardown flag, release mutex B2. remove from list B3. cancel delayed work B4. unregister power-supply device B5. destroy mutex > > A1. remove from list > > A2. unregister power-supply device and set to di->bat to NULL > > A3. cancel delayed work > > A4. destroy mutex > > > > Also I agree with Andrew, that the locking looks fishy. I think > > the lock needs to be moved, so that the call to > > bq27xx_battery_update(di) in bq27xxx_battery_poll is protected. > > Exactly. > > Best regards, > Krzysztof -- Sebastian
Attachment:
signature.asc
Description: PGP signature