On Wed, Oct 18, 2023 at 4:17 AM Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx> wrote: > > On Tue, Oct 17, 2023 at 06:42:40PM -0700, Saravana Kannan wrote: > > On Tue, Oct 17, 2023 at 4:35 PM Saravana Kannan <saravanak@xxxxxxxxxx> wrote: > > > > > > On Sat, Oct 14, 2023 at 9:17 AM Uwe Kleine-König > > > <u.kleine-koenig@xxxxxxxxxxxxxx> wrote: > > > > > > > > On Fri, Oct 13, 2023 at 02:42:20PM -0700, Saravana Kannan wrote: > > > > > On Tue, Oct 10, 2023 at 1:05 AM Uwe Kleine-König > > > > > <u.kleine-koenig@xxxxxxxxxxxxxx> wrote: > > > > > > > > > > > > Hello Saravana, > > > > > > > > > > > > you were pointed out to me as the expert for device links. I found a > > > > > > problem with these. > > > > > > > > > > > > On Tue, Jul 25, 2023 at 11:10:04PM +0200, Uwe Kleine-König wrote: > > > > > > > Today I managed to trigger the problem I intend to address with this > > > > > > > series. My machine to test this on is an stm32mp157. To be able to > > > > > > > trigger the problem reliably I applied the following patches on top of > > > > > > > v6.5-rc1: > > > > > > > > > > > > > > - pwm: stm32: Don't modify HW state in .remove() callback > > > > > > > This is a cleanup that I already sent out. > > > > > > > https://lore.kernel.org/r/20230713155142.2454010-2-u.kleine-koenig@xxxxxxxxxxxxxx > > > > > > > The purpose for reproducing the problem is to not trigger further > > > > > > > calls to the apply callback. > > > > > > > > > > > > > > - The following patch: > > > > > > > > > > > > > > diff --git a/drivers/pwm/pwm-stm32.c b/drivers/pwm/pwm-stm32.c > > > > > > > index 687967d3265f..c7fc02b0fa3c 100644 > > > > > > > --- a/drivers/pwm/pwm-stm32.c > > > > > > > +++ b/drivers/pwm/pwm-stm32.c > > > > > > > @@ -451,6 +451,10 @@ static int stm32_pwm_apply(struct pwm_chip *chip, struct pwm_device *pwm, > > > > > > > struct stm32_pwm *priv = to_stm32_pwm_dev(chip); > > > > > > > int ret; > > > > > > > > > > > > > > + dev_info(chip->dev, "%s:%d\n", __func__, __LINE__); > > > > > > > + msleep(5000); > > > > > > > + dev_info(chip->dev, "%s:%d\n", __func__, __LINE__); > > > > > > > + > > > > > > > enabled = pwm->state.enabled; > > > > > > > > > > > > > > if (enabled && !state->enabled) { > > > > > > > @@ -650,7 +654,11 @@ static void stm32_pwm_remove(struct platform_device *pdev) > > > > > > > { > > > > > > > struct stm32_pwm *priv = platform_get_drvdata(pdev); > > > > > > > > > > > > > > + dev_info(&pdev->dev, "%s:%d\n", __func__, __LINE__); > > > > > > > pwmchip_remove(&priv->chip); > > > > > > > + dev_info(&pdev->dev, "%s:%d\n", __func__, __LINE__); > > > > > > > + > > > > > > > + priv->regmap = NULL; > > > > > > > } > > > > > > > > > > > > > > static int __maybe_unused stm32_pwm_suspend(struct device *dev) > > > > > > > > > > > > > > The first hunk is only there to widen the race window. The second is to > > > > > > > give some diagnostics and make stm32_pwm_apply() crash if it continues > > > > > > > to run after the msleep. (Without it it didn't crash reproducibly, don't > > > > > > > understand why. *shrug*) > > > > > > > > > > > > > > The device tree contains a pwm-fan device making use of one of the PWMs. > > > > > > > > > > > > > > Now I do the following: > > > > > > > > > > > > > > echo fan > /sys/bus/platform/drivers/pwm-fan/unbind & sleep 1; echo 40007000.timer:pwm > /sys/bus/platform/drivers/stm32-pwm/unbind > > > > > > > > > > > > > > Unbinding the fan device has two effects: > > > > > > > > > > > > > > - The device link between fan and pwm looses its property to unbind fan > > > > > > > when pwm gets unbound. > > > > > > > (Its .status changes from DL_STATE_ACTIVE to DL_STATE_AVAILABLE) > > > > > > > - It calls pwm_fan_cleanup() which triggers a call to > > > > > > > pwm_apply_state(). > > > > > > > > > > > > > > So when the pwm device gets unbound the first thread is sleeping in > > > > > > > stm32_pwm_apply(). The driver calls pwmchip_remove() and sets > > > > > > > priv->regmap to NULL. Then a few seconds later the first thread wakes up > > > > > > > in stm32_pwm_apply() with the chip freed and priv->regmap = NULL. Bang! > > > > > > > > > > > > > > This looks as follows: > > > > > > > > > > > > > > root@crown:~# echo fan > /sys/bus/platform/drivers/pwm-fan/unbind & sleep 1; echo 40007000.timer:pwm > /sys/bus/platform/drivers/stm32-pwm/unbind > > > > > > > [ 187.182113] stm32-pwm 40007000.timer:pwm: stm32_pwm_apply:454 > > > > > > > [ 188.164769] stm32-pwm 40007000.timer:pwm: stm32_pwm_remove:657 > > > > > > > [ 188.184555] stm32-pwm 40007000.timer:pwm: stm32_pwm_remove:659 > > > > > > > root@crown:~# [ 192.236423] platform 40007000.timer:pwm: stm32_pwm_apply:456 > > > > > > > [ 192.240727] 8<--- cut here --- > > > > > > > [ 192.243759] Unable to handle kernel NULL pointer dereference at virtual address 0000001c when read > > > > > > > ... > > > > > > > > > > > > > > Even without the crash you can see that stm32_pwm_apply() is still > > > > > > > running after pwmchip_remove() completed. > > > > > > > > > > > > > > I'm unsure if the device link could be improved here to ensure that the > > > > > > > fan is completely unbound even if it started unbinding already before > > > > > > > the pwm device gets unbound. (And if it could, would this fit the device > > > > > > > links purpose and so be a sensible improvement?) > > > > > > > > > > > > While I think that there is something to be done in the pwm core that > > > > > > this doesn't explode (i.e. do proper lifetime tracking such that a > > > > > > pwm_chip doesn't disappear while still being used---and I'm working on > > > > > > that) I expected that the device links between pwm consumer and provider > > > > > > would prevent the above described oops, too. But somehow the fan already > > > > > > going away (but still using the PWM) when the PWM is unbound, results in > > > > > > the PWM disappearing before the fan is completely gone. > > > > > > > > > > > > Is this expected, or a problem that can (and should?) be fixed? > > > > > > > > > > I didn't read your full series, but I read this email. With what's in > > > > > this email, the problem seems to be in the driver or the pwm > > > > > framework. The pwm driver/framework can't tell the driver core that > > > > > you successfully unbound (returning from .remove()) before you have > > > > > finish all your ongoing transactions with the device. If your > > > > > "apply()" is still running, you need to make sure it's complete before > > > > > .remove() does any resource releasing/clean up. > > > > > > > > > > Also, how is the consumer driver's .remove() succeeding if it has an > > > > > ongoing pwm call()? > > > > > > > > The thing that works fine and as expected is: > > > > > > > > - trigger unbind of PWM device via sysfs > > > > > > > > Because there is a device link PWM provider -> pwm consumer (fan), the > > > > fan is removed and once its gone (and not earlier), the PWM gets unbound. > > > > > > > > The failing sequence is: > > > > > > > > - trigger unbind of fan device in userspace thread A via sysfs. The > > > > fan's remove callback blocks for 5s in pwm_apply_state() and so > > > > .remove() doesn't complete yet. > > > > > > > > - a second later: trigger unbind of PWM device via sysfs in thread B. > > > > As before I'd expect that the device link results in waiting for the > > > > fan to be removed completely, but the PWM is removed immediately. > > > > > > > > - pwm_apply_state's sleep completes (in thread B) and operates on freed > > > > resources => bang! > > > > > > > > > This all sounds like insufficient locking and > > > > > critical region protection in both the consumer and supplier. > > > > > > > > My (and I think also Thierry's) expectation was, that the device link > > > > provides the needed synchronisation. But it doesn't as it doesn't block > > > > the PWM provider going away until the fan is completely gone. > > > > > > > > > Device links can't do anything here because you are giving it wrong > > > > > info -- that the unbind was successful before it actually is. > > > > > > > > The fan's unbind is ongoing, but not complete yet and I'd expect that > > > > the device link blocks unbinding the PWM until the fan is completely > > > > gone. So I think there is no wrong information. > > > > > > > > > Device links will and can make sure that the consumer is unbound > > > > > successfully before the unbind is called on the supplier. And it looks > > > > > like that's still true here. > > > > > > > > I hope you understood the situation better now and see the problem we > > > > have. > > > > > > > > The problem is fixable in the pwm framework (and I'm working on that), > > > > but I think there is also something to improve around devicelink > > > > handling. > > > > > > Thanks for a better explanation of the issue. I agree, this seems like > > > something device links should be able to take care of. > > > > > > I'll take a look into this. > > > > Took me a while to debug this because I couldn't find the .remove() > > function and I was very confused about what's going on. > > I'm guessing you started hitting this issue only after moving to the > > devm_ variant of the pwm APIs. > > Ah I see. That problem wouldn't happen if the fan called a pwm API > function in its remove callback but that happens in a devm cleanup call > (registered by devm_add_action_or_reset(dev, pwm_fan_cleanup, ctx) in > pwm_fan_probe()). I first thought you talked about > 8c89fd866ad221af037ef0ec3d60b83d0b859c65. Am I not talking about that commit? Btw, I'm still a bit confused by this thread. In your earlier emails to me, you said: > - trigger unbind of fan device in userspace thread A via sysfs. The > fan's remove callback blocks for 5s in pwm_apply_state() and so > .remove() doesn't complete yet. But the latest tree (Tot) didn't have any .remove() function at all. So I just decided to see if there's any issue in Tot and just fix that. I'm glad my fix helps fixed the issue with the used of devm_*() APIs. However, are you really seeing the issue (supplier freed before consumer) if you do the clean up in the .remove() function? If so, there might still be another issue that needs to be fixed. Thanks, Saravana