Re: [PATCH 01/18] pwm: Provide devm_pwmchip_alloc() function

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 17, 2023 at 4:35 PM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
>
> On Sat, Oct 14, 2023 at 9:17 AM Uwe Kleine-König
> <u.kleine-koenig@xxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Oct 13, 2023 at 02:42:20PM -0700, Saravana Kannan wrote:
> > > On Tue, Oct 10, 2023 at 1:05 AM Uwe Kleine-König
> > > <u.kleine-koenig@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hello Saravana,
> > > >
> > > > you were pointed out to me as the expert for device links. I found a
> > > > problem with these.
> > > >
> > > > On Tue, Jul 25, 2023 at 11:10:04PM +0200, Uwe Kleine-König wrote:
> > > > > Today I managed to trigger the problem I intend to address with this
> > > > > series. My machine to test this on is an stm32mp157. To be able to
> > > > > trigger the problem reliably I applied the following patches on top of
> > > > > v6.5-rc1:
> > > > >
> > > > >  - pwm: stm32: Don't modify HW state in .remove() callback
> > > > >    This is a cleanup that I already sent out.
> > > > >    https://lore.kernel.org/r/20230713155142.2454010-2-u.kleine-koenig@xxxxxxxxxxxxxx
> > > > >    The purpose for reproducing the problem is to not trigger further
> > > > >    calls to the apply callback.
> > > > >
> > > > >  - The following patch:
> > > > >
> > > > > diff --git a/drivers/pwm/pwm-stm32.c b/drivers/pwm/pwm-stm32.c
> > > > > index 687967d3265f..c7fc02b0fa3c 100644
> > > > > --- a/drivers/pwm/pwm-stm32.c
> > > > > +++ b/drivers/pwm/pwm-stm32.c
> > > > > @@ -451,6 +451,10 @@ static int stm32_pwm_apply(struct pwm_chip *chip, struct pwm_device *pwm,
> > > > >       struct stm32_pwm *priv = to_stm32_pwm_dev(chip);
> > > > >       int ret;
> > > > >
> > > > > +     dev_info(chip->dev, "%s:%d\n", __func__, __LINE__);
> > > > > +     msleep(5000);
> > > > > +     dev_info(chip->dev, "%s:%d\n", __func__, __LINE__);
> > > > > +
> > > > >       enabled = pwm->state.enabled;
> > > > >
> > > > >       if (enabled && !state->enabled) {
> > > > > @@ -650,7 +654,11 @@ static void stm32_pwm_remove(struct platform_device *pdev)
> > > > >  {
> > > > >       struct stm32_pwm *priv = platform_get_drvdata(pdev);
> > > > >
> > > > > +     dev_info(&pdev->dev, "%s:%d\n", __func__, __LINE__);
> > > > >       pwmchip_remove(&priv->chip);
> > > > > +     dev_info(&pdev->dev, "%s:%d\n", __func__, __LINE__);
> > > > > +
> > > > > +     priv->regmap = NULL;
> > > > >  }
> > > > >
> > > > >  static int __maybe_unused stm32_pwm_suspend(struct device *dev)
> > > > >
> > > > > The first hunk is only there to widen the race window. The second is to
> > > > > give some diagnostics and make stm32_pwm_apply() crash if it continues
> > > > > to run after the msleep. (Without it it didn't crash reproducibly, don't
> > > > > understand why. *shrug*)
> > > > >
> > > > > The device tree contains a pwm-fan device making use of one of the PWMs.
> > > > >
> > > > > Now I do the following:
> > > > >
> > > > >       echo fan > /sys/bus/platform/drivers/pwm-fan/unbind & sleep 1; echo 40007000.timer:pwm > /sys/bus/platform/drivers/stm32-pwm/unbind
> > > > >
> > > > > Unbinding the fan device has two effects:
> > > > >
> > > > >  - The device link between fan and pwm looses its property to unbind fan
> > > > >    when pwm gets unbound.
> > > > >    (Its .status changes from DL_STATE_ACTIVE to DL_STATE_AVAILABLE)
> > > > >  - It calls pwm_fan_cleanup() which triggers a call to
> > > > >    pwm_apply_state().
> > > > >
> > > > > So when the pwm device gets unbound the first thread is sleeping in
> > > > > stm32_pwm_apply(). The driver calls pwmchip_remove() and sets
> > > > > priv->regmap to NULL. Then a few seconds later the first thread wakes up
> > > > > in stm32_pwm_apply() with the chip freed and priv->regmap = NULL. Bang!
> > > > >
> > > > > This looks as follows:
> > > > >
> > > > > root@crown:~# echo fan > /sys/bus/platform/drivers/pwm-fan/unbind & sleep 1; echo 40007000.timer:pwm > /sys/bus/platform/drivers/stm32-pwm/unbind
> > > > > [  187.182113] stm32-pwm 40007000.timer:pwm: stm32_pwm_apply:454
> > > > > [  188.164769] stm32-pwm 40007000.timer:pwm: stm32_pwm_remove:657
> > > > > [  188.184555] stm32-pwm 40007000.timer:pwm: stm32_pwm_remove:659
> > > > > root@crown:~# [  192.236423] platform 40007000.timer:pwm: stm32_pwm_apply:456
> > > > > [  192.240727] 8<--- cut here ---
> > > > > [  192.243759] Unable to handle kernel NULL pointer dereference at virtual address 0000001c when read
> > > > > ...
> > > > >
> > > > > Even without the crash you can see that stm32_pwm_apply() is still
> > > > > running after pwmchip_remove() completed.
> > > > >
> > > > > I'm unsure if the device link could be improved here to ensure that the
> > > > > fan is completely unbound even if it started unbinding already before
> > > > > the pwm device gets unbound. (And if it could, would this fit the device
> > > > > links purpose and so be a sensible improvement?)
> > > >
> > > > While I think that there is something to be done in the pwm core that
> > > > this doesn't explode (i.e. do proper lifetime tracking such that a
> > > > pwm_chip doesn't disappear while still being used---and I'm working on
> > > > that) I expected that the device links between pwm consumer and provider
> > > > would prevent the above described oops, too. But somehow the fan already
> > > > going away (but still using the PWM) when the PWM is unbound, results in
> > > > the PWM disappearing before the fan is completely gone.
> > > >
> > > > Is this expected, or a problem that can (and should?) be fixed?
> > >
> > > I didn't read your full series, but I read this email. With what's in
> > > this email, the problem seems to be in the driver or the pwm
> > > framework. The pwm driver/framework can't tell the driver core that
> > > you successfully unbound (returning from .remove()) before you have
> > > finish all your ongoing transactions with the device. If your
> > > "apply()" is still running, you need to make sure it's complete before
> > > .remove() does any resource releasing/clean up.
> > >
> > > Also, how is the consumer driver's .remove() succeeding if it has an
> > > ongoing pwm call()?
> >
> > The thing that works fine and as expected is:
> >
> >  - trigger unbind of PWM device via sysfs
> >
> > Because there is a device link PWM provider -> pwm consumer (fan), the
> > fan is removed and once its gone (and not earlier), the PWM gets unbound.
> >
> > The failing sequence is:
> >
> >  - trigger unbind of fan device in userspace thread A via sysfs. The
> >    fan's remove callback blocks for 5s in pwm_apply_state() and so
> >    .remove() doesn't complete yet.
> >
> >  - a second later: trigger unbind of PWM device via sysfs in thread B.
> >    As before I'd expect that the device link results in waiting for the
> >    fan to be removed completely, but the PWM is removed immediately.
> >
> >  - pwm_apply_state's sleep completes (in thread B) and operates on freed
> >    resources => bang!
> >
> > > This all sounds like insufficient locking and
> > > critical region protection in both the consumer and supplier.
> >
> > My (and I think also Thierry's) expectation was, that the device link
> > provides the needed synchronisation. But it doesn't as it doesn't block
> > the PWM provider going away until the fan is completely gone.
> >
> > > Device links can't do anything here because you are giving it wrong
> > > info -- that the unbind was successful before it actually is.
> >
> > The fan's unbind is ongoing, but not complete yet and I'd expect that
> > the device link blocks unbinding the PWM until the fan is completely
> > gone. So I think there is no wrong information.
> >
> > > Device links will and can make sure that the consumer is unbound
> > > successfully before the unbind is called on the supplier. And it looks
> > > like that's still true here.
> >
> > I hope you understood the situation better now and see the problem we
> > have.
> >
> > The problem is fixable in the pwm framework (and I'm working on that),
> > but I think there is also something to improve around devicelink
> > handling.
>
> Thanks for a better explanation of the issue. I agree, this seems like
> something device links should be able to take care of.
>
> I'll take a look into this.

Took me a while to debug this because I couldn't find the .remove()
function and I was very confused about what's going on.
I'm guessing you started hitting this issue only after moving to the
devm_ variant of the pwm APIs.

I think I have a fix.
https://lore.kernel.org/lkml/20231018013851.3303928-1-saravanak@xxxxxxxxxx/T/#u

I didn't even compile test it, but it's a trivial change. Can you
please test it out and give your "tested-by" if it fixes your issue?

Thanks,
Saravana





[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux