On Wed, 15 Jan 2025 15:37:57 +0200 Claudiu Beznea <claudiu.beznea@xxxxxxxxx> wrote: > Hi, Jonathan, > > Thank you for your input! > > On 11.01.2025 15:14, Jonathan Cameron wrote: > > On Mon, 6 Jan 2025 11:18:41 +0200 > > Claudiu Beznea <claudiu.beznea@xxxxxxxxx> wrote: > > > >> Hi, Jonathan, > >> > >> > >> On 04.01.2025 15:52, Jonathan Cameron wrote: > >>> On Fri, 3 Jan 2025 16:00:41 +0200 > >>> Claudiu <claudiu.beznea@xxxxxxxxx> wrote: > >>> > >>>> From: Claudiu Beznea <claudiu.beznea.uj@xxxxxxxxxxxxxx> > >>> +CC Rafael and linux-pm > >>> > >>>> > >>>> On all systems where the rzg2l_adc driver is used, the ADC clocks are part > >>>> of a PM domain. The code that implements the PM domains support is in > >>>> drivers/clk/renesas/rzg2l-cpg.c, the functions of interest for this commit > >>>> being rzg2l_cpg_attach_dev() and rzg2l_cpg_deattach_dev(). The PM > >>>> domains support is registered with GENPD_FLAG_PM_CLK which, according to > >>>> the documentation, instructs genpd to use the PM clk framework while > >>>> powering on/off attached devices. > >>>> > >>>> During probe, the ADC device is attached to the PM domain > >>>> controlling the ADC clocks. Similarly, during removal, the ADC device is > >>>> detached from the PM domain. > >>>> > >>>> The detachment call stack is as follows: > >>>> > >>>> device_driver_detach() -> > >>>> device_release_driver_internal() -> > >>>> __device_release_driver() -> > >>>> device_remove() -> > >>>> platform_remove() -> > >>>> dev_pm_domain_detach() > >>>> > >>>> During driver unbind, after the ADC device is detached from its PM domain, > >>>> the device_unbind_cleanup() function is called, which subsequently invokes > >>>> devres_release_all(). This function handles devres resource cleanup. > >>>> > >>>> If runtime PM is enabled via devm_pm_runtime_enable(), the cleanup process > >>>> triggers the action or reset function for disabling runtime PM. This > >>>> function is pm_runtime_disable_action(), which leads to the following call > >>>> stack of interest when called: > >>>> > >>>> pm_runtime_disable_action() -> > >>>> pm_runtime_dont_use_autosuspend() -> > >>> > >>> So is the only real difference that in the code below you disable runtime pm > >>> before autosuspend? > >> > >> No, the difference is that now, the driver specific runtime PM APIs are not > >> called anymore (through the pointed call stack) after the ADC was removed > >> from it's PM domain. > > > > By my reading they are only not called now because you turn autosuspend off > > after disabling runtime PM. > > Sorry, I wanted to say that the runtime PM APIs are not called anymore from > devm_action_release(), though this call stack: > > [ 24.801195] Call trace: > [ 24.803633] rzg2l_adc_pm_runtime_suspend+0x18/0x54 (P) > [ 24.808847] pm_generic_runtime_suspend+0x2c/0x44 (L) > [ 24.813887] pm_generic_runtime_suspend+0x2c/0x44 > [ 24.818580] __rpm_callback+0x48/0x198 > [ 24.822319] rpm_callback+0x68/0x74 > [ 24.825798] rpm_suspend+0x100/0x578 > [ 24.829362] rpm_idle+0xd0/0x17c > [ 24.832582] update_autosuspend+0x30/0xc4 > [ 24.836580] pm_runtime_disable_action+0x40/0x64 > [ 24.841184] devm_action_release+0x14/0x20 > [ 24.845274] devres_release_all+0xa0/0x100 > [ 24.849361] device_unbind_cleanup+0x18/0x60 > > This is because I dropped the devm_pm_runtime_enable() which registers the > pm_runtime_disable_action(), which is called at the time the > device_unbind_cleanup() is called, which is called when the ADC is not > anymore part of its PM domain. > > If I change the order in remove function proposed in this patch, thus do: > > +static void rzg2l_adc_remove(struct platform_device *pdev) > +{ > + struct device *dev = &pdev->dev; > + > + pm_runtime_dont_use_autosuspend(dev); > + pm_runtime_disable(dev); > } > > nothing changes with the behavior of this patch. There will be no issue if > the device is runtime suspended/resumed through the > pm_runtime_dont_use_autosuspend() because at the time the > rzg2l_adc_remove() is called the ADC is still part of the PM domain. > > > > > > >> > >> > >>> Can you still do that with a devm callback just not > >>> the standard one? > >> > >> No. It doesn't matter if we call the standard devm callback or driver > >> specific one. As long as it is devm it will impact us as long as the driver > >> specific runtime PM APIs are called through devres_release_all() after > >> dev_pm_domain_detach(). And at that time the PM domain may be off along > >> with its clocks disabled. > > > > As above, I think that this is only the case because of the reordering > > of those two calls, not something more fundamental. > > I tried having a local devm function (the following diff applied with this > patch reverted) identical with pm_runtime_disable_action(): > > diff --git a/drivers/iio/adc/rzg2l_adc.c b/drivers/iio/adc/rzg2l_adc.c > index 22a581c894f8..459cc9c67eec 100644 > --- a/drivers/iio/adc/rzg2l_adc.c > +++ b/drivers/iio/adc/rzg2l_adc.c > @@ -423,6 +423,12 @@ static int rzg2l_adc_hw_init(struct device *dev, > struct rzg2l_adc *adc) > return ret; > } > > +static void rzg2l_pm_runtime_disable(void *data) > +{ > + pm_runtime_dont_use_autosuspend(data); > + pm_runtime_disable(data); > +} > + > static int rzg2l_adc_probe(struct platform_device *pdev) > { > struct device *dev = &pdev->dev; > @@ -463,7 +469,9 @@ static int rzg2l_adc_probe(struct platform_device *pdev) > > pm_runtime_set_autosuspend_delay(dev, 300); > pm_runtime_use_autosuspend(dev); > - ret = devm_pm_runtime_enable(dev); > + pm_runtime_enable(dev); > + > + ret = devm_add_action_or_reset(dev, rzg2l_pm_runtime_disable, dev); > if (ret) > return ret; > > With this the issue is still reproducible. > > However, changing the order of functions in rzg2l_pm_runtime_disable() and > having it like: > > +static void rzg2l_pm_runtime_disable(void *data) > +{ > + pm_runtime_disable(data); > + pm_runtime_dont_use_autosuspend(data); > +} > > > leads to no failure when doing unbind/bind. > > However, I see the pm_runtime_disable() can still call rpm_resume() under > certain conditions. It can still lead to failures if it is called after the > device was remove from its PM domain. > > > > > In driver remove flow, device_unbind_cleanup9() is called > > just after device_remove() which is calling the dev->driver_remove() > > callback. There are no runtime pm related calls in between that I can see. > > On my side the device_remove() is calling dev->bus->remove() which is > platform_remove(), which calls the dev_pm_domain_detach(). The > dev_pm_domain_detach() detaches the ADC from it's PM domain. Because of > this, accessing now the ADC registers after a runtime resume leads to > failures pointed in this patch (as of my investigation) (as the ADC is not > anymore part of its PM domain and its PM domain is not started anymore > though runtime PM APIs). > > A similar issue was found while I was adding thermal support for RZ/G3S, > explained in > https://lore.kernel.org/all/20250103163805.1775705-3-claudiu.beznea.uj@xxxxxxxxxxxxxx > > > Jonathan, Rafael, Ulf, all, > > Do consider OK to change the order in pm_runtime_disable_action() to get > rid of these issues, e.g.: > > diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c > index 2ee45841486b..f27d311d2619 100644 > --- a/drivers/base/power/runtime.c > +++ b/drivers/base/power/runtime.c > @@ -1547,8 +1547,8 @@ EXPORT_SYMBOL_GPL(pm_runtime_enable); > > static void pm_runtime_disable_action(void *data) > { > - pm_runtime_dont_use_autosuspend(data); > pm_runtime_disable(data); > + pm_runtime_dont_use_autosuspend(data); > } > > though I see a rpm_resume() call is still possible though pm_runtime_disable(). This isn't the right fix. I was just trying to get to the bottom of why your fix worked and reordering was a false path. If we go ahead with this patch, then put them in the same order as in pm_runtime_disable_action() and add a nice big comment on why we have to do them manually. Now you've talked me through the call of platform_remove() I can see the dev_pm_domain_detach(). It seems odd that is called before the devres manged part of device tear down. The change here to me smells like a hack to get around that and looks like a bad idea from a maintenance point of view. Rafael/all, right approach or should we do something else? Jonathan > > Thank you, > Claudiu > > > > > So you are moving these calls a little earlier but not in a fashion that > > seems to have any involvement with anything else. > > > > > > Call stack being > > device_release_driver_internal() > > -> __device_release_driver() > > -> device_remove() where you are now calling the disables > > .. some dma stuff > > -> device_unbind_cleanup() where the calling of disables previously was. > > > > other than that unrelated DMA stuff there is nothing between the > > two calls. > > > > There is runtime PM stuff before any of this, but not in the crucial > > portion this code affects. > > > > So I am thinking the only change that actually matters is the trivial > > reorder mentioned above. > > > > > > > > > > > >> > >>> > >>> > >>>> __pm_runtime_use_autosuspend() -> > >>>> update_autosuspend() -> > >>>> rpm_idle() > >>>> > >>>> The rpm_idle() function attempts to runtime resume the ADC device. > >>> > >>> Can you give a little more on that path. I'm not immediately spotting > >>> how rpm_idle() is causing a resume > >> > >> It is not in particular about the resume. Runtime suspend/resume after > >> devres_release_all() will be affected. > >> > >> In particular, the rpm_idle() can call rpm_suspend() (called on the out > >> label of rpm_idle()) and rpm_suspend() may call the driver specific > >> runtime_suspend API through the following code (from the rpm_suspend() > >> function): > >> > >> callback = RPM_GET_CALLBACK(dev, runtime_suspend); > >> > >> > >> > >> dev_pm_enable_wake_irq_check(dev, true); > >> > >> retval = rpm_callback(callback, dev); > >> > >> if (retval) > >> > >> goto fail; > >> > >> > >> > >> The full stack generated when running: > >> # cd /sys/bus/platform/drivers/rzg2l-adc > >> # while :; do echo 10058000.adc > unbind ; echo 10058000.adc > bind; done > >> > >> is as follows: > >> > >> [ 24.801195] Call trace: > >> [ 24.803633] rzg2l_adc_pm_runtime_suspend+0x18/0x54 (P) > >> [ 24.808847] pm_generic_runtime_suspend+0x2c/0x44 (L) > >> [ 24.813887] pm_generic_runtime_suspend+0x2c/0x44 > >> [ 24.818580] __rpm_callback+0x48/0x198 > >> [ 24.822319] rpm_callback+0x68/0x74 > >> [ 24.825798] rpm_suspend+0x100/0x578 > >> [ 24.829362] rpm_idle+0xd0/0x17c > >> [ 24.832582] update_autosuspend+0x30/0xc4 > >> [ 24.836580] pm_runtime_disable_action+0x40/0x64 > >> [ 24.841184] devm_action_release+0x14/0x20 > >> [ 24.845274] devres_release_all+0xa0/0x100 > >> [ 24.849361] device_unbind_cleanup+0x18/0x60 > >> [ 24.853618] device_release_driver_internal+0x1ec/0x228 > >> [ 24.858828] device_driver_detach+0x18/0x24 > >> [ 24.862998] unbind_store+0xb4/0xb8 > >> [ 24.866478] drv_attr_store+0x24/0x38 > >> [ 24.870135] sysfs_kf_write+0x44/0x54 > >> [ 24.873795] kernfs_fop_write_iter+0x118/0x1a8 > >> [ 24.878229] vfs_write+0x2ac/0x358 > >> [ 24.881627] ksys_write+0x68/0xfc > >> [ 24.884934] __arm64_sys_write+0x1c/0x28 > >> [ 24.888846] invoke_syscall+0x48/0x110 > >> [ 24.892592] el0_svc_common.constprop.0+0xc0/0xe0 > >> [ 24.897285] do_el0_svc+0x1c/0x28 > >> [ 24.900593] el0_svc+0x30/0xd0 > >> [ 24.903647] el0t_64_sync_handler+0xc8/0xcc > >> [ 24.907821] el0t_64_sync+0x198/0x19c > >> [ 24.911481] Code: 910003fd f9403c00 f941bc01 f9400020 (b9400000) > > > > Thanks, that was helpful. > > > >> > >> > >> Digging it further today: on the Renesas RZ/G3S we implement the power > >> domain on/off and we register the PM domain with GENPD_FLAG_PM_CLK. The > >> on/off PM domain functionality is implemented though the clock controller > >> MSTOP functionality which blocks the bus access to the specific IP (in this > >> particular case to the ADC). > >> > >> The issue is reproducible when doing: > >> # cd /sys/bus/platform/drivers/rzg2l-adc > >> # while :; do echo 10058000.adc > unbind ; echo 10058000.adc > bind; done > >> > >> I noticed today that doing single manual unbind+bind doesn't always leads > >> to aborts. It may be related to the fact that, as I noticed, the genpd > >> power off is called asynchronously as a work (through > >> genpd_power_off_work_fn()). > >> > >> I also noticed today what when there is no on/off functionality implemented > >> on the PM domain we have no failures (as the MSTOP is not implemented and > >> the bus access to the specific IP is not blocked as there is no PM domain > >> on/off available). > > > > The PM domain stuff is only called later in device_unbind_cleanup() > > so I don't see the relevance. All the code you are modifying is > > done before that happens. > > > > Jonathan > > > > > >> > >> > >> > >>> > >>>> However, > >>>> at the point it is called, the ADC device is no longer part of the PM > >>>> domain (which manages the ADC clocks). Since the rzg2l_adc runtime PM > >>>> APIs directly modifies hardware registers, the > >>>> rzg2l_adc_pm_runtime_resume() function is invoked without the ADC clocks > >>>> being enabled. This is because the PM domain no longer resumes along with > >>>> the ADC device. As a result, this leads to system aborts. > >>>> > >>>> Drop the devres API for runtime PM enable. > >>>> > >>>> Fixes: 89ee8174e8c8 ("iio: adc: rzg2l_adc: Simplify the runtime PM code") > >>>> Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@xxxxxxxxxxxxxx> > >>> > >>> See below. I'm doubtful in general about the sequence changes and > >>> specifically you can't just remove one devm callback from a driver without > >>> modifying a lot of other code / leaving really fragile ordering. > >>> > >>> Jonathan > >>> > >>>> --- > >>>> drivers/iio/adc/rzg2l_adc.c | 33 ++++++++++++++++++++++++--------- > >>>> 1 file changed, 24 insertions(+), 9 deletions(-) > >>>> > >>>> diff --git a/drivers/iio/adc/rzg2l_adc.c b/drivers/iio/adc/rzg2l_adc.c > >>>> index 883c167c0670..f12f3daf08cc 100644 > >>>> --- a/drivers/iio/adc/rzg2l_adc.c > >>>> +++ b/drivers/iio/adc/rzg2l_adc.c > >>>> @@ -464,25 +464,26 @@ static int rzg2l_adc_probe(struct platform_device *pdev) > >>>> > >>>> pm_runtime_set_autosuspend_delay(dev, 300); > >>>> pm_runtime_use_autosuspend(dev); > >>>> - ret = devm_pm_runtime_enable(dev); > >>>> - if (ret) > >>>> - return ret; > >>>> + pm_runtime_enable(dev); > >>>> > >>>> platform_set_drvdata(pdev, indio_dev); > >>>> > >>>> ret = rzg2l_adc_hw_init(dev, adc); > >>>> - if (ret) > >>>> - return dev_err_probe(&pdev->dev, ret, > >>>> - "failed to initialize ADC HW\n"); > >>>> + if (ret) { > >>>> + dev_err_probe(&pdev->dev, ret, "failed to initialize ADC HW\n"); > >>>> + goto rpm_disable; > >>>> + } > >>>> > >>>> irq = platform_get_irq(pdev, 0); > >>>> - if (irq < 0) > >>>> - return irq; > >>>> + if (irq < 0) { > >>>> + ret = irq; > >>>> + goto rpm_disable; > >>>> + } > >>>> > >>>> ret = devm_request_irq(dev, irq, rzg2l_adc_isr, > >>>> 0, dev_name(dev), adc); > >>>> if (ret < 0) > >>>> - return ret; > >>>> + goto rpm_disable; > >>>> > >>>> init_completion(&adc->completion); > >>>> > >>>> @@ -493,6 +494,19 @@ static int rzg2l_adc_probe(struct platform_device *pdev) > >>>> indio_dev->num_channels = adc->data->num_channels; > >>>> > >>>> return devm_iio_device_register(dev, indio_dev); > >>>> + > >>>> +rpm_disable: > >>>> + pm_runtime_disable(dev); > >>>> + pm_runtime_dont_use_autosuspend(dev); > >>>> + return ret; > >>> If you have to move away from devm you must do it for all calls after > >>> the first thing that is manually cleaned up. > >>> As you have it here the userspace interfaces are left available at a point > >>> well after power down. > >> > >> I see, thank you for pointing it. > >> > >> And thank you for checking this, > >> Claudiu > >> > >>> > >>>> +} > >>>> + > >>>> +static void rzg2l_adc_remove(struct platform_device *pdev) > >>>> +{ > >>>> + struct device *dev = &pdev->dev; > >>>> + > >>>> + pm_runtime_disable(dev); > >>>> + pm_runtime_dont_use_autosuspend(dev); > >>>> } > >>>> > >>>> static const struct rzg2l_adc_hw_params rzg2l_hw_params = { > >>>> @@ -614,6 +628,7 @@ static const struct dev_pm_ops rzg2l_adc_pm_ops = { > >>>> > >>>> static struct platform_driver rzg2l_adc_driver = { > >>>> .probe = rzg2l_adc_probe, > >>>> + .remove = rzg2l_adc_remove, > >>>> .driver = { > >>>> .name = DRIVER_NAME, > >>>> .of_match_table = rzg2l_adc_match, > >>> > >> > >> > > > > >