Helli Dmitry, On Tue, Oct 12, 2021 at 08:06:34PM -0700, Dmitry Torokhov wrote: > On Tue, Oct 12, 2021 at 09:39:59AM +0200, Uwe Kleine-König wrote: > > On Mon, Oct 11, 2021 at 06:48:51PM -0700, Dmitry Torokhov wrote: > > > On Tue, Oct 05, 2021 at 10:00:05PM +0200, Uwe Kleine-König wrote: > > > > On Wed, Sep 22, 2021 at 11:43:00AM +0200, Uwe Kleine-König wrote: > > > > > On i.MX7S and i.MX8M* (but not i.MX6*) the pwrkey device has an > > > > > associated clock. Accessing the registers requires that this clock is > > > > > enabled. Binding the driver on at least i.MX7S and i.MX8MP while not > > > > > having the clock enabled results in a complete hang of the machine. > > > > > (This usually only happens if snvs_pwrkey is built as a module and the > > > > > rtc-snvs driver isn't already bound because at bootup the required clk > > > > > is on and only gets disabled when the clk framework disables unused clks > > > > > late during boot.) > > > > > > > > > > This completes the fix in commit 135be16d3505 ("ARM: dts: imx7s: add > > > > > snvs clock to pwrkey"). > > > > > > > > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx> > > > > > > > > This patch fixes a hard machine hang that occurs on an i.MX8MP based > > > > machine in ~10% of the boot ups. In my eyes it's suitable to be applied > > > > before v5.14 even. > > > > > > > > Any feedback on it? > > > > > > Sorry for the delay. As you may know I strongly dislike dev_err_probe() > > > as it conflates the 2 issue - error printing and noting the deferral > > > event that should be implemented by the resource providers (and I > > > believe Rob had WIP patches to push this reporting down too providers). > > > > I didn't know your dislike (and I probably will forget it again soon, > > given that there seems to be disagreement among maintainers :-), and > > from your words I don't understand it. The improved idea is that > > devm_clk_get_optional() already registers the deferral event for the > > clk? My first intuition is that this won't work, so I'd like to see the > > WIP series. (Added Rob to Cc.) Someone has a link? > > I think this is here: > > https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git/log/?h=dev_err-removal > > I do not think it adds calls to device_set_deferred_probe_reason() but > that should be trivial to add, given we have device pointer and ID of > the resource, which should be enough to track it. So the general idea is to push the error message into the resource providers. I think this lowers the quality of error messages in some cases. E.g. when clk_get() fails, the clk core can only emit something like dev_err(dev, "Failed to get clock '%s' (%pe)\n", id, clk); but the requesting driver has more information and might emit a more helpful message. Or the requesting driver might even be able to handle an error (even if it's not -ENODEV) and emitting an error is wrong. For example because id is NULL but the driver knows a good name or a detailed purpose of the clk can be mentioned to help debugging the issue. And IMHO even with this approach it's sensible to factor out the part: if (IS_ERR(x) && PTR_ERR(x) != -EPROBE_DEFER) dev_err(dev, ...); return PTR_ERR(x); Ah, it could use dev_err_probe() which has exactly the required semantic, so no need to reinvent the wheel. I don't know your bubble of work, in mine there are often customers who say something like "Look, my kernel error log contains these errors, why is there a problem with device X?" and the explanation is just that a driver emitted an error message in the -EPROBE_DEFER case, sometimes even several of them, e.g. once whenever an USB thumb drive is inserted into the machine. (ok, if this happens there is likely indeed a problem to fix, but then a single error message would still be nicer.) So I really like drivers that don't print an error on probe deferral. Also having drivers write something helpful (today!) to /sys/kernel/debug/devices_deferred is very appreciated, as it makes debugging quite a bit easier. dev_err_probe() helps for both. All in all pushing the error message reporting into the providers seems to be orthogonal to the question how to best report an error on the requester's side while the providers are still "unfixed". And not using an approach that has upsides today and not only when a certain series (that wasn't even posted yet) landed, seems wrong to me. > > Also I don't share that sentiment, given that today > > devm_clk_get_optional() and all the other resource providers don't do > > the necessary stuff for deferral handling, I strongly prefer to use the > > mechanism that is available today (even if it might be possible to > > improve it) instead of open coding it. And if it's only because once the > > improved variant is available it's easier to identify the code locations > > that need adaption if they all use a common function instead of > > identifying something like > > > > clk = devm_clk_get_optional(&pdev->dev, NULL); > > if (IS_ERR(clk)) { > > error = PTR_ERR(clk); > > if (error != -EPROBE_DEFER) > > dev_err(pdev->dev, "Failed to get clk: %pe\n", clk) > > else > > device_set_deferred_probe_reason(dev, oh_I_need_a_struct_va_format_how_do_I_get_this?); > > You do not, you happily ignore it and wait for providers to do it for > you instead of forcing the change through all drivers. I don't get this. What change is forced here? Today (where some providers don't emit an error message) the driver has to emit an error message itself. (Either using dev_err_probe() with the above mentioned advantages, by open coding it or by using a simpler and inferior procedure.) And once devm_clk_get_optional() is adapted to emit the error message itself, you have to adapt all callers anyhow, and which of the approaches a given driver selected doesn't make any relevant difference, you just remove the error message emitting there. But until devm_clk_get_optional() is "fixed" (in a year? Or two? Ever?) you force an inferior behaviour for users with your refusal to allow dev_err_probe(). > > return error; > > } > > > > instead of > > > > clk = devm_clk_get_optional(&pdev->dev, NULL); > > if (IS_ERR(clk)) > > return dev_err_probe(&pdev->dev, PTR_ERR(clk), "Failed to get clock\n"); > > > > Even if the driver does not call device_set_deferred_probe_reason(), the > > additional check for error != -EPROBE_DEFER is ugly, isn't it? > > I'd simply do > > clk = devm_clk_get_optional(...); > error = PTR_ERR_OR_ZERO(clk); > if (error) { > dev_err(&pdev->dev, "...", error); > return error; > } Done something similar now (not using PTR_ERR_OR_ZERO() which in my experience is even more controversial than dev_err_probe()). > > > Could you p lease resubmit with "normal" dev_err()/dev_warn()/etc and I > > > will be happy to apply. > > > > Is the above the variant you prefer? Maybe without the call to > > device_set_deferred_probe_reason()? Or maybe even without the check for > > -EPROBE_DEFER (which however might result in wrong error messages which > > is IMHO worse than the ugliness of the additional check)? > > Why are they wrong? They are not. They would literally say that some > resource was not obtained because it is not ready. With your approach you emit messages at the error level (i.e. where people assume there is a real problem that needs addressing), but this information is (in most cases) irrelevant and already wrong when someone reads it. So the only effect is delaying the boot process a bit and irritating people. Yes, technically it's right to emit an error on -EPROBE_DEFER, but its usefulness is negative. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | https://www.pengutronix.de/ |
Attachment:
signature.asc
Description: PGP signature