On Tue, Feb 18, 2025 at 6:45 AM Ajay Agarwal <ajayagarwal@xxxxxxxxxx> wrote: > > On Tue, Feb 18, 2025 at 11:07:19AM +0530, Ajay Agarwal wrote: > > On Mon, Feb 17, 2025 at 09:23:18PM +0100, Rafael J. Wysocki wrote: > > > On Mon, Feb 17, 2025 at 4:49 AM Ajay Agarwal <ajayagarwal@xxxxxxxxxx> wrote: > > > > > > > > On Wed, Feb 12, 2025 at 08:29:34PM +0100, Rafael J. Wysocki wrote: > > > > > On Tue, Feb 11, 2025 at 11:21 PM Brian Norris <briannorris@xxxxxxxxxx> wrote: > > > > > > > > > > > > Hi Ajay, > > > > > > > > > > > > On Mon, Feb 10, 2025 at 09:02:41AM +0530, Ajay Agarwal wrote: > > > > > > > On Wed, Jul 27, 2022 at 06:31:48PM +0200, Rafael J. Wysocki wrote: > > > > > > > > On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@xxxxxxxx> wrote: > > > > > > > > > On 26.07.22 17:41, Rafael J. Wysocki wrote: > > > > > > > > > > Well, in general suspending or resuming a device is a collaborative > > > > > > > > > > effort and if one of the pieces falls over, making it work again > > > > > > > > > > involves fixing up the failing piece and notifying the others that it > > > > > > > > > > is ready again. However, that part isn't covered and I'm not sure if > > > > > > > > > > it can be covered in a sufficiently generic way. > > > > > > > > > > > > > > > > > > True. But that still cannot solve the question what is to be done > > > > > > > > > if error handling fails. Hence my proposal: > > > > > > > > > - record all failures > > > > > > > > > - heed the record only when suspending > > > > > > > > > > > > > > > > I guess that would boil down to moving the power.runtime_error update > > > > > > > > from rpm_callback() to rpm_suspend()? > > > > > > > Resuming this discussion. One of the ways the device drivers are > > > > > > > clearing the runtime_error flag is by calling pm_runtime_set_suspended > > > > > > > [1]. > > > > > > > > > > I personally think that jumping on a 2.5 years old thread is not a > > > > > good idea. It would be better to restate the problem statement and > > > > > provide the link to the previous discussion. > > > > > > > > > > > > To me, it feels weird that a device driver calls pm_runtime_set_suspended > > > > > > > if the runtime_resume() has failed. It should be implied that the device > > > > > > > is in suspended state if the resume failed. > > > > > > > > > > > > > > So how really should the runtime_error flag be cleared? Should there be > > > > > > > a new API exposed to device drivers for this? Or should we plan for it > > > > > > > in the framework itself? > > > > > > > > > > > > While the API naming is unclear, that's exactly what > > > > > > pm_runtime_set_suspended() is about. Personally, I find it nice when a > > > > > > driver adds the comment "clear runtime_error flag", because otherwise > > > > > > it's not really obvious why a driver has to take care of "suspending" > > > > > > after a failed resume. But that's not the biggest question here, IMO. > > > > > > > > > > > > The real reson I pointed you at this thread was because I think it's > > > > > > useful to pursue the proposal above: to avoid setting a persistent > > > > > > "runtime_error" for resume failures. This seems to just create a pitfall > > > > > > for clients, as asked by Vincent and Oliver upthread. > > > > > > > > > > > > And along this line, there are relatively few drivers that actually > > > > > > bother to reset this error flag ever (e.g., commit f2bc2afe34c1 > > > > > > ("accel/ivpu: Clear runtime_error after pm_runtime_resume_and_get() > > > > > > fails")). > > > > > > > > > > > > So to me, we should simply answer Rafael's question: > > > > > > > > > > > > (repeated:) > > > > > > > > I guess that would boil down to moving the power.runtime_error update > > > > > > > > from rpm_callback() to rpm_suspend()? > > > > > > > > > > > > Yes, I think so. (Although I'm not sure if this leaves undesirable spam > > > > > > where persistent .runtime_resume() failures occur.) > > > > > > > > > > > > ...and then write/test/submit such a patch, provided it achieves the > > > > > > desired results. > > > > > > > > > > > > Unless of course one of the thread participants here has some other > > > > > > update in the intervening 2.5 years, or if Rafael was simply asking the > > > > > > above rhetorically, and wasn't actually interested in fielding such a > > > > > > change. > > > > > > > > > > The reason why runtime_error is there is to prevent runtime PM > > > > > callbacks from being run until something is done about the error, > > > > > under the assumption that running them in that case may make the > > > > > problem worse. > > > > > > > > > > I'm not sure if I see a substantial difference between suspend and > > > > > resume in that respect: If any of them fails, the state of the device > > > > > is kind of unstable. In particular, if resume fails and the device > > > > > doesn't actually resume, something needs to be done about it or it > > > > > just becomes unusable. > > > > > > > > > > Now, the way of clearing the error may not be super-convenient, which > > > > > was a bit hard to figure out upfront, so I'm not against making any > > > > > changes as long as there are sufficient reasons for making them. > > > > > > > > I am thinking if we can start with a change to not check runtime_error > > > > in rpm_resume, and let it go through even if the previous rpm_resume > > > > attempt failed. Something like this: > > > > > > > > ``` > > > > static int rpm_resume(struct device *dev, int rpmflags) > > > > trace_rpm_resume(dev, rpmflags); > > > > > > > > repeat: > > > > - if (dev->power.runtime_error) { > > > > - retval = -EINVAL; > > > > - } else if (dev->power.disable_depth > 0) { > > > > + if (dev->power.disable_depth > 0) { > > > > if (dev->power.runtime_status == RPM_ACTIVE && > > > > dev->power.last_status == RPM_ACTIVE) > > > > retval = 1; > > > > ``` > > > > > > > > I think setting the runtime_error in rpm_callback, i.e. for both resume > > > > and suspend is still a good idea for book-keeping purposes, e.g. the > > > > user reading the runtime_status of the device from sysfs. > > > > > > What would be the benefit of this change? > > The benefit would be that the runtime_resume would be re-attempted even if > > the previous attempt failed. > > Actually, I wanted to propose the removal of `runtime_error` flag > completely from the code. Why? > But it sounded too disruptive to me. Hence, I > proposed the milder patch of removal of `runtime_error` check from > rpm_resume so that the drivers do not have to call > `pm_runtime_set_suspended` explicitly. > > Basically, we still do not have a good solution for the situation where > one of the ancestors fails to resume. If an ancestor fails to resume, runtime_error is not set for the device at hand. rpm_resume() returns -EBUSY without setting runtime_error in that case. > We do not know how to make the > ancestor working again. That's true, but poking at it again and again may not work either. > But I guess a re-attempt is better than not > doing anything about it? I would start with changing rpm_callback() to avoid setting runtime_error if the retval is -EBUSY or -EAGAIN, to make rpm_resume() work consistently with rpm_suspend() in that respect. Then, no corrective action would be needed after returning one of these from the callback. I'll send a patch for this shortly. -EACCES is kind of a grey area. It is converted to -EIO, but then it indicates an error that can be corrected outside the driver and is not related to the hardware, so it should be OK to avoid setting runtime_error in that case either, but then it might be better to convert it to -EAGAIN. Returning any other error codes indicates a hardware issue, however, and those should be corrected by the driver handling the given piece of hardware.