Re: PM runtime_error handling missing in many drivers?

Ajay Agarwal <ajayagarwal@xxxxxxxxxx> · Mon, 10 Feb 2025 09:02:41 +0530

On Wed, Jul 27, 2022 at 06:31:48PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@xxxxxxxx> wrote:
> >
> >
> >
> > On 26.07.22 17:41, Rafael J. Wysocki wrote:
> > > On Tue, Jul 26, 2022 at 11:05 AM Oliver Neukum <oneukum@xxxxxxxx> wrote:
> >
> > > I guess that depends on what is regarded as "the framework".  I mean
> > > the PM-runtime code, excluding the bus type or equivalent.
> >
> > Yes, we have multiple candidates in the generic case. Easy to overengineer.
> >
> > >>> The idea was that drivers would clear these errors.
> > >>
> > >> I am afraid that is a deeply hidden layering violation. Yes, a driver's
> > >> resume() method may have failed. In that case, if that is the same
> > >> driver, it will obviously already know about the failure.
> > >
> > > So presumably it will do something to recover and avoid returning the
> > > error in the first place.
> >
> > Yes, but that does not help us if they do return an error.
> >
> > > From the PM-runtime core code perspective, if an error is returned by
> > > a suspend callback and it is not -EBUSY or -EAGAIN, the subsequent
> > > suspend is also likely to fail.
> >
> > True.
> >
> > > If a resume callback returns an error, any subsequent suspend or
> > > resume operations are likely to fail.
> >
> > Also true, but the consequences are different.
> >
> > > Storing the error effectively prevents subsequent operations from
> > > being carried out in both cases and that's why it is done.
> >
> > I am afraid seeing these two operations as equivalent for this
> > purpose is a problem for two reasons:
> >
> > 1. suspend can be initiated by the generic framework
> 
> Resume can be initiated by generic code too.
> 
> > 2. a failure to suspend leads to worse power consumption,
> >    while a failure to resume is -EIO, at best
> 
> Yes, a failure to resume is a big deal.
> 
> > >> PM operations, however, are operating on a tree. A driver requesting
> > >> a resume may get an error code. But it has no idea where this error
> > >> comes from. The generic code knows at least that.
> > >
> > > Well, what do you mean by "the generic code"?
> >
> > In this case the device model, which has the tree and all dependencies.
> > Error handling here is potentially very complicated because
> >
> > 1. a driver can experience an error from a node higher in the tree
> 
> Well, there can be an error coming from a parent or a supplier, but
> the driver will not receive it directly.
> 
> > 2. a driver can trigger a failure in a sibling
> > 3. a driver for a node can be less specific than the drivers higher up
> 
> I'm not sure I understand the above correctly.
> 
> > Reducing this to a single error condition is difficult.
> 
> Fair enough.
> 
> > Suppose you have a USB device with two interfaces. The driver for A
> > initiates a resume. Interface A is resumed; B reports an error.
> > Should this block further attempts to suspend the whole device?
> 
> It should IMV.
> 
> > >> Let's look at at a USB storage device. The request to resume comes
> > >> from sd.c. sd.c is certainly not equipped to handle a PCI error
> > >> condition that has prevented a USB host controller from resuming.
> > >
> > > Sure, but this doesn't mean that suspending or resuming the device is
> > > a good idea until the error condition gets resolved.
> >
> > Suspending clearly yes. Resuming is another matter. It has to work
> > if you want to operate without errors.
> 
> Well, it has to physically work in the first place.  If it doesn't,
> the rest is a bit moot, because you end up with a non-functional
> device that appears to be permanently suspended.
> 
> > >> I am afraid this part of the API has issues. And they keep growing
> > >> the more we divorce the device driver from the bus driver, which
> > >> actually does the PM operation.
> > >
> > > Well, in general suspending or resuming a device is a collaborative
> > > effort and if one of the pieces falls over, making it work again
> > > involves fixing up the failing piece and notifying the others that it
> > > is ready again.  However, that part isn't covered and I'm not sure if
> > > it can be covered in a sufficiently generic way.
> >
> > True. But that still cannot solve the question what is to be done
> > if error handling fails. Hence my proposal:
> > - record all failures
> > - heed the record only when suspending
> 
> I guess that would boil down to moving the power.runtime_error update
> from rpm_callback() to rpm_suspend()?
Resuming this discussion. One of the ways the device drivers are
clearing the runtime_error flag is by calling pm_runtime_set_suspended
[1].

To me, it feels weird that a device driver calls pm_runtime_set_suspended
if the runtime_resume() has failed. It should be implied that the device
is in suspended state if the resume failed.

So how really should the runtime_error flag be cleared? Should there be
a new API exposed to device drivers for this? Or should we plan for it
in the framework itself?

1: https://lore.kernel.org/all/20250129124009.1039982-3-jacek.lawrynowicz@xxxxxxxxxxxxxxx/