Re: PM runtime_error handling missing in many drivers?

"Rafael J. Wysocki" <rafael@xxxxxxxxxx> · Tue, 26 Jul 2022 17:41:38 +0200

On Tue, Jul 26, 2022 at 11:05 AM Oliver Neukum <oneukum@xxxxxxxx> wrote:
>
>
>
> On 08.07.22 22:10, Rafael J. Wysocki wrote:
> > On 7/8/2022 1:03 PM, Vincent Whitchurch wrote:
>
> >> Perhaps Rafael can shed some light on this.
> >
> > The driver always knows more than the framework about the device's
> > actual state.  The framework only knows that something failed, but it
> > doesn't know what it was or in what way it failed.
>
> Hi,
>
> thinking long and deeply about this, I do not think that this seemingly
> obvious assertion is actually correct.

I guess that depends on what is regarded as "the framework".  I mean
the PM-runtime code, excluding the bus type or equivalent.

> > The idea was that drivers would clear these errors.
>
> I am afraid that is a deeply hidden layering violation. Yes, a driver's
> resume() method may have failed. In that case, if that is the same
> driver, it will obviously already know about the failure.

So presumably it will do something to recover and avoid returning the
error in the first place.

From the PM-runtime core code perspective, if an error is returned by
a suspend callback and it is not -EBUSY or -EAGAIN, the subsequent
suspend is also likely to fail.

If a resume callback returns an error, any subsequent suspend or
resume operations are likely to fail.

Storing the error effectively prevents subsequent operations from
being carried out in both cases and that's why it is done.
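
To make the latching behavior concrete, here is a rough userspace sketch.
It is a simplified model, not the kernel's implementation: struct
model_dev, model_suspend(), model_resume(), model_clear_error() and the
two sample callbacks are hypothetical names made up for illustration, and
returning -EINVAL while an error is stored is an assumption of the model.
Only the rule itself (a suspend error other than -EBUSY or -EAGAIN, or any
resume error, is stored and blocks later operations) is taken from the
description above.

```c
/* Userspace model of PM-runtime error latching. NOT kernel code. */
#include <errno.h>
#include <stddef.h>

struct model_dev {
	int runtime_error;	/* stored error; 0 means none */
	int suspended;		/* 1 = suspended, 0 = active */
	int (*suspend_cb)(struct model_dev *dev);
	int (*resume_cb)(struct model_dev *dev);
};

/* Try to suspend; refuse outright while an earlier error is stored. */
static int model_suspend(struct model_dev *dev)
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;
	if (dev->suspended)
		return 0;

	ret = dev->suspend_cb ? dev->suspend_cb(dev) : 0;
	if (ret == 0)
		dev->suspended = 1;
	else if (ret != -EBUSY && ret != -EAGAIN)
		dev->runtime_error = ret;	/* not transient: store it */
	return ret;
}

/* Try to resume; any callback error is stored. */
static int model_resume(struct model_dev *dev)
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;
	if (!dev->suspended)
		return 0;

	ret = dev->resume_cb ? dev->resume_cb(dev) : 0;
	if (ret == 0)
		dev->suspended = 0;
	else
		dev->runtime_error = ret;
	return ret;
}

/* What a driver would do once it has recovered the device by itself. */
static void model_clear_error(struct model_dev *dev, int suspended)
{
	dev->runtime_error = 0;
	dev->suspended = suspended;
}

/* Sample callbacks: a transient suspend failure and a hard resume failure. */
static int cb_busy(struct model_dev *dev)
{
	(void)dev;
	return -EBUSY;
}

static int cb_io_error(struct model_dev *dev)
{
	(void)dev;
	return -EIO;
}
```

In this model, a suspend attempt that hits cb_busy() can simply be retried,
whereas a resume attempt that hits cb_io_error() makes every later
model_suspend() or model_resume() fail until model_clear_error() is called.
In the kernel, that clearing step would roughly correspond to the driver
calling pm_runtime_set_suspended() or pm_runtime_set_active() after
recovering the device.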

> PM operations, however, are operating on a tree. A driver requesting
> a resume may get an error code. But it has no idea where this error
> comes from. The generic code knows at least that.

Well, what do you mean by "the generic code"?

> Let's look at a USB storage device. The request to resume comes
> from sd.c. sd.c is certainly not equipped to handle a PCI error
> condition that has prevented a USB host controller from resuming.

Sure, but this doesn't mean that suspending or resuming the device is
a good idea until the error condition gets resolved.

> I am afraid this part of the API has issues. And they keep growing
> the more we divorce the device driver from the bus driver, which
> actually does the PM operation.

Well, in general suspending or resuming a device is a collaborative
effort and if one of the pieces falls over, making it work again
involves fixing up the failing piece and notifying the others that it
is ready again.  However, that part isn't covered and I'm not sure if
it can be covered in a sufficiently generic way.