On Wed, Jul 27, 2022 at 06:31:48PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@xxxxxxxx> wrote:
> >
> > On 26.07.22 17:41, Rafael J. Wysocki wrote:
> > > On Tue, Jul 26, 2022 at 11:05 AM Oliver Neukum <oneukum@xxxxxxxx> wrote:
> >
> > > I guess that depends on what is regarded as "the framework". I mean
> > > the PM-runtime code, excluding the bus type or equivalent.
> >
> > Yes, we have multiple candidates in the generic case. Easy to overengineer.
> >
> > >>> The idea was that drivers would clear these errors.
> > >>
> > >> I am afraid that is a deeply hidden layering violation. Yes, a driver's
> > >> resume() method may have failed. In that case, if that is the same
> > >> driver, it will obviously already know about the failure.
> > >
> > > So presumably it will do something to recover and avoid returning the
> > > error in the first place.
> >
> > Yes, but that does not help us if they do return an error.
> >
> > > From the PM-runtime core code perspective, if an error is returned by
> > > a suspend callback and it is not -EBUSY or -EAGAIN, the subsequent
> > > suspend is also likely to fail.
> >
> > True.
> >
> > > If a resume callback returns an error, any subsequent suspend or
> > > resume operations are likely to fail.
> >
> > Also true, but the consequences are different.
> >
> > > Storing the error effectively prevents subsequent operations from
> > > being carried out in both cases and that's why it is done.
> >
> > I am afraid seeing these two operations as equivalent for this
> > purpose is a problem for two reasons:
> >
> > 1. suspend can be initiated by the generic framework
>
> Resume can be initiated by generic code too.
>
> > 2. a failure to suspend leads to worse power consumption,
> > while a failure to resume is -EIO, at best
>
> Yes, a failure to resume is a big deal.
>
> > >> PM operations, however, are operating on a tree. A driver requesting
> > >> a resume may get an error code. But it has no idea where this error
> > >> comes from. The generic code knows at least that.
> > >
> > > Well, what do you mean by "the generic code"?
> >
> > In this case the device model, which has the tree and all dependencies.
> > Error handling here is potentially very complicated because
> >
> > 1. a driver can experience an error from a node higher in the tree
>
> Well, there can be an error coming from a parent or a supplier, but
> the driver will not receive it directly.
>
> > 2. a driver can trigger a failure in a sibling
> > 3. a driver for a node can be less specific than the drivers higher up
>
> I'm not sure I understand the above correctly.
>
> > Reducing this to a single error condition is difficult.
>
> Fair enough.
>
> > Suppose you have a USB device with two interfaces. The driver for A
> > initiates a resume. Interface A is resumed; B reports an error.
> > Should this block further attempts to suspend the whole device?
>
> It should IMV.
>
> > >> Let's look at at a USB storage device. The request to resume comes
> > >> from sd.c. sd.c is certainly not equipped to handle a PCI error
> > >> condition that has prevented a USB host controller from resuming.
> > >
> > > Sure, but this doesn't mean that suspending or resuming the device is
> > > a good idea until the error condition gets resolved.
> >
> > Suspending clearly yes. Resuming is another matter. It has to work
> > if you want to operate without errors.
>
> Well, it has to physically work in the first place. If it doesn't,
> the rest is a bit moot, because you end up with a non-functional
> device that appears to be permanently suspended.
>
> > >> I am afraid this part of the API has issues. And they keep growing
> > >> the more we divorce the device driver from the bus driver, which
> > >> actually does the PM operation.
> > >
> > > Well, in general suspending or resuming a device is a collaborative
> > > effort and if one of the pieces falls over, making it work again
> > > involves fixing up the failing piece and notifying the others that it
> > > is ready again. However, that part isn't covered and I'm not sure if
> > > it can be covered in a sufficiently generic way.
> >
> > True. But that still cannot solve the question what is to be done
> > if error handling fails. Hence my proposal:
> > - record all failures
> > - heed the record only when suspending
>
> I guess that would boil down to moving the power.runtime_error update
> from rpm_callback() to rpm_suspend()?

Resuming this discussion.

One way device drivers currently clear the runtime_error flag is by
calling pm_runtime_set_suspended() [1].

To me, it feels weird that a device driver has to call
pm_runtime_set_suspended() after its runtime_resume() callback has
failed. A failed resume should already imply that the device is still
in the suspended state.

So how should the runtime_error flag really be cleared? Should a new
API be exposed to device drivers for this, or should the framework
handle it itself?

1: https://lore.kernel.org/all/20250129124009.1039982-3-jacek.lawrynowicz@xxxxxxxxxxxxxxx/
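
To make the two behaviours under discussion concrete, here is a minimal
user-space sketch. It is not kernel code and not the actual PM-runtime
implementation: every toy_* name, the fail_count field and the
heed_only_on_suspend switch are hypothetical, made up for illustration.
It only models the idea that a stored error blocks later operations, that
a pm_runtime_set_suspended()-style call clears it, and what "heed the
record only when suspending" could look like.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the PM-runtime state of one device (NOT kernel code). */
enum toy_status { TOY_ACTIVE, TOY_SUSPENDED };

struct toy_dev {
	enum toy_status status;
	int runtime_error;         /* models power.runtime_error */
	bool heed_only_on_suspend; /* hypothetical: true selects the quoted proposal */
	int fail_count;            /* hypothetical: remaining failures of the flaky callback */
	int (*suspend_cb)(struct toy_dev *dev);
	int (*resume_cb)(struct toy_dev *dev);
};

/* Run a callback and record hard failures; -EBUSY and -EAGAIN are not stored. */
static int toy_callback(struct toy_dev *dev, int (*cb)(struct toy_dev *dev))
{
	int ret = cb ? cb(dev) : 0;

	if (ret && ret != -EBUSY && ret != -EAGAIN)
		dev->runtime_error = ret;
	return ret;
}

/* Suspend is refused in both modes while an error is recorded. */
int toy_suspend(struct toy_dev *dev)
{
	int ret;

	assert(dev != NULL);
	if (dev->runtime_error)
		return -EINVAL;
	if (dev->status == TOY_SUSPENDED)
		return 0;
	ret = toy_callback(dev, dev->suspend_cb);
	if (!ret)
		dev->status = TOY_SUSPENDED;
	return ret;
}

/* Resume is refused on a recorded error unless the proposal mode is on. */
int toy_resume(struct toy_dev *dev)
{
	int ret;

	assert(dev != NULL);
	if (dev->runtime_error && !dev->heed_only_on_suspend)
		return -EINVAL;
	if (dev->status == TOY_ACTIVE)
		return 0;
	ret = toy_callback(dev, dev->resume_cb);
	if (!ret)
		dev->status = TOY_ACTIVE;
	return ret;
}

/* Models the driver-side workaround: force the status and clear the error. */
void toy_set_suspended(struct toy_dev *dev)
{
	assert(dev != NULL);
	dev->status = TOY_SUSPENDED;
	dev->runtime_error = 0;
}

/* Example resume callback: returns -EIO while fail_count is positive. */
int toy_flaky_resume(struct toy_dev *dev)
{
	if (dev->fail_count > 0) {
		dev->fail_count--;
		return -EIO;
	}
	return 0;
}
```

With heed_only_on_suspend left false, one failed toy_resume() makes every
later toy_resume() and toy_suspend() return -EINVAL until
toy_set_suspended() is called, which is the pattern questioned above. With
it set to true, a retry of toy_resume() is allowed, while toy_suspend()
stays blocked until the recorded error is cleared. The sketch deliberately
leaves open who should clear the record in that mode, since that is the
open question here.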