On Monday 23 February 2009, Eric W. Biederman wrote: > Ingo Molnar <mingo@xxxxxxx> writes: > > > * Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > > > >> Ingo Molnar <mingo@xxxxxxx> writes: > >> > >> > I think this aspect has been well-understood during the > >> > discussion of this topic and it's just a slightly misleading > >> > changelog. > >> > >> As I was a member of that discussion I did not see that. > >> > >> It took me several passes through the patches to realize the > >> goal is to allow drivers to be able to sleep while they are in > >> their late pm shutdown routines. > >> > >> Why we want this I don't know. But it seems simple enough to > >> implement, and it makes it harder to get the late pm suspend > >> routines wrong, which is always good. > > > > That's not the only goal. The other goal is to further shrink a > > particular window of suspend fragility: the irqs-disabled stage > > of the suspend/resume sequence. > > > > Since suspend/resume is a mini-reboot sequence, there's a large > > amount of code executed - and the variety of code is large as > > well. We had repeat cases of random drivers re-enabling > > interrupts and thus breaking other drivers - and these are nasty > > to debug. > > > > So this patchset disables device IRQs centrally and serializes > > with pending work - so there's no races with pending IRQs > > anymore. > > > > The fact that we keep the timer irq running is two-fold: firstly > > the timer code is special and not really part of the regular > > suspend/resume sequence. > > > > Drivers want to take timestamps, sometimes they even want to do > > a small usleep(), etc. Ideally the suspend/resume code is pretty > > much _the same_ as a regular bootup (and shutdown) code - so we > > want to provide a similar environment to how drivers initialize > > and deinitialize, and we want to enable them to share code > > between bootup/shutdown and suspend/resume agressively. > > > > So the more generic kernel environment we give these fragile > > handlers, the better we are off in the end. Since we already had > > IRQS_TIMER, that was just the natural thing to do. > > I am all for sharing code, especially if we can factor if > we can find common factors that do the same thing. > > I don't know how many times I have found drivers doing something > weird in their shutdown routines that they don't know how > to get the device out of. The e1000 driver has shown up several > times because it likes to suspend the device on shutdown. > > The fact that the methods exposed to drivers were only defined > to be usable on the s2ram/hibernate path is something I have > brought up on more than one occasion as a bad choice. > > I'm really not convinced that the rational for separating > out the shutdown methods from the remove methods has > been very good. That of we don't need to clean up the in-kernel > data structures on reboot so why do something extra that can > introduce instability. > > So having been watching a smaller form of this drama on the > reboot path for several years. Having had a device method > with fixed semantics, and not the dwm sematics of the historical > suspend routing. I expect there is still a ways to go before > it is simple and easy for drivers to figure out what they need > to implement out of the confusing variety of possible device > methods. > > >> > The new suspend code does not rely on truly disabling IRQs > >> > on the low level. The purpose is to not get IRQs to drivers > >> > - which might crash/hang/race/misbehave. > >> > >> Reasonable. I expect one of the problems with drivers getting > >> it wrong is that the interface is too complex for mortal > >> humans to understand. > > > > The suspend/resume state machine certainly used to be a piece of > > code that makes a seasoned kernel developer weep in fear. > > > > That has changed drastically in the past few months. The > > suspend+hibernation logic got unified (at least as far as driver > > methods go), and all the flow and ordering has been cleaned up > > and has been made more robust. > > I will have to look again. My impression is that overloading > a single method is part of what got us into this mess in the > first place. > > No that I don't see things getting better. > > > What makes s2ram fragile is not human failure but the > > combination of a handful of physical property: > > > > 1) Psychology: shutting the lid or pushing the suspend button is > > a deceivingly 'simple' action to the user. But under the > > hood, a ton of stuff happens: we deinitialize a lot of > > things, we go through _all hardware state_, and we do so in a > > serial fashion. If just one piece fails to do the right > > thing, the box might not resume. Still, the user expects this > > 'simple' thing to just work, all the time. No excuses > > accepted. > > > > 2) Length of code: To get a successful s2ram sequence the kernel > > runs through tens of thousands of lines of code. Code which > > never gets executed on a normal box - only if we s2ram. If > > just one step fails, we get a hung box. > > > > 3) Debuggability: a lot of s2ram code runs with the console off, > > making any bugs hard to debug. Furthermore we have no > > meaningful persistent storage either for kernel bug messages. > > The RTC trick of PM_DEBUG works but is a very narrow channel > > of information and it takes a lot of time to debug a bug via > > that method. > > Yep that is an issue. > > > The combination of these factors really makes up for a perfect > > storm in terms of kernel technology: we have this > > very-deceivingly-simple-looking but complex-and-rarely-executed > > piece of code, which is very hard to debug. > > And much of this as you are finding with this piece of code > is how the software was designed rather then how the software > needed to be. > > > Even just one of these factors would be enough to make an > > otherwise healthy subsystem fragile - no wonder s2ram has been a > > problem ever since it existed in the upstream kernel. > > > > So now we need just one thing: patience and more of the same > > good stuff that happened lately. > > I think there has been some good progress, and so I am happy > to be patient. I will still mention on occasion what it > seems we are doing wrong. Unfortunately I don't have time > to do a lot more than that. > > >> > Still, it might make sense to not just use the ->disable > >> > sequence but primarily the ->shutdown irqchip method (when > >> > it's available in the irqchip). > >> > >> Disable seems fine to me. This is interesting in the context > >> of all of the irqs that will when masked show up somewhere > >> else (think boot interrupts). > >> > >> > While we obviously cannot turn off the PIC that delivers > >> > timer IRQs at this stage - there's no theoretical reason why > >> > the suspend sequence couldnt power down some secondary PICs > >> > as well - in some arch code, or maybe even in the generic > >> > driver suspend sequence if the device tree is structured > >> > carefully enough so that the PIC gets turned off last. > >> > >> If the point is simply to prevent deliver of irqs to the > >> drivers I don't see the point of anything more than what the > >> patch does now. > > > > ... except for the usecase i described above. Say some PIC sits > > on a piece of silicon which gets turned off. I'm not talking > > about x86 but some custom device. We really dont want that IRQ > > line to send half of an IRQ message (un-ACK-ed) when it gets > > turned off. So physically 'suspending' all IRQ lines does make a > > certain level of long-term sense. > > Good point. We will loose both level and edge triggered events > that occur between suspending the irqs and restoring them but > that is inevitable. So we might as well call shutdown and totally > turn off the irqs if we can. > > I don't know where in the state machine this is getting called but > I would suggest doing this before we shutdown cpus. This is the plan. In fact, I'm going to do this in the next patch after the $subject one has been tested and found acceptable. Thanks, Rafael _______________________________________________ linux-pm mailing list linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/linux-pm