[linux-pm] [PATCH 2/2] Fix console handling during suspend/resume

torvalds at osdl.org (Linus Torvalds) · Thu, 22 Jun 2006 12:23:26 -0700 (PDT)

On Thu, 22 Jun 2006, David Brownell wrote:

> On Thursday 22 June 2006 9:10 am, Linus Torvalds wrote:
> > 
> > The fact that worries me is that suspend-to-ram DOES NOT WORK FOR PEOPLE. 
> > I have never _ever_ met a laptop or machine of mine that "just worked". 
> > I've always had to fix something, and people always end up having to do 
> > something ridiculous like unlink all modules etc.
> 
> And when I've looked at the causes of such problems, they've been
> either (a) driver bugs, or (b) ACPI bugs.  As you know, both of
> them are hard to debug, especially when the symptom is on resume
> paths with no console.  (Oooh, see $SUBJECT, this isn't offtopic!!)

EXACTLY.

We're back to square one.

The #1 problem _by_far_ with suspend has absolutely ZERO to do with 
suspend being "hard", block device queues, or how to save driver state per 
se.

Each individual driver tends to be fairly easy to fix, I'd say. I suspect 
that even USB in the end is just a "Small Matter Of Programming", but it's 
a total bitch to debug.

Our problem is that it's damn hard to debug the mess, AND A LARGE PART OF 
THAT IS THAT STUPID INTERFACE!

Let's revisit why I want to do as much _independently_ of actually calling 
suspend() on a device again:

 - debugging is basically impossible during the _actual_ suspend sequence.

This is why we want (nay, NEED) to split that "suspend()" function up, 
so that it doesn't do five different things. The more we can do _outside_ 
of suspend(), the better. Exactly because suspend() is a total bitch to 
debug, and because in order to actually do things like printk() and use 
netconsole, we want to minimize the amount of code that gets run in that 
state.

So I simply DO NOT CARE about stupid people doing operations that change 
the state of a device at the same time as a suspend. It's so far off my 
radar that it's not even funny. If you do something stupid, and the 
machine doesn't come up, it's YOUR fault.

I want the machine to come back when you _don't_ do anything stupid, and 
in order to do that, we need to make the suspend sequence more debuggable.

What I actually _care_ about is that I can have drivers do "printk()" in 
their "save_state()" routines, and we can have a debug mode that logs them 
to disk, and even do a "sync()" before the suspend() that hangs the 
machine, and we can get a f*cking clue about what is so special about that 
machine that it never comes back.

And there's NOT A WAY IN HELL we can do that with the current setup, 
exactly because the current "suspend()" does five different things, and 
trying to log anything even half-way informative at all (even to screen, 
but much less to network or to disk) is just not going to work at all, 
because by the time we hit half the devices, we've done things that 
make logging impossible.

The actual final suspend() action will always be that way. There's nothing 
we can do about that (although my other patch - the [1/2] in the series 
that became the start of this thread - tries to at least put some 
infrastructure in place for that too). But we can sure as hell try to 
split that undebuggable section up, and at least make slightly _more_ of 
it debuggable.
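
As a rough illustration of the split argued for above, here is a minimal
user-space sketch in C. It is not kernel code and not the patch under
discussion: struct device, generic_save_state(), generic_suspend(),
suspend_all(), log_line() and the log_flushed flag are all hypothetical
names made up for this example. The assumption is a two-phase sequence in
which every driver first saves its state while logging still works, the log
is then flushed, and only after that does the short, undebuggable final
suspend step run.

```c
/* User-space sketch only: every name below is hypothetical, not a kernel API. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct device {
    const char *name;
    int (*save_state)(struct device *dev); /* phase 1: logging still works */
    int (*suspend)(struct device *dev);    /* phase 2: final power-down, no logging */
    int state_saved;
    int powered;
};

static char log_buf[1024]; /* stands in for the console, netconsole or a log file */
static int log_flushed;    /* stands in for a sync() of that log before phase 2 */

/* Append one "phase: device" line to the in-memory log. */
static void log_line(const char *phase, const char *name)
{
    size_t used = strlen(log_buf);

    snprintf(log_buf + used, sizeof(log_buf) - used, "%s: %s\n", phase, name);
}

/* Phase 1 callback: the device is still fully usable, so logging is safe. */
static int generic_save_state(struct device *dev)
{
    log_line("save_state", dev->name);
    dev->state_saved = 1;
    return 0;
}

/* Phase 2 callback: deliberately tiny and silent, since the console or
 * disk may already be powered down by the time it runs. */
static int generic_suspend(struct device *dev)
{
    if (!dev->state_saved)
        return -1;
    dev->powered = 0;
    return 0;
}

/* Run phase 1 on every device, flush the log, then run phase 2 in reverse. */
static int suspend_all(struct device *devs, int n)
{
    int i;

    assert(devs != NULL && n >= 0);

    for (i = 0; i < n; i++) {
        if (devs[i].save_state(&devs[i]) != 0) {
            log_line("save_state FAILED", devs[i].name);
            return -1;
        }
    }

    /* A real debug mode would sync the log to disk here. */
    log_flushed = 1;

    for (i = n - 1; i >= 0; i--) {
        if (devs[i].suspend(&devs[i]) != 0)
            return -1;
    }
    return 0;
}
```

In this sketch, a failure in any save_state() step is recorded and reported
while the log is still reachable, and whatever hangs afterwards can only be
in the small phase 2 loop.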

			Linus