Ok, PowerPC Decrementer 101

The processor contains a special register, the decrementer (DEC), which keeps ... decrementing. It can be set to any arbitrary value at any time and decrements in sync with the processor timebase. Implementations differ subtly in what happens when it reaches 0, but the basic idea is that you get an interrupt. Depending on the processor, that is either a kind of level interrupt, asserted for as long as the decrementer is negative, or a kind of edge interrupt, queued up when the DEC transitions from 0 to -1.

This decrementer is used as the main timer. Thus it needs to be operating normally at all times until interrupts are off, or the scheduler will stop working properly, kernel timers will not fire, etc... (And saying that platform devices should use mdelay instead is just gross, I won't even go there. Interrupts are still on -> the core kernel should operate normally, and that includes the main timer source.)

Now, what happens when we put the processors to sleep (well, most desktop processors, at least the ones that concern us in this discussion) is that they come out of sleep when an interrupt occurs, for example ... a decrementer interrupt. This is not good for STR, for various reasons related to the way STR works in hardware: the northbridge snoops that the CPU is going to sleep and starts putting things down, ultimately shutting the CPU off, and it can't really cope if the CPU wakes up right away and starts doing things.

Unfortunately, for other reasons, the procedure of putting the CPU to sleep involves turning interrupts on. For external interrupts that isn't a problem, as we have previously shut them all down on the main PIC, but it is with the DEC. The "trick" is that once interrupts are off, we want the DEC to be set to such a high value that it won't tick anytime soon (that is actually several seconds, enough in practice).
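
To make the two interrupt flavours described above concrete, here is a rough user-space model in C. It is only a sketch, not kernel code: the names (dec_model, dec_tick, dec_pending, ...) are made up for illustration, real implementations differ in the details, and the 25 MHz timebase used in the comment on dec_whole_seconds() is an assumed figure, not a value taken from any particular machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the two decrementer behaviours. */
enum dec_kind { DEC_LEVEL, DEC_EDGE };

struct dec_model {
    enum dec_kind kind;
    uint32_t value;   /* raw DEC contents; top bit set means "negative" */
    bool latched;     /* edge flavour only: an expiry has been queued */
};

/* Reload the DEC. In this model a reload never clears a queued edge. */
static void dec_set(struct dec_model *d, uint32_t v)
{
    d->value = v;
}

/* One timebase tick: count down, latching the 0 -> -1 transition. */
static void dec_tick(struct dec_model *d)
{
    if (d->kind == DEC_EDGE && d->value == 0)
        d->latched = true;
    d->value--;       /* unsigned wrap-around: 0 becomes 0xffffffff (-1) */
}

/* Is a decrementer interrupt waiting to be delivered? */
static bool dec_pending(const struct dec_model *d)
{
    if (d->kind == DEC_LEVEL)
        return (d->value & 0x80000000u) != 0;  /* asserted while negative */
    return d->latched;                          /* queued until taken */
}

/* The interrupt has been taken: drop any queued edge. */
static void dec_ack(struct dec_model *d)
{
    d->latched = false;
}

/* Whole seconds before a freshly loaded value reaches zero.
 * Example: 0x7fffffff at an assumed 25 MHz timebase. */
static uint32_t dec_whole_seconds(uint32_t value, uint32_t tb_hz)
{
    assert(tb_hz != 0);
    return value / tb_hz;
}
```

In this model, reloading a level-style DEC with a large positive value makes the pending condition go away, while an edge-style expiry that was latched while interrupts were masked stays pending until the interrupt is actually taken.
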
But if we set the DEC only after IRQs have been turned off (from a sysdev), we risk that it has ticked between turning IRQs off and our sysdev running, and thus a DEC interrupt is already "queued up" (especially on CPUs where it acts as an edge interrupt) and will screw up our attempt to put the CPU to sleep later on. The procedure we use is to set it to 0x7fffffff with IRQs on, then turn IRQs off, then set it to 0x7fffffff again in case it kicked in just before and the timer interrupt set it back to a short value. As you can imagine, those steps have to be done close together as part of the main IRQ disabling procedure, after platform devices have run (that is, when we can consider the scheduler as "off") and before sysdevs, etc...

Now, in addition to that, we have some weird motherboard stuff we need to turn off/on. It has to be done after drivers (because it renders various buses inaccessible in some cases and might cause DMA snooping to stop working; I'm not 100% sure of the details, but I know for sure it has to be done late), but it can't be done as a sysdev because we need some infrastructure, like the i2c stuff (and others), that requires semaphores and timers. It's based on something remotely akin to AML, in that we have to execute "scripts" provided by the firmware, and the code to do so needs to run in an environment where the scheduler & timers are operating.

That latter thing could be dealt with using a platform device if we could guarantee that this platform device is put to sleep last of all devices in the tree and woken up first. Right now, we have no such guarantee and no mechanism for it, and I don't see a solution showing up for 2.6.22.

In the long run, we might be able to break up that phase so that each individual device that has such functions associated has a way to call into them after the device has been put to sleep, but that involves more complication, probably hooks in the generic PCI code etc... and more ordering issues vs.
some motherboard foo, so it's definitely not on the short-term radar.

For all those reasons, I do think that the proper, clean and incremental approach to get our stuff working is to have that pair of hooks allowing us to "replace" the local_irq_disable/enable calls...

Now, it does not need to be pm_ops. I'm fine with an arch_pm_irq_quiesce() kind of thing (or find a better name if you can, maybe arch_pm_after_devices_suspend() / arch_pm_before_device_wakeup()?), with the default implementation of these just doing local_irq_disable/enable.

It's basically about quiescing the scheduler/timers, which on powerpc (because of the way the DEC operates) requires a little bit more than just a call to local_irq_disable. And once the hook is there, we can use it for some other arch-specific bits that we can't quite fit anywhere else at the moment.

Ben.
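
For illustration, the pair of hooks and the three-step DEC procedure could be sketched as the following user-space simulation. This is a sketch under assumptions, not a patch: struct sim_cpu, the race_fn callback (which stands in for a timer interrupt sneaking in before IRQs go off), the reload value of 1000, and all function names are hypothetical. Only the 0x7fffffff value and the ordering of the steps come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEC_FAR_AWAY 0x7fffffffu   /* value used in the procedure above */

/* Hypothetical stand-in for the bits of CPU state we care about. */
struct sim_cpu {
    bool irqs_enabled;
    uint32_t dec;
};

/* Simulates "something happens just before IRQs are turned off". */
typedef void (*race_fn)(struct sim_cpu *);

/* Stand-in for the regular timer interrupt: if it can still run,
 * it reprograms the DEC with a short value for the next tick. */
static void sim_timer_interrupt(struct sim_cpu *cpu)
{
    if (cpu->irqs_enabled)
        cpu->dec = 1000u;          /* arbitrary short reload, assumed */
}

/* Default hook: nothing more than local_irq_disable(). */
static void generic_pm_irq_quiesce(struct sim_cpu *cpu, race_fn race)
{
    assert(cpu != NULL);
    if (race)
        race(cpu);
    cpu->irqs_enabled = false;
}

/* powerpc-style hook: also park the DEC far in the future. */
static void powerpc_pm_irq_quiesce(struct sim_cpu *cpu, race_fn race)
{
    assert(cpu != NULL);
    cpu->dec = DEC_FAR_AWAY;       /* 1: set it with IRQs still on */
    if (race)
        race(cpu);                 /* timer interrupt may still fire here */
    cpu->irqs_enabled = false;     /* 2: turn IRQs off */
    cpu->dec = DEC_FAR_AWAY;       /* 3: set it again, undoing any reload */
}

/* Wakeup-side counterpart: plain local_irq_enable() in this sketch. */
static void pm_irq_unquiesce(struct sim_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->irqs_enabled = true;
}
```

With the default hook, a timer interrupt that runs just before IRQs go off leaves the DEC at its short reload value, so it will expire soon; with the powerpc-style hook the DEC ends up at 0x7fffffff regardless.
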