On Sat, 18 Nov 2006, Starikovskiy, Alexey Y wrote:
>
> May because it does not have a single common line with the previous
> patch?

Yeah, I do agree that it _looks_ very different as a patch, but it ends
up having all the same execution profiles..

It's been too long since I debugged the previous problem, so I don't
remember the exact details any more (back then I enabled ACPI debugging
and watched the messages scroll by etc - this time I initially thought
it was interrupt-related due to the other irq problems we've had, so I
started bisecting immediately _without_ doing any ACPI debugging stuff,
and by the time I actually bisected down enough, I recognized the
problem, so I didn't do all the same "enable ACPI messages and look
deeply into what is going on" thing).

But if I remember correctly, what happens is _roughly_ something like
this:

 - thermal event happens - the CPU is getting warm, and the fan needs to
   start up. Quite often, this happened early during boot (which is
   quite busy - some init scripts are disgustingly CPU-intensive mainly
   due to using inefficient scripting languages), but if it didn't
   happen there, it's easy enough to force it to happen other ways.

 - part of the handling is "acpi_os_execute()" for something (don't ask
   me what), but the interesting thing is how that "acpi_os_execute()"
   then ends up causing a _recursive_ event.

 - we handle the original event in kacpid, and hand over the new one as
   a notification event. But the event keeps on happening, and kacpid
   keeps on running, and the other thread doesn't actually ever _run_
   because kacpid holds the ACPI lock and is constantly busy.

 - we not only are constantly running in kernel space, we also end up
   eventually running out of memory for allocating all the work queue
   entries.
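
The sequence above can be sketched as a toy model. This is not the ACPI
code: it simply _assumes_ that running the notification is what quiets
the event source (here, "the fan starts"), and the names simulate(),
split_queues and fan_started are made up for illustration. It only
shows why ordering differs between one shared FIFO and two queues where
the second is serviced only when the first is idle.

```python
from collections import deque


def simulate(split_queues, max_steps=1000):
    """Toy model; returns (fan_started, pending_notifications, steps_run)."""
    events = deque(["event"])
    # One shared FIFO, or a second queue that only runs when `events` is idle
    # (standing in for a thread that never gets the lock while kacpid is busy).
    notifies = deque() if split_queues else events
    fan_started = False  # ASSUMPTION: the notification is what stops the events
    steps = 0

    while steps < max_steps:
        if events:
            item = events.popleft()
        elif notifies:
            item = notifies.popleft()
        else:
            break  # nothing left to do
        steps += 1

        if item == "notify":
            fan_started = True
        elif not fan_started:
            # Handling the event queues a notification, and the still-active
            # condition immediately raises the event again.
            notifies.append("notify")
            events.append("event")

    pending = sum(1 for i in notifies if i == "notify")
    return fan_started, pending, steps


if __name__ == "__main__":
    print("shared queue:", simulate(split_queues=False))
    print("split queues:", simulate(split_queues=True, max_steps=50))
```

With the shared queue the model settles after three steps. With split
queues it runs until max_steps, and the backlog of unhandled
notifications grows by one on every step.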

So the reason the old code works is that everything is done in a single
thread, and yes, we end up getting multiple events, but because they are
all queued onto the same queue that is _handling_ the events in the
first place, and because it's a FIFO queue, the notification events get
handled _before_ the later events.

So with the single-threaded situation, you basically end up always doing
the events in the same order they came in. In the "two separate threads"
case, you don't, and one thread will end up generating events forever,
waiting for them to happen, but they never _do_ happen, so you have a
lockup _and_ eventually an infinite event queue for the other thread.

> Or may be because it fixes all the current AMD-HP notebooks?
> Or may be because it did not fail while being in -mm?

I'm afraid that -mm doesn't get as much testing as it used to get.

Also, I do realize that the patch fixes other problems, but we have long
had a very strict policy that we do NOT accept regressions. Immediately
when you start accepting regressions, you will never know whether you're
going forward or backwards. It's better to have a known _old_ bug than
to introduce a new one.

So the "no regressions!" rule ends up trumping pretty much every single
other issue. It's unacceptable to have machines that used to work
suddenly stop working. Even if it fixes another machine.

ACPI didn't use to have that rule, and it was wild and crazy. Maybe more
bugs got fixed, but the problem with accepting regressions is that
nobody can _ever_ trust that system. You do not want to have people
_afraid_ of upgrading - they should feel confident that upgrading never
introduces any new problems.

(Of course, that can never be reached 100%, but it's very much part of
the goal. It kind of falls into the same "backwards compatibility on
interfaces" absolute goal: it's ok to do new things, but you can never
allow them to break old programs)

> I will not "sneak it in" again, I promise.

Feel free to send me test patches when working on these things, because
I have no trouble at all testing on my particular machine. I think
you'll find the ACPI dumps etc for that machine in your archives,
because I've sent them to Len and the acpi lists several times, but if
you want to get AML disassemblies etc, just tell me how. I've done them
before, but I work on this seldom enough that I always forget what the
magic incantations are, and where to get the tools etc.

		Linus