I have been running six G5 machines with various FC4 kernels up to and including 2.6.14-1.1656_FC4. However, our two newest 2.7GHz dual-processor machines were being powered off by the therm_pm72 driver because of overheating. The problem was confirmed by using the /sbin/critical_overtemp callback that the therm_pm72 driver provides. Since we use these machines as compute boxes, we have been limping along with a critical_overtemp script that logs the invocation and reboots (instead of powering off).

Recently I saw that there was a patch in 2.6.15 to fix a bug in therm_pm72 that was contributing to the overtemp situation, specifically this patch:

commit 6ee7fb7e363aa8828b3920422416707c79f39007
Author: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
Date:   Mon Dec 19 11:24:53 2005 +1100

    [PATCH] powerpc: g5 thermal overtemp bug

    The g5 thermal control for liquid cooled machines has a small bug, when
    the temperatures gets too high, it boosts all fans to the max, but
    incorrectly sets the liquids pump to the min instead of the max speed,
    thus causing the overtemp condition not to clear and the machine to shut
    down after a while. This fixes it to set the pumps to max speed instead.
    This problem might explain some of the reports of random shutdowns that
    some g5 users have been reporting in the past.

    Many thanks to Marcus Rothe for spending a lot of time trying various
    patches & sending log logs before I found out that typo. Note that
    overtemp handling is still not perfect and the machine might still
    shutdown, that patch should reduce if not eliminate such occcurences in
    "normal" conditions with high load. I'll implement a better handling
    with proper slowing down of the CPUs later.

    Signed-off-by: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>

I saw that the 2.6.15-1.1823_FC4 kernel had this patch, so I tried it, but now I have a different problem.
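(As an aside, for anyone wanting a similar stopgap: below is a minimal sketch of the kind of critical_overtemp handler described above. It is not the script we actually run. The log path, the reboot command, and the function names are all assumptions made for illustration, and it assumes a Python interpreter is available on the box.)

```python
"""Hypothetical stand-in for /sbin/critical_overtemp: log, then reboot."""
import subprocess
import time

LOG_PATH = "/var/log/critical_overtemp.log"   # assumed log location
REBOOT_CMD = ["/sbin/shutdown", "-r", "now"]  # assumed reboot command


def format_entry(argv, now=None):
    """Build one log line recording when and how the handler was invoked."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s critical_overtemp invoked: %s\n" % (stamp, " ".join(argv))


def handle_overtemp(argv, log_path=LOG_PATH, run=subprocess.call):
    """Append a log entry, then ask the system to reboot.

    The log file is closed (and so flushed) before the reboot command runs.
    `run` is injectable so the logic can be exercised without rebooting.
    """
    with open(log_path, "a") as log:
        log.write(format_entry(argv))
    return run(REBOOT_CMD)
```

An installed script would finish by calling handle_overtemp(sys.argv) and exiting with its return value. That call is left out here so the sketch can be imported or pasted into an interpreter without rebooting anything.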
The cooling system runs full blast, apparently because the hardware is not receiving the usual once-a-second commands from the OS. This is very similar to the situation with old FC4 kernels such as 2.6.11-1.1369_FC4, where the therm_pm72 driver was not enabled because it only checked for PowerMac7,2 machines (as suggested by the pm72 in the driver name) and the new machines were detected as PowerMac7,3. The machines no longer reboot, but the room they are in (a shared office for three) is uninhabitable because of the noise.

Today I switched to 2.6.15-1.1824_FC4 and confirmed that the situation remains unchanged. The only superficial difference I notice between the kernels is that therm_pm72 was built as a kernel module in all FC4 kernels until now, but is now built directly into the kernel. I'm not sure whether this could be the problem, but I wanted to mention it.

In any case, for the short term my officemates and I have fled to other offices. I'll probably take a whack at debugging the problem, but I wanted to post to let people know about the problem with the new kernel, and with the old kernel for that matter!

-bri

P.S. - I'll also note that benh, the driver author, said elsewhere that the overtemp problem is partially a manufacturing problem with the later machines. We seem to be seeing that: our original 2.7GHz machine, which we got when they first came out, does not have any cooling problems even though it's our server and has a far higher load than the other machines.

-bri

-- 
fedora-test-list mailing list
fedora-test-list@xxxxxxxxxx
To unsubscribe:
https://www.redhat.com/mailman/listinfo/fedora-test-list