Comment # 26
on bug 93341
from Jean-François Fortin Tam
OK, I've got good news... Julien, thanks to the crazy furry donut "torture test" you suggested, I was finally able to pinpoint the real trigger for this bug.

My understanding is that on Radeons (well, at least the Radeon HD 7770), there is an emergency mechanism in the hardware (or maybe in the firmware/microcode) that makes the GPU throttle its own performance when it reaches a critical temperature. Normally, the video driver is supposed to handle this state change gracefully; however, the radeonsi/radeon/amdgpu driver on Linux does not, so the kernel panics because the driver went belly up.

During additional testing today, where I forced my GPU to overheat, I was able to determine that the critical point is the same as on Windows: 113 degrees Celsius. As soon as you go over 112... boom, dead radeonsi driver + kernel oops (with the same error messages as in my previous logs above). Additionally, lm_sensors thinks the temperature has instantly jumped to 511 degrees Celsius (!), and the readings stay stuck at 511 Celsius.

"Duh! Just get better cooling!" might sound like a workaround (just like keeping the case open), but nope, technically it's still a software/driver issue: the Linux driver should handle such scenarios just as gracefully as the Windows driver. On Windows, breaching the 110-113 degrees Celsius limit simply results in the video driver dropping frames massively while continuing to function at reduced performance (i.e. going from 40-60 fps to 10-15 fps on one of my benchmarks). The system never crashes.

So the bug here, as I understand it, is that the radeonsi driver on Linux does not handle the event where the hardware force-throttles itself.

---------

Contextual notes:

The reason why I only started experiencing this issue in December 2015 (even though I've had the GPU since 2012) is that I changed my PC case then, which means different airflow and cooling behavior...

And the reason why it was so hard to get consistent crashes here is that, while troubleshooting, I sometimes had the case closed and sometimes had it open (when trying a different power supply unit using a "siamese transplant" across another computer, for example). If I keep my case open, the card never reaches the critical temperature, so the issue does not happen. I might get a system "freeze" (possibly saying "*ERROR* si_restrict_performance_levels_before_switch failed") after many hours of torture testing, but the symptoms are different (the screen does not turn off, the image stays on with everything frozen, and nothing else appears in the logs), so I presume that to be a different issue.
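
For anyone trying to reproduce this, here is a minimal sketch of a temperature logger that could run alongside the torture test. It is not part of the original report and rests on several assumptions: that the GPU exposes its sensor through the kernel's hwmon sysfs interface (which is where lm_sensors gets its readings), that the hwmon device is named "radeon" or "amdgpu", and that the sensor file is temp1_input. The function names and the log file name are made up for illustration. The 113 and 511 thresholds are the values reported above.

```python
import glob
import os
import time

CRITICAL_C = 113.0  # critical point reported in this comment
STUCK_C = 511.0     # bogus value lm_sensors showed after the driver died


def parse_millidegrees(text):
    """hwmon temp*_input files hold an integer in millidegrees Celsius."""
    return int(text.strip()) / 1000.0


def classify(temp_c):
    """Label a reading using the thresholds observed in this bug report."""
    if temp_c >= STUCK_C:
        return "bogus"
    if temp_c >= CRITICAL_C:
        return "critical"
    return "ok"


def find_gpu_sensor(driver_names=("radeon", "amdgpu")):
    """Return the temp1_input path of the first matching hwmon device, or None.

    Assumption: the driver registers its hwmon device under one of these names.
    """
    for name_path in sorted(glob.glob("/sys/class/hwmon/hwmon*/name")):
        try:
            with open(name_path) as f:
                name = f.read().strip()
        except OSError:
            continue
        if name in driver_names:
            candidate = os.path.join(os.path.dirname(name_path), "temp1_input")
            if os.path.exists(candidate):
                return candidate
    return None


def log_temperature(sensor_path, log_path, interval_s=1.0, samples=None):
    """Append 'timestamp temperature label' lines; loops forever if samples is None."""
    count = 0
    with open(log_path, "a") as log:
        while samples is None or count < samples:
            with open(sensor_path) as f:
                temp_c = parse_millidegrees(f.read())
            log.write(f"{time.time():.0f} {temp_c:.1f} {classify(temp_c)}\n")
            # Force the line to disk so the last reading survives a kernel oops.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            time.sleep(interval_s)


if __name__ == "__main__":
    path = find_gpu_sensor()
    if path is None:
        print("no radeon/amdgpu hwmon sensor found")
    else:
        with open(path) as f:
            temp_c = parse_millidegrees(f.read())
        print(f"{path}: {temp_c:.1f} C ({classify(temp_c)})")
```

Run as a script, it prints a single reading. Calling log_temperature(path, "gpu_temp.log") from a second terminal or over SSH during the torture test should leave the last reading before the crash on disk, which would help confirm whether the oops lines up with the 113 degree mark on other cards.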