Comment # 6
on bug 105733
from Allan
TL;DR: I have no idea what is happening. The errors aren't clear, I haven't found a reliable way to reproduce the problem, and I need help.

That's exactly the problem... I'm going crazy over this. I've been trying to understand what is happening for weeks. So here is a brief (long) description:

I had been running an RX 580. The system sometimes froze like this, and I started to suspect the card was faulty. Then I got an RX 480 and planned to sell the RX 580. I compiled a kernel with the polaris binaries and so on. It went very well until a system upgrade. Then "here we go again": the same problems, and now the RX 480 seems to fail twice as fast as the RX 580.

If you are asking yourself "what kind of failures?", I'll summarize: code 147, code 146, chrome_dthread libxul.so (for both firefox and chromium), and a big call trace about amdgpu being blocked for more than 120 seconds. All of this shows up after the screen has frozen; the keyboard and mouse clicks are ignored, and the only thing that still works is moving the mouse cursor.

When does it happen? After a few minutes of running youtube or unigine valley, or after some random time (from minutes to several hours) running an opencl task, for example.

Then I started to think about the other components...
- RAM ? Checked and running... if the screen hangs, some ssh tests run fine.
- CPU ? Never had a problem with it as far as I remember. Ssh tests run fine.
- MOBO ? I really don't know. Here's why:
---- I had been getting some sound crackling, which suggests that something could be wrong with power management.
---- I noticed that disabling IOMMU decreased the number of crashes significantly... but unfortunately, after updating the BIOS/EFI, the option to enable/disable it was simply removed... I'll be contacting the manufacturer. So I can't confirm that it was the cause.
---- I started to think that something nasty was going on with the power supply.
- POWER SUPPLY ?
I bet that it is not.
---- I have a 5-year-old Aerocool 80 Plus Silver 800W power supply. It has always been a very good PSU... it ran an HD7970GHz (290W TDP) most of the time without a single problem.
---- But okay... maybe the capacitors were faulty (as the mobo manufacturer said when I asked about the sound). So I bought an AX860i. And if there is any better PSU than this in the 800W range, I'd like to know. It is 80 Plus Platinum certified... even though the certification system has gone unverified for years (which makes it almost irrelevant, to be honest). I had a Corsair HX600 before and it was outstanding... an AX is better than an HX, so only a Titanium unit that costs more than my mobo and CPU together would be better.
---- Guess what? The same problems. Actually, now it sometimes shuts down.
- KERNEL ? I thought the problem was 4.15, because it is about 5x more likely to fail. But it also occurs with the very stable 4.13. Maybe I'll try other kernels... but the further back we go in kernel versions, the fewer amdgpu features we get, AFAIK.
---- Also, with the RX 480 the video output started to fail when I configure the DisplayPort output for 144Hz. My screen can handle 160Hz with adaptive sync, but that has never worked with amdgpu.
---- DisplayPort/HDMI sound with DC/DAL support in 4.15 is a myth and NEVER works. If I set amdgpu.dc=1 with the RX 580, there is simply no sound, and with the RX 480 the system hangs when starting pavucontrol. When forcing the output to HDMI/DP, there is no sound either way (but pavucontrol shows that something was supposed to be happening).
---- While running in a tty, the chance of crashing is very low. But it still happens when running an opencl application, after some random time as said before.
---- When using RX 580+1070 or RX 480+1070 for vfio, I noticed that unbinding the nvidia card extended the working time before a crash.
(That was also one reason I thought the PSU was faulty.)

Now the "best" part: running a single GPU leads to the same problems... :/

I'm not sure about anything right now. I'll run only the 1070 for some time to confirm that amdgpu is the only problem here. I have never touched the amdgpu code, but it seems to me that either I sell the cards or I fix it by hand, because I'm not finding anything related.
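
For anyone trying to collect comparable traces, here is a minimal sketch (not something used in this report) of pulling the "blocked for more than 120 seconds" reports out of a kernel log saved over ssh. It assumes the lines follow the kernel's usual hung-task format, `INFO: task NAME:PID blocked for more than N seconds.`; the function name `find_hung_tasks` is hypothetical.

```python
import re

# Assumed format of the kernel's hung-task report line:
#   INFO: task NAME:PID blocked for more than N seconds.
# NAME may itself contain ':' (e.g. kworker names), so the greedy match
# keeps everything up to the last ':' that is followed by the numeric PID.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task name, pid, seconds) for each hung-task line."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"),
                 int(match.group("pid")),
                 int(match.group("secs")))
            )
    return hits
```

It could be called on a saved log, for example `find_hung_tasks(open("dmesg.txt").read())`, where `dmesg.txt` is a placeholder for wherever the output of `dmesg` was stored. The call trace that follows each matching line still has to be read by hand.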