[AMD Public Use]
Fixing the security tag...
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Bridgman, John <John.Bridgman@xxxxxxx>
Sent: March 8, 2020 3:10 PM
To: Clemens Eisserer <linuxhippy@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options for decoding.
In MCE-Ryzen-Decoder docco the example is exactly the error you are seeing, with the same output, so guessing that is what you are using:
On the other hand I found a report on AMD forums where the same error is decoded by mcelog as a generic error in a memory transaction, which seems to make more sense.
For something as simple as the GPU bus interface not responding to an access by the CPU I think you would get a different error (bus error) but not 100% sure about that.
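For reference, the generic fields of the status value from the log quoted below (bea0000000000108) can be unpacked by hand. This is a minimal Python sketch, assuming the architectural MCA status bit layout as named by the Linux kernel's MCI_STATUS_* constants and the usual compound error-code patterns for TLB, memory and bus errors. The function names are made up for illustration, the extended-error-code position (bits 21:16) is an assumption taken from how the Linux AMD decoder appears to read it, and none of this replaces mcelog or MCE-Ryzen-Decoder:

```python
# Sketch only: bit positions are assumptions based on the architectural
# MCA status layout (Linux names them MCI_STATUS_*); verify against the
# processor documentation before relying on them.
STATUS_FLAGS = {
    63: "VAL (entry valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
    55: "TCC (task context corrupt, AMD)",
    53: "SYNDV (syndrome valid, AMD)",
}


def status_flags(status):
    """Return the names of the flag bits set in an MCA status value."""
    return [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def error_class(status):
    """Classify the low 16-bit error code as bus, memory, tlb or other."""
    code = status & 0xFFFF
    if (code & 0xF800) == 0x0800:   # 0000 1PPT RRRR IILL
        return "bus"
    if (code & 0xFF00) == 0x0100:   # 0000 0001 RRRR TTLL
        return "memory"
    if (code & 0xFFF0) == 0x0010:   # 0000 0000 0001 TTLL
        return "tlb"
    return "other"


def memory_fields(status):
    """Split a memory-hierarchy error code into (RRRR, TT, LL)."""
    code = status & 0xFFFF
    return (code >> 4) & 0xF, (code >> 2) & 0x3, code & 0x3


def extended_code(status):
    """Extended error code, assumed to live in bits 21:16."""
    return (status >> 16) & 0x3F


STATUS = 0xBEA0000000000108  # value from the MCE log in this thread

for flag in status_flags(STATUS):
    print(flag)
print("class:", error_class(STATUS))                 # class: memory
print("RRRR, TT, LL:", memory_fields(STATUS))        # (0, 2, 0)
print("extended code:", extended_code(STATUS))       # 0
```

For this value the low 16 bits (0x0108) match the memory-hierarchy pattern rather than the bus-error pattern, with a generic request type (RRRR = 0) and a generic transaction type (TT = 2), which lines up with the mcelog reading mentioned above. What the extended code 0 means depends on the bank-specific tables, which this sketch does not include.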
My first thought would be to see if your mobo BIOS has an option to force PCIE gen3 instead of 4 and see if that makes a difference. There are some amdgpu module parms related to PCIE as well but I'm not sure which ones to recommend.
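Before changing anything it may help to see what the system currently reports. The following is a rough Python sketch, assuming a typical Linux sysfs layout (/sys/module/amdgpu/parameters for module parameters, and the class, current_link_speed and max_link_speed files under /sys/bus/pci/devices). The helper names are hypothetical, and the script only reads values; it does not suggest which parameter to set:

```python
from pathlib import Path


def module_params(module="amdgpu", keyword="pcie", sysfs_root="/sys"):
    """Return {name: value} for module parameters whose name contains keyword.

    Assumes the module is loaded and exposes its parameters in sysfs;
    returns an empty dict otherwise.
    """
    params_dir = Path(sysfs_root) / "module" / module / "parameters"
    result = {}
    if not params_dir.is_dir():
        return result
    for entry in sorted(params_dir.iterdir()):
        if keyword in entry.name:
            try:
                result[entry.name] = entry.read_text().strip()
            except OSError:
                result[entry.name] = None  # not readable by this user
    return result


def pci_link_speeds(sysfs_root="/sys", class_prefix="0x03"):
    """Return {pci_address: (current, maximum)} for display-class devices.

    PCI class codes starting with 0x03 are display controllers.
    """
    devices_dir = Path(sysfs_root) / "bus" / "pci" / "devices"
    result = {}
    if not devices_dir.is_dir():
        return result
    for dev in sorted(devices_dir.iterdir()):
        try:
            if not (dev / "class").read_text().strip().startswith(class_prefix):
                continue
            current = (dev / "current_link_speed").read_text().strip()
            maximum = (dev / "max_link_speed").read_text().strip()
        except OSError:
            continue  # device has no link information exposed
        result[dev.name] = (current, maximum)
    return result


if __name__ == "__main__":
    for name, value in module_params().items():
        print(f"amdgpu.{name} = {value}")
    for addr, (current, maximum) in pci_link_speeds().items():
        print(f"{addr}: current {current}, max {maximum}")
```

Comparing the reported link speed before and after a BIOS change is one way to confirm that the setting actually took effect.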
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Bridgman, John <John.Bridgman@xxxxxxx>
Sent: March 8, 2020 2:45 PM
To: Clemens Eisserer <linuxhippy@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
[AMD Official Use Only - Internal Distribution Only]
The decoded MCE info doesn't look right... if the last bit is a zero I believe that means the watchdog timer is not enabled.
That said, I'm not sure how the decoder you found works, but it seems like a bit more information would be required than what you passed in. Can you point me to the program you used?
Thanks,
John
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Clemens Eisserer <linuxhippy@xxxxxxxxx>
Sent: March 8, 2020 9:06 AM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Subject: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

Hi there,
Right after Ryzen 3xxx became available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows 10 (no reboot/BSOD in months) and runs memtest86 (single/multicore) as well as various load tests for hours without errors. However, running Linux I get a spontaneous reboot every now and then (2-3x a week), always with the same machine check exception logged:

[ 0.105003] .... node #0, CPUs: #1 #2
[ 0.107022] mce: [Hardware Error]: Machine check events logged
[ 0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[ 0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6, disabling MWAIT use for task switching, etc., without success. I tried two times to contact AMD support, only asking them to please decode the MCE hex value, but as soon as they read the term "linux" they basically abort any communication. And to be honest, I had the impression that they did not actually know what an MCE is in the first place.

Luckily I found a decoder on github which prints:

Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)

I was rather hopeless until I found the following reddit thread:
https://nam11.safelinks.protection.outlook.com/?url=

The users there claim to experience exactly the same problem (even with the same MCE code logged) but were using R600-based graphics cards; one of them is even using the same mainboard. When he swapped his R600 card for a new RX5700, the problems vanished.
I don't have the luxury to simply try another GPU (my RX570 is the only one properly driving my 4k@60Hz panel), however the whole observation makes me wonder. How can a GPU be responsible for low-level errors such as the machine check exception in the execution units mentioned above? Could DMA transfers gone bad be the culprit? Are there any "safe mode" options available I could try regarding amdgpu (I tried disabling low-power states, but this didn't help and only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url=
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx