[Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X

Bug ID: 109403
Summary: amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: 1@provod.gl

I've been experiencing random GPU hangs since I upgraded to Threadripper about
a year ago.

Specs:
- Motherboard: ASUS Prime X399-A, all BIOS versions from stock up to the current
0808
- CPU: Threadripper 1950X, 32 threads
- GPU: MSI Radeon RX Vega 64 Air Boost 8G OC (the hangs also happened with an ASUS R9
Fury X in the same machine; that GPU was generally stable in my previous box)
- Displays:
   - 2x DELL U2412M 1920x1200x60 (DP)
   - 1x ASUS MG279Q 2560x1440x144 (DP)
- Kernel versions: 4.20, 5.0-rc2 (has been happening since at least 4.14;
earlier versions weren't tried).
- linux-firmware: 20181218
- Mesa: 18.3.1
- X: 1.20.3
- libdrm: 2.4.96
- Possibly relevant kernel options: amd_iommu=on
vfio-pci.ids=10de:1005,10de:0e1a,1912:0014,1106:3483 iommu=pt
vfio-pci.disable_vga=1 hpet=disable nohpet amdgpu.ppfeaturemask=0xfffd7fff
amdgpu.gpu_recovery=1 pcie_aspm=off
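
To make the amdgpu.ppfeaturemask value above easier to read, here is a minimal
sketch that lists which bits of the 32-bit mask are cleared. The helper name
cleared_bits is hypothetical and not part of any amdgpu tool. The meaning of each
bit position is defined by the kernel's PP_FEATURE_MASK enum and should be looked
up for the kernel version in use; the sketch only does the bit arithmetic.

```python
# Hypothetical helper for illustration; not part of the kernel or any amdgpu tool.
def cleared_bits(mask: int, width: int = 32) -> list[int]:
    """Return the bit positions (0 = least significant) that are 0 in mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    mask = 0xFFFD7FFF  # amdgpu.ppfeaturemask value from the kernel command line
    print(cleared_bits(mask))  # [15, 17]
```

For 0xfffd7fff this reports bits 15 and 17 as cleared, i.e. two powerplay
features are disabled relative to an all-ones mask.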

The problem manifests itself usually like this:
1. Screen suddenly freezes (sometimes it is possible to move the mouse cursor for a
few seconds, but eventually it freezes too)
2. GPU fan speeds up and remains high
3. Every process that talks to the GPU freezes and becomes impossible to kill.
4. I can still SSH into the machine, and everything besides the GPU works OK.
5. dmesg contains a message like this:
                [Jan21 00:03] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=17188686, emitted seq=17188689
                [  +0.000032] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process X pid 9315 thread X:cs0 pid 9335
        or with a bit more stuff happening before:
                [Jan18 19:43] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000003] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010607000 from 27
                [  +0.000002] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0060153D
                [  +0.000005] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010607000 from 27
                [  +0.000002] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225 thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [Jan18 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=40554, emitted seq=40556
                [  +0.000047] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process superposition pid 11225 thread superposit:cs0 pid 11308
6. amdgpu reports near 100% CPU usage and high power draw, even though it was
completely idle before the freeze.
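
To compare hangs across saved logs, the timeout lines from step 5 can be
extracted mechanically. The following is a minimal sketch, assuming the logs are
saved as text with the message format quoted above. The names TIMEOUT_RE and
parse_timeouts are hypothetical, and reading "emitted minus signaled" as the
number of submissions still outstanding at the time of the hang is an
assumption, not something the driver states.

```python
import re

# Matches the message text of the amdgpu job-timeout line quoted in step 5.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeouts(log_text: str) -> list[dict]:
    """Extract ring name and fence sequence numbers from amdgpu timeout lines."""
    # Mail clients wrap long dmesg lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    results = []
    for match in TIMEOUT_RE.finditer(flat):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        results.append({
            "ring": match["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: the gap is the count of submissions not yet completed.
            "pending": emitted - signaled,
        })
    return results


if __name__ == "__main__":
    sample = (
        "[Jan21 00:03] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring\n"
        "gfx timeout, signaled seq=17188686, emitted seq=17188689\n"
    )
    for entry in parse_timeouts(sample):
        print(entry)
```

Collapsing whitespace before matching lets the same function handle both raw
dmesg output and copies that were re-wrapped by a mail client.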

If I enable amdgpu.gpu_recovery, then it tries to reset the GPU but fails most
of the time:
                [  +0.000005] amdgpu 0000:44:00.0: GPU reset begin!
                [ +10.230091] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-2] hw_done or flip_done timed out
                (there are no further logs)
        (I've seen it successfully reset the GPU only *once*, and that obviously
required an X restart)

These freezes happen pretty much randomly:
- Sometimes the GPU remains stable for weeks
- It will generally remain stable while just playing games or running
benchmarks like Unigine Superposition for many hours
- There have been a couple of freezes when just watching YouTube in Firefox
and not doing anything else
- It will sometimes freeze with GPU being completely idle (but outputs on),
while CPU is at 100%
- It will sometimes freeze when opening shadertoy shaders. Not specific ones,
just randomly.
- It will likely freeze within 1-2 hours of streaming using OBS:
                - XSHM is used to grab 2560x1440 screen at 60fps
                - image downscaled to 1080p60 using whatever OBS does
                - a bunch of minor stuff added to the frame
                - software encoding using x264 medium preset resulting in
10-30% CPU load
        - It can freeze both when doing live shader programming (and GPU is at
100% with heavy pathtracing compute), and when just editing text in vim.
        - It is still pretty random: sometimes it remains stable through a week of
almost daily 2-4 hour streams, but on some days it freezes 2-3
times in one evening.

This would suggest a hardware issue, but strangely enough I have never
experienced this problem on Windows using the same PC. This also prevents me
from filing an RMA, because there's no plausible way to reproduce the issue.

Other hardware is stable:
- CPU being 100% busy compiling some huge C++ codebases for hours remains
stable
- many-hours memtest doesn't show any errors
- there's also an NVidia GPU installed in this machine that is being passed
through to Windows running under qemu. This GPU is also stable under any load.
        - although it was throwing PCI AER errors into dmesg (without any other
symptoms). This is believed to be a benign X399 issue, and is suppressed using
the pcie_aspm=off kernel parameter
- Loading the entire system to 100% (simultaneously running GPU benchmarks on
host and VM, and also compiling something on CPU) generally doesn't trigger the
issue. Adding OBS to that likely does.
- Three different PSUs were used on this system, no behaviour difference.

Other things:
- Power management on Linux differs significantly from that on Windows.
        - on Windows idle means idle: all clocks and voltages are as low as pp
allows, power draw is ~20W
        - on Linux even at idle (nothing is feeding the GPU any work) sclk stays at
level 3 (1138MHz 1000mV) and mclk at level 3 (max, 945MHz 1100mV), power draw is
40W
- I am unable to dump BIOS of this card properly on Linux:
        - Both /sys/kernel/debug/dri/0/amdgpu_vbios and
/sys/class/drm/card0/device/rom are truncated at 60928
        - Contents are different from what I could dump on Windows, e.g:
                @@ -1,6 +1,6 @@
                -00000000: 55aa 77e9 eb02 0000 0000 0000 0000 0000  U.w.............
                -00000010: 0000 0000 0000 0000 9c02 0000 0000 4942  ..............IB
                -00000020: 4d9d ac8a 0000 0000 0000 0000 0000 0004  M...............
                +00000000: 55aa 77e9 eb02 0000 00c0 0000 0000 0000  U.w.............
                +00000010: 0000 0000 0044 0000 9c02 0000 0000 4942  .....D........IB
                +00000020: 4d43 ac8a 0000 0000 0000 0000 0000 0004  MC..............
                 00000030: 2037 3631 3239 3535 3230 0000 0000 0000   761295520......
                 00000040: 0000 0000 0000 0000 7402 0000 0000 0000  ........t.......
                 00000050: 3132 2f31 322f 3137 2030 313a 3237 0000  12/12/17 01:27..
                @@ -38,13 +38,13 @@
                 00000250: 315f 4d42 415f 4131 5f48 424d 5f38 4742  1_MBA_A1_HBM_8GB
                 00000260: 5f56 3336 3831 305c 636f 6e66 6967 2e68  _V36810\config.h
                 00000270: 0000 0090 2800 0202 4154 4f4d 00c0 eb03  ....(...ATOM....
                -00000280: 1802 c102 6c01 1e04 0000 0000 6214 8036  ....l.......b..6
                +00000280: 1802 c102 6c01 1e04 0000 0030 6214 8036  ....l......0b..6
- Under/over-volting doesn't work: any change, however insignificant, to any of
the default voltages results in severe throttling, see
https://github.com/RadeonOpenCompute/ROCm/issues/681
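
The VBIOS comparison above was done with a textual diff of hex dumps. A minimal
sketch of the same check done directly on the bytes follows; the function names
are hypothetical, and the file paths in the usage note are placeholders rather
than the files actually used. It reports only offsets within the common prefix,
plus both lengths, so a truncated dump is visible at a glance.

```python
def differing_offsets(a: bytes, b: bytes) -> list[int]:
    """Offsets, within the common prefix of a and b, where the bytes differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def compare_dumps(path_a: str, path_b: str) -> tuple[int, int, list[int]]:
    """Return (len_a, len_b, differing offsets) for two ROM dump files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return len(a), len(b), differing_offsets(a, b)


if __name__ == "__main__":
    # First 16 bytes of each side of the hex diff above.
    one = bytes.fromhex("55aa77e9eb0200000000000000000000")
    other = bytes.fromhex("55aa77e9eb02000000c0000000000000")
    print([hex(off) for off in differing_offsets(one, other)])  # ['0x9']
```

With real files this would be called as compare_dumps("linux.rom",
"windows.rom"), where both names are placeholders.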

Is there anything else I could try?
Is there a way to collect more info?

Links to possibly (at least superficially) similar problems:
- https://bugs.freedesktop.org/show_bug.cgi?id=105733
- https://bugs.freedesktop.org/show_bug.cgi?id=105819
- https://bugs.freedesktop.org/show_bug.cgi?id=109022
- https://bugs.freedesktop.org/show_bug.cgi?id=105251
- https://bugs.freedesktop.org/show_bug.cgi?id=108493
- https://github.com/RadeonOpenCompute/ROCm/issues/348

