[amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Mon, 20 Jun 2016 11:09:18 +0200

On 20.06.2016 10:24, Mads wrote:
> On 2016-06-18 14:30, Nicolai Hähnle wrote:
>
>> The second approach is to correlate the VM ID in
>>
>>> dmesg:
>>> [   78.873577] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
>>> [   78.873590] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010151C
>>> [   78.873592] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D0B7014
>>> [   78.873595] VM fault (0x14, vmid 6) at page 1053980, write from 'SDM0' (0x53444d30) (183)
>>
>> with the running processes. This can be done via tracing. As root:
>>
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> cat /sys/kernel/debug/tracing/trace_pipe
>>
>> You'll get *lots* of output of the form
>>
>>           compiz-2065  [000] .... 14927.891778: amdgpu_cs_ioctl:
>> adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
>> ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx,
>> num_ibs:1
>>           compiz-2065  [000] .... 14927.891782: amd_sched_job:
>> entity=ffff88023258f030, sched job=ffff880110dab2a0,
>> fence=ffff880068509b80, ring=gfx, job count:0, hw job count:0
>>              gfx-172   [002] .... 14927.891802: amdgpu_sched_run_job:
>> adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
>> ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx,
>> num_ibs:1
>>              gfx-172   [002] .... 14927.891809: amdgpu_vm_grab_id:
>> vmid=5, ring=0
>>
>> In this particular case, compiz submitted a CS (command stream), which
>> was then asynchronously sent and processed on the gfx ring with vmid=5.
>>
>> The idea is to correlate the timestamps with those of the VM fault to
>> see which process is at fault. If you do this, please send a bit more
>> log context in attachments, because asynchronous execution can
>> occasionally make the logs difficult to interpret.
>>
>
> I made this script:
>
>> #!/bin/bash
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> cat /sys/kernel/debug/tracing/trace_pipe >> carrizo.log &
>> catpid=$!
>> sudo -u htpc XAUTHORITY=/home/htpc/.Xauthority DISPLAY=:0 dolphin &
>> dolphinpid=$!
>> sleep 3
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 0 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> kill $catpid
>> kill $dolphinpid
>
> Attaching the tracelog and dmesg, hope you can make sense of it :)

Thanks for the effort. The apitrace of Dolphin is indeed "useless" -- 
seems like OpenGL is loaded, but in the end the app decides not to use 
it. Instead, it looks like the VM faults are coming from the X server.
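
For reference, the timestamp/VMID correlation described above can be
roughed out with a small shell helper. This is only a sketch: the
function name vmid_submitters is made up for illustration, and it
assumes that every trace event sits on a single line (as trace_pipe
writes it, before any mail wrapping), that amdgpu_vm_grab_id directly
follows the amdgpu_sched_run_job it belongs to, and that process names
contain no spaces. Asynchronous execution can break the second
assumption, so treat the output as a hint rather than proof.

```shell
#!/bin/bash
# Hypothetical helper: print "<timestamp> <process-pid>" for every
# amdgpu_vm_grab_id event that handed out the given VMID.
# Usage: vmid_submitters <vmid> <tracelog>
vmid_submitters() {
    local vmid="$1" log="$2"
    awk -v vmid="$vmid" '
        # Return the hex value of sched_job=... on the current line.
        function job() {
            if (match($0, /sched_job=[0-9a-f]+/))
                return substr($0, RSTART + 10, RLENGTH - 10)
            return ""
        }
        # Remember which process submitted each job.
        / amdgpu_cs_ioctl: /      { submitter[job()] = $1 }
        # Remember the job most recently started on a ring.
        / amdgpu_sched_run_job: / { last_job = job() }
        # Assumption: this grab_id belongs to the job seen just before.
        / amdgpu_vm_grab_id: / {
            if (($0 ~ ("vmid=" vmid ",")) && (last_job in submitter)) {
                ts = $4
                sub(/:$/, "", ts)
                print ts, submitter[last_job]
            }
        }
    ' "$log"
}
```

Something like "vmid_submitters 6 carrizo.log" would then list the
submitters to compare against the dmesg timestamps of the vmid 6 faults.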

Can you make sure that the X server loads the debug build of 
radeonsi_dri.so with assertions enabled?

I wonder if it's possible to get an apitrace from the X server. Perhaps 
you can reproduce the problem with Xephyr? If that also shows the VM 
faults, tracing Xephyr would probably be the easiest way to get one.

Nicolai

>
> - Mads