On 20.06.2016 10:24, Mads wrote:
> On 2016-06-18 14:30, Nicolai Hähnle wrote:
>
>> The second approach is to correlate the VM ID in
>>
>>> dmesg:
>>> [ 78.873577] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
>>> [ 78.873590] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010151C
>>> [ 78.873592] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D0B7014
>>> [ 78.873595] VM fault (0x14, vmid 6) at page 1053980, write from 'SDM0' (0x53444d30) (183)
>>
>> with the running processes. This can be done via tracing. As root:
>>
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> cat /sys/kernel/debug/tracing/trace_pipe
>>
>> You'll get *lots* of output of the form
>>
>> compiz-2065 [000] .... 14927.891778: amdgpu_cs_ioctl:
>> adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
>> ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx,
>> num_ibs:1
>> compiz-2065 [000] .... 14927.891782: amd_sched_job:
>> entity=ffff88023258f030, sched job=ffff880110dab2a0,
>> fence=ffff880068509b80, ring=gfx, job count:0, hw job count:0
>> gfx-172 [002] .... 14927.891802: amdgpu_sched_run_job:
>> adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
>> ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx,
>> num_ibs:1
>> gfx-172 [002] .... 14927.891809: amdgpu_vm_grab_id:
>> vmid=5, ring=0
>>
>> In this particular case, compiz submitted a CS (command stream), which
>> was then asynchronously sent and processed on the gfx ring with vmid=5.
>>
>> The idea is to correlate the timestamps with those of the VM fault to
>> see which process is at fault. If you do this, please send a bit more
>> log context in attachments, because asynchronous execution can
>> occasionally make the logs difficult to interpret.
>>
>
> I made this script:
>
>> #!/bin/bash
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> cat /sys/kernel/debug/tracing/trace_pipe >> carrizo.log &
>> catpid=$!
>> sudo -u htpc XAUTHORITY=/home/htpc/.Xauthority DISPLAY=:0 dolphin &
>> dolphinpid=$!
>> sleep 3
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
>> echo 0 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
>> echo 0 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
>> kill $catpid
>> kill $dolphinpid
>
> Attaching the tracelog and dmesg, hope you can make sense of it :)

Thanks for the effort. The apitrace of Dolphin is indeed "useless" --
seems like OpenGL is loaded, but in the end the app decides not to use
it.

Instead, it looks like the VM faults are coming from the X server. Can
you make sure that the X server loads the debug build of radeonsi_dri.so
with assertions enabled?

I wonder if it's possible to get an apitrace from the X server. Perhaps
you can reproduce the problem with Xephyr? If that also shows the VM
faults, it would probably be easiest.

Nicolai

>
> - Mads
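
For picking the relevant entries out of a saved log such as carrizo.log, something along the following lines might help. It is only a rough sketch, not part of any existing tool: vmid_owners is a hypothetical helper name, and it assumes that each trace event sits on a single line in the log and that amdgpu_vm_grab_id is emitted by the same kernel thread right after the matching amdgpu_sched_run_job, as in the example output quoted above. If the ordering differs on a given kernel, the mapping will be wrong, so the surrounding log context is still worth reading by hand.

```shell
#!/bin/bash
# Sketch: for a given vmid, list the timestamp of each amdgpu_vm_grab_id event
# and the process whose amdgpu_cs_ioctl submitted the job being run.
# Hypothetical helper; usage: vmid_owners <vmid> <logfile>
vmid_owners() {
    awk -v want="$1" '
    {
        # Task name: everything before the "[cpu]" column.
        task = $0
        sub(/ +\[[0-9]+\].*/, "", task)
        sub(/^ +/, "", task)
        # Timestamp: first "seconds.fraction: " token on the line.
        ts = ""
        if (match($0, /[0-9]+\.[0-9]+: /))
            ts = substr($0, RSTART, RLENGTH - 2)
        # Job pointer, present on cs_ioctl and sched_run_job events.
        job = ""
        if (match($0, /sched_job=[0-9a-f]+/))
            job = substr($0, RSTART + 10, RLENGTH - 10)
    }
    / amdgpu_cs_ioctl: /      { submitter[job] = task }  # who submitted the job
    / amdgpu_sched_run_job: / { running[task] = job }    # last job run per thread
    / amdgpu_vm_grab_id: / {
        if (match($0, /vmid=[0-9]+/)) {
            id = substr($0, RSTART + 5, RLENGTH - 5)
            if (id == want && (task in running)) {
                j = running[task]
                print ts, "vmid=" id, ((j in submitter) ? submitter[j] : "unknown")
            }
        }
    }' "$2"
}
```

Run against the four example events above (one event per line), `vmid_owners 5 carrizo.log` would print `14927.891809 vmid=5 compiz-2065`. The timestamps can then be compared against the "VM fault (0x14, vmid 6)" line in dmesg by calling it with 6 instead.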