[amdgpu] Errors with amdgpu-pro 17.50 running on GX-424CC SOC

michel@xxxxxxxxxxx (Michel Dänzer) · Sun, 11 Mar 2018 16:20:30 +0100

On 2018-03-09 12:45 PM, Will Wagner wrote:
> Apologies if this is not the right list for this question. Kernel
> MAINTAINERS file suggests it is but please let me know if I should
> repost elsewhere.

The amd-gfx list is better, moving there.

> I have a custom OpenCL application running under Ubuntu 16.04.4, HWE
> Kernel 4.13 and amdgpu-pro 17.50 drivers. This is running on a Fujitsu
> D3313-S6 industrial mainboard
> (http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html)
> 
> 
> After a period of running - from 5 minutes to 48 hours we begin to see
> these kernel traces. At some point after seeing these errors the
> application fails.
> 
> [   99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
> [   99.355041] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  0x00103042
> [   99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
> [   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
> [  100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
> [  100.443811] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  0x00103042
> [  100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
> [  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
> 
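
The numbers in this trace can be cross-checked against one another.
The Python sketch below is a reading aid, not driver code. The bit
positions are an assumption, taken from the
VM_CONTEXT1_PROTECTION_FAULT_STATUS field layout used by the
gmc_v7/gmc_v8 code in the kernel, and the 4 KiB GPU page size is
likewise assumed. The name decode_vm_fault is made up for this example.

```python
def decode_vm_fault(addr_reg: int, status_reg: int, client_reg: int) -> dict:
    """Split the three values printed with a VM fault into named fields.

    Assumed status register layout (not stated in the original post):
    bits 0-7 protections, bits 12-19 memory client id, bit 24 read/write,
    bits 25-28 VMID.
    """
    # The client value is a four-character ASCII tag, NUL-padded on the right.
    client = client_reg.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")
    return {
        "page": addr_reg,                # the address register holds a page number
        "byte_address": addr_reg << 12,  # assumes 4 KiB GPU pages
        "protections": status_reg & 0xFF,
        "client_id": (status_reg >> 12) & 0xFF,
        "access": "write" if (status_reg >> 24) & 1 else "read",
        "vmid": (status_reg >> 25) & 0xF,
        "client": client,
    }


if __name__ == "__main__":
    fields = decode_vm_fault(0x00103042, 0x09020014, 0x54433000)
    for name, value in fields.items():
        print(f"{name:>13}: {value}")
```

Fed the values from the trace, this gives page 1060930, protections
0x14, VMID 4, a write access, client 'TC0' and client id 32, which
agrees with the "VM fault" summary line the driver printed.
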
> I know from searching the web that this error can appear if there are
> errors in the OpenCL program. However, we have run the exact same program
> on multiple other hardware configurations and have not seen problems. On
> Linux we have had success on every machine tested with a discrete
> AMD GPU, just not on the GX-424CC APU. On Windows we have had the code
> running on a large number of platforms, including the GX-424CC, without
> issues.
> 
> I'm prepared to believe we have an error in our OpenCL code, but have no
> clue where to start looking. What does the error actually mean and why
> does it happen? Is it to do with buffer transfers between host and
> device? Does it happen during execution of a kernel?
> 
> Whilst attempting to investigate the problem I have tried a number of
> kernel arguments for the driver. If I reduce the amount of memory
> assigned to VRAM with vramlimit=64, the error appears to take longer
> to occur.
> 
> If I run it with the arguments vm_debug=1 vm_fault_stop=1 the error no
> longer appears. I would have expected it to occur at least once, given
> vm_fault_stop=1, but it does not. Instead I occasionally get this
> error:
> 
> [ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)
> 
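
When experimenting with module options like these, it can help to
confirm which values the loaded driver is actually using. A minimal
Python sketch follows. It assumes the amdgpu module exposes its
parameters under /sys/module/amdgpu/parameters (the usual sysfs
location for readable module parameters) and that the process is
allowed to read them, which may require root for some entries. The
name read_module_params is invented for this example.

```python
from pathlib import Path

# Assumed sysfs location; readable module parameters normally appear here.
PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_module_params(names, param_dir=PARAM_DIR):
    """Return {name: value} for each requested parameter.

    A parameter that is missing or unreadable maps to None.
    """
    values = {}
    for name in names:
        try:
            values[name] = (Path(param_dir) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = ["vramlimit", "vm_debug", "vm_fault_stop"]
    for name, value in read_module_params(wanted).items():
        print(f"{name} = {value}")
```

A value of None for an option that was set on the kernel command line
suggests either that the file could not be read or that the option did
not reach the module.
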
> 
> So is this a bug in the driver or in the OpenCL code? How can I make
> progress debugging this issue?
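
Independent of where the bug turns out to be, one cheap step is to
check whether the faults always hit the same page or wander. In the
excerpt above both faults are writes to page 1060930 from 'TC0'. The
Python sketch below tallies the "VM fault" summary lines from a saved
kernel log. The regular expression is written against the line format
shown in this mail and is an assumption for other kernel versions;
summarize_faults is a hypothetical helper name.

```python
import re
import sys
from collections import Counter

# Matches the summary line as it appears in the trace quoted above.
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)'"
)


def summarize_faults(log_text: str) -> Counter:
    """Count VM faults per (page, vmid, access, client) combination."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        _protections, vmid, page, access, client = match.groups()
        counts[(int(page), int(vmid), access, client)] += 1
    return counts


if __name__ == "__main__":
    # Example use: dmesg | python3 summarize_faults.py
    for key, count in summarize_faults(sys.stdin.read()).most_common():
        page, vmid, access, client = key
        print(f"{count:6d}  page {page} (0x{page:x})  vmid {vmid}  {access}  {client}")
```

The output has one row per distinct combination, most frequent first,
so a single row means every fault hit the same page with the same
access type and client.
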
> 
> Thanks
> Will
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer