On 2018-03-09 12:45 PM, Will Wagner wrote:
> Apologies if this is not the right list for this question. Kernel
> MAINTAINERS file suggests it is but please let me know if I should
> repost elsewhere.

The amd-gfx list is better, moving there.

> I have a custom OpenCL application running under Ubuntu 16.04.04, HWE
> Kernel 4.13 and amdgpu-pro 17.50 drivers. This is running on a Fujitsu
> D3313-S6 industrial mainboard
> (http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html)
>
>
> After a period of running - from 5 minutes to 48 hours we begin to see
> these kernel traces. At some point after seeing these errors the
> application fails.
>
> [  99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
> [  99.355041] amdgpu 0000:00:01.0:  VM_CONTEXT1_PROTECTION_FAULT_ADDR  0x00103042
> [  99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
> [  99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
> [ 100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
> [ 100.443811] amdgpu 0000:00:01.0:  VM_CONTEXT1_PROTECTION_FAULT_ADDR  0x00103042
> [ 100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
> [ 100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
>
> I know from searching the web that this error can appear if there are
> errors in the opencl program. However we have run the exact same program
> on multiple other hardware configurations and have not seen problems. On
> linux we have had success running on all machines tested with a discrete
> amd gpu, just not on the GX-424CC apu. On windows we have had the code
> running on a large number of platforms including the GX-424CC without
> issues.
>
> I'm prepared to believe we have an error in our opencl code, but have no
> clue where to start looking. What does the error actually mean and why
> does it happen? Is it to do with buffer transfers between host and
> device? During execution of a kernel?
>
> Whilst attempting to investigate the problem I have tried a number of
> kernel arguments for the driver. If I reduce the amount of memory
> assigned to vram with vramlimit=64 then it appears to take longer for
> the error to occur.
>
> If I run it with the arguments vm_debug=1 vm_fault_stop=1 the error no
> longer appears. I would have expected it to occur at least once due to
> vm_fault_stop=1 but it does not. However instead I get this error
> occasionally:
>
> [ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)
>
>
> So is this a bug in the driver or in the opencl code? How can I progress
> debugging this issue?
>
> Thanks
> Will
>
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
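
For anyone reading the quoted fault lines: the numbers in the "VM fault" summary line appear to be a decoded view of the two register dumps printed just before it. The sketch below is only a cross-check of that relationship. The bit positions, the 4 KiB page size and the helper name decode_fault are assumptions chosen to reproduce the logged values; they are not taken from the message or from driver documentation.

```python
# Cross-check of the quoted amdgpu VM fault lines.
# ASSUMPTION: the bit positions below were chosen so that the decoded
# fields match the values printed in the log; treat them as illustrative.

GPU_PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB GPU pages


def decode_fault(addr: int, status: int, mc_client: int) -> dict:
    """Split the raw register values into the fields of the summary line."""
    protections = status & 0xFF            # assumed bits 0-7
    mc_id = (status >> 12) & 0xFF          # assumed bits 12-19
    is_write = bool((status >> 24) & 0x1)  # assumed bit 24
    vmid = (status >> 25) & 0xF            # assumed bits 25-28
    # The client name is four ASCII bytes packed big-endian, NUL padded.
    block = mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")
    return {
        "protections": protections,
        "vmid": vmid,
        "page": addr,                      # FAULT_ADDR is a page number
        "byte_address": addr * GPU_PAGE_SIZE,
        "access": "write" if is_write else "read",
        "block": block,
        "mc_id": mc_id,
    }


# Values copied from the quoted log.
fault = decode_fault(addr=0x00103042, status=0x09020014, mc_client=0x54433000)

print(
    "VM fault (0x%02x, vmid %d) at page %d, %s from '%s' (0x%08x) (%d)"
    % (
        fault["protections"],
        fault["vmid"],
        fault["page"],
        fault["access"],
        fault["block"],
        0x54433000,
        fault["mc_id"],
    )
)
```

Read this way, page 1060930 is simply 0x00103042 in decimal, and 'TC0' is the ASCII spelling of 0x54433000. Under the 4 KiB page assumption the faulting write would target GPU virtual address 0x103042000 in VMID 4.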