Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"

Julien Isorce <jisorce@xxxxxxxxxx> · Thu, 23 Mar 2017 09:26:24 +0000

Hi Michel,
When it happens, the main thread of our GL-based app is stuck in an ioctl(RADEON_CS). I set RADEON_THREAD=false to ease debugging, but the same thing happens with true. The only other threads are si_shader:0,1,2,3, and they are idle, just waiting for jobs. I can also run sudo gdb -p $(pidof Xorg) to block the X11 server, to make sure there is no ping-pong between two processes. No other process loads dri/radeonsi_dri.so . Adding a few traces shows that the above ioctl call loops forever at https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819 and is issued from Mesa at https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454 .

After adding even more traces I can see that the bo, which is being indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
And it gets 3 potential placements after calling "radeon_evict_flags". 
 1: VRAM cpu inaccessible, fpfn is 65536
 2: VRAM cpu accessible, fpfn is 0
 3: GTT, fpfn is 0
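
For scale, the fpfn of 65536 in the first placement can be converted to a byte offset. This is a minimal sketch, assuming 4 KiB pages; PAGE_SIZE and pfn_to_bytes are illustrative names, not kernel symbols.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pfn_to_bytes(pfn, page_size=PAGE_SIZE):
    """Convert a page frame number to a byte offset."""
    return pfn * page_size


offset = pfn_to_bytes(65536)
print(offset // (1024 * 1024), "MiB")  # prints "256 MiB"
```

Under that assumption, the CPU-inaccessible placement starts 256 MiB into VRAM.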

And it looks like the move to the second placement succeeds every time. So I might be wrong, but it looks like it is not even a ping-pong between CPU-accessible and CPU-inaccessible VRAM; the bo just keeps being blitted within the CPU-accessible part of VRAM.

Maybe radeon_evict_flags should just not add the second placement if the bo's current placement is already CPU-accessible VRAM.
Or it could be a bug in get_node, which should not succeed in that case.
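
The behaviour described above, and the first suggestion, can be illustrated with a small userspace toy model. It is not kernel code: Placement, evict_placements, simulate and the has_room table are hypothetical names, and the allocator state (CPU-inaccessible VRAM full, the other two placements always succeeding) is an assumption chosen to match the traces.

```python
from dataclasses import dataclass

VISIBLE_PAGES = 65536  # fpfn reported in the trace for CPU-inaccessible VRAM


@dataclass(frozen=True)
class Placement:
    name: str    # label used only in this model
    domain: str  # "VRAM" or "GTT"
    fpfn: int    # first page frame number the placement allows


def evict_placements(bo_start, skip_accessible_if_already_there):
    """Hypothetical stand-in for the placement list seen in the trace."""
    placements = [Placement("vram_inaccessible", "VRAM", VISIBLE_PAGES)]
    already_accessible = bo_start < VISIBLE_PAGES
    if not (skip_accessible_if_already_there and already_accessible):
        placements.append(Placement("vram_accessible", "VRAM", 0))
    placements.append(Placement("gtt", "GTT", 0))
    return placements


def simulate(skip_accessible_if_already_there, max_rounds=5):
    """Return the placement the bo lands in on each eviction attempt."""
    # Assumed allocator state: inaccessible VRAM is full, the others succeed.
    has_room = {"vram_inaccessible": False, "vram_accessible": True, "gtt": True}
    domain, start = "VRAM", 0  # the bo begins in CPU-accessible VRAM
    history = []
    for _ in range(max_rounds):
        if domain != "VRAM":
            break  # the bo left VRAM, so the eviction made progress
        for p in evict_placements(start, skip_accessible_if_already_there):
            if has_room[p.name]:
                domain, start = p.domain, p.fpfn
                history.append(p.name)
                break
    return history


print(simulate(False))  # bo stays in CPU-accessible VRAM on every round
print(simulate(True))   # bo is evicted to GTT on the first round
```

In this model the bo never leaves VRAM while the second placement is offered, and the model only stops because of max_rounds. With the second placement skipped, the first eviction lands in GTT. Whether the real driver behaves this way depends on details the model leaves out.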

Note that this happens when the VRAM is nearly full.

FWIW I noticed that amdgpu is doing something different: https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205
vs
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198 

Finally, neither the NMI watchdog nor the kernel soft lockup and hard lockup detectors detect this looping in ioctl(RADEON_CS), maybe because the loop looks like real work to them. radeon_lockup_timeout does not detect it either.

The gpu is a FirePro W600 Cape Verde 2048M.

Thx
Julien

On Thu, Mar 23, 2017 at 8:10 AM, Michel Dänzer <michel@xxxxxxxxxxx> wrote:
On 23/03/17 03:19 AM, Zachary Michaels wrote:
> We were experiencing an infinite loop due to VRAM bos getting added back
> to the VRAM lru on eviction via ttm_bo_mem_force_space,

Can you share more details about what happened? I can imagine that
moving a BO from CPU visible to CPU invisible VRAM would put it back on
the LRU, but next time around it shouldn't hit this code anymore but get
evicted to GTT directly.

Was userspace maybe performing concurrent CPU access to the BOs in question?

> and reverting this commit solves the problem.

I hope we can find a better solution.

--
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel