[PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"

jisorce@xxxxxxxxxx (Julien Isorce) · Thu, 23 Mar 2017 09:26:24 +0000

Hi Michel,

When it happens, the main thread of our gl based app is stuck on a
ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging but same
thing happens if true. Other threads are only si_shader:0,1,2,3 and are
doing nothing, just waiting for jobs. I can also do sudo gdb -p $(pidof
Xorg) to block the X11 server, to make sure there is no ping pong between 2
processes. All other processes are not loading dri/radeonsi_dri.so . And
adding a few traces shows that the above ioctl call is looping for ever on
https://github.com/torvalds/linux/blob/master/drivers/gpu/
drm/ttm/ttm_bo.c#L819 and comes from mesa
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454
.

After adding even more traces I can see that the bo, which is being
indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
And it gets 3 potential placements after calling "radeon_evict_flags".
 1: VRAM cpu inaccessible, fpfn is 65536
 2: VRAM cpu accessible, fpfn is 0
 3: GTT, fpfn is 0

And it looks like it continuously succeeds to move on the second placement.
So I might be wrong but it looks it is not even a ping pong between VRAM
accessible / not accessible, it just keeps being blited in the CPU
accessible part of the VRAM.

Maybe radeon_evict_flags should just not add the second placement if its
current placement is already VRAM cpu accessible.
Or could be a bug in the get_node that should not succeed in that case.

Note that this happens when the VRAM is nearly full.

FWIW I noticed that amdgpu is doing something different:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205
vs
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198

Finally the NMI watchdog and the kernel soft lockup and hard lockup
detectors do not detect this looping in that ioctl(RADEON_CS). Maybe
because it estimates it is doing real work. Same for radeon_lockup_timeout,
it does not detect it.

The gpu is a FirePro W600 Cape Verde 2048M.

Thx
Julien

On Thu, Mar 23, 2017 at 8:10 AM, Michel DÃ¤nzer <michel at daenzer.net> wrote:

> On 23/03/17 03:19 AM, Zachary Michaels wrote:
> > We were experiencing an infinite loop due to VRAM bos getting added back
> > to the VRAM lru on eviction via ttm_bo_mem_force_space,
>
> Can you share more details about what happened? I can imagine that
> moving a BO from CPU visible to CPU invisible VRAM would put it back on
> the LRU, but next time around it shouldn't hit this code anymore but get
> evicted to GTT directly.
>
> Was userspace maybe performing concurrent CPU access to the BOs in
> question?
>
>
> > and reverting this commit solves the problem.
>
> I hope we can find a better solution.
>
>
> --
> Earthling Michel DÃ¤nzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170323/7254caeb/attachment-0001.html>