Hi Michel,

When it happens, the main thread of our GL-based app is stuck on an ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging, but the same thing happens when it is true. The other threads are only si_shader:0,1,2,3 and are doing nothing, just waiting for jobs. I can also do sudo gdb -p $(pidof Xorg) to block the X11 server, to make sure there is no ping-pong between two processes. No other process is loading dri/radeonsi_dri.so.

Adding a few traces shows that the above ioctl call is looping forever in
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
and comes from Mesa:
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454

After adding even more traces I can see that the bo which is being indefinitely evicted has the flag RADEON_GEM_NO_CPU_ACCESS, and it gets 3 potential placements after calling radeon_evict_flags:
  1: VRAM CPU inaccessible, fpfn is 65536
  2: VRAM CPU accessible, fpfn is 0
  3: GTT, fpfn is 0

It looks like it continuously succeeds in moving to the second placement. So I might be wrong, but it does not even look like a ping-pong between CPU accessible and inaccessible VRAM; the bo just keeps being blitted within the CPU accessible part of VRAM.

Maybe radeon_evict_flags should simply not add the second placement if the bo's current placement is already CPU accessible VRAM (a rough sketch of that idea is at the end of this mail). Or it could be a bug in the get_node that should not succeed in that case. Note that this happens when VRAM is nearly full.

FWIW I noticed that amdgpu is doing something different:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205
vs
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198

Finally, the NMI watchdog and the kernel soft and hard lockup detectors do not detect this looping in that ioctl(RADEON_CS), maybe because they consider it to be doing real work. Same for radeon_lockup_timeout: it does not detect it.

The GPU is a FirePro W600 (Cape Verde, 2048M).

Thx,
Julien

On Thu, Mar 23, 2017 at 8:10 AM, Michel Dänzer <michel at daenzer.net> wrote:
> On 23/03/17 03:19 AM, Zachary Michaels wrote:
> > We were experiencing an infinite loop due to VRAM bos getting added back
> > to the VRAM lru on eviction via ttm_bo_mem_force_space,
>
> Can you share more details about what happened? I can imagine that
> moving a BO from CPU visible to CPU invisible VRAM would put it back on
> the LRU, but next time around it shouldn't hit this code anymore but get
> evicted to GTT directly.
>
> Was userspace maybe performing concurrent CPU access to the BOs in
> question?
>
> > and reverting this commit solves the problem.
>
> I hope we can find a better solution.
>
> --
> Earthling Michel Dänzer               | http://www.amd.com
> Libre software enthusiast             | Mesa and X developer
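
For illustration, here is a minimal sketch of the guard suggested above: drop the CPU visible VRAM placement when the bo being evicted is already in visible VRAM. This is not a tested patch; the field and helper names (rdev->mc.visible_vram_size, bo->mem.start, placements[].fpfn, TTM_PL_FLAG_VRAM, radeon_ttm_placement_from_domain) are assumed from drivers/gpu/drm/radeon/ around this kernel version, and the function name itself is hypothetical.

/*
 * Sketch only -- would live in drivers/gpu/drm/radeon/radeon_ttm.c next
 * to radeon_evict_flags(), so it relies on "radeon.h" for the types used.
 *
 * Idea: if the bo already sits in the CPU visible part of VRAM, do not
 * offer the CPU visible VRAM placement (the "fpfn = 0" one above) again,
 * so eviction can only target CPU invisible VRAM or GTT and cannot keep
 * blitting the bo around inside visible VRAM.
 */
static void radeon_evict_vram_placements(struct radeon_bo *rbo)
{
	struct ttm_buffer_object *bo = &rbo->tbo;
	unsigned visible_fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
	bool in_visible_vram = bo->mem.start < visible_fpfn;
	unsigned i, kept = 0;

	/* Start from the usual VRAM | GTT placement list. */
	radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_VRAM |
					      RADEON_GEM_DOMAIN_GTT);

	for (i = 0; i < rbo->placement.num_placement; i++) {
		struct ttm_place *place = &rbo->placements[i];

		/* Skip the CPU visible VRAM placement if the bo is already
		 * there; moving it back into visible VRAM is what keeps the
		 * eviction loop alive. */
		if ((place->flags & TTM_PL_FLAG_VRAM) &&
		    place->fpfn < visible_fpfn && in_visible_vram)
			continue;

		rbo->placements[kept++] = *place;
	}
	rbo->placement.num_placement = kept;
	rbo->placement.num_busy_placement = kept;
}

With such a guard, a bo that is already in visible VRAM would only be offered the fpfn = 65536 VRAM placement and GTT, which should break the loop described above; whether that interacts badly with concurrent CPU access to the bo (Michel's question in the quoted mail) is exactly the open question.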