Re: [infinite loop] need some info about TTM's buffer eviction mechanism

> And then try again (until ?).

Until the LRU is empty.

You see, there is one LRU per domain, so when a buffer is evicted from VRAM it is moved to the GTT domain and also removed from the VRAM domain's LRU.

When no other task is trying to do a CS, the LRU will sooner or later become empty.

One possible explanation is that another process/thread is moving buffers back in while the first process is trying to evict them.
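To make the per-domain LRU idea concrete, here is a minimal user-space sketch. It is not TTM code: the types and functions (toy_device, toy_lru_add, toy_evict_first_vram) are hypothetical names invented for illustration, and the fixed-size arrays stand in for the kernel's linked lists. It only shows that each eviction shrinks the VRAM LRU by one entry, so eviction runs out of candidates unless something adds buffers back.

```c
#include <stddef.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
enum toy_domain { TOY_VRAM, TOY_GTT, TOY_DOMAIN_COUNT };

#define TOY_MAX_BUFFERS 8

struct toy_lru {
    int ids[TOY_MAX_BUFFERS];   /* buffer ids, oldest first */
    size_t count;
};

struct toy_device {
    struct toy_lru lru[TOY_DOMAIN_COUNT];   /* one LRU per domain */
};

/* Append a buffer to the tail of a domain's LRU; 0 on success, -1 if full. */
int toy_lru_add(struct toy_device *dev, enum toy_domain d, int id)
{
    struct toy_lru *l = &dev->lru[d];

    if (l->count == TOY_MAX_BUFFERS)
        return -1;
    l->ids[l->count++] = id;
    return 0;
}

/*
 * Evict the oldest VRAM buffer: remove it from the VRAM LRU and put it
 * on the GTT LRU. Returns 0 on success, -1 if the VRAM LRU is empty,
 * -2 if the toy GTT LRU has no room (nothing is moved in that case).
 */
int toy_evict_first_vram(struct toy_device *dev)
{
    struct toy_lru *vram = &dev->lru[TOY_VRAM];
    int id;

    if (vram->count == 0)
        return -1;
    if (dev->lru[TOY_GTT].count == TOY_MAX_BUFFERS)
        return -2;

    id = vram->ids[0];
    for (size_t i = 1; i < vram->count; i++)
        vram->ids[i - 1] = vram->ids[i];
    vram->count--;

    return toy_lru_add(dev, TOY_GTT, id);
}
```

In this model, calling toy_evict_first_vram repeatedly on a device with two VRAM buffers succeeds twice and then returns -1. If another thread called toy_lru_add(dev, TOY_VRAM, ...) between evictions, the VRAM LRU would never drain, which is the scenario described above.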

Regards,
Christian.

On 14.03.2017 at 17:31, Julien Isorce wrote:
Hello,

While debugging a softlock that happens on an ioctl(RADEON_CS), I found that it keeps looping indefinitely in the following loop: 

It would be great if someone could explain the logic behind this loop. My understanding is that it tries to get a free node in which to place the current buffer object by calling "ttm_bo_man_get_node". If that fails, leaving mem->mm_node NULL (internally -ENOSPC), it tries to evict another buffer from the LRU by calling "ttm_mem_evict_first". And then try again (until ?).
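The loop as understood above can be sketched in user space as follows. This is a simplified model and not the kernel implementation: toy_vram, toy_get_node, toy_evict_first and toy_force_space are hypothetical stand-ins, free space is reduced to a page counter, and -1 is an arbitrary toy error value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
#define TOY_LRU_SLOTS 8

struct toy_vram {
    size_t free_pages;                  /* pages currently free */
    size_t lru_sizes[TOY_LRU_SLOTS];    /* evictable buffer sizes, oldest first */
    size_t lru_count;
};

/* Stand-in for the "get a free node" step: true if the allocation fits. */
bool toy_get_node(struct toy_vram *v, size_t pages)
{
    if (v->free_pages < pages)
        return false;
    v->free_pages -= pages;
    return true;
}

/* Stand-in for the "evict first" step: 0 on success, -1 if the LRU is empty. */
int toy_evict_first(struct toy_vram *v)
{
    if (v->lru_count == 0)
        return -1;

    v->free_pages += v->lru_sizes[0];   /* evicting frees the buffer's pages */
    for (size_t i = 1; i < v->lru_count; i++)
        v->lru_sizes[i - 1] = v->lru_sizes[i];
    v->lru_count--;
    return 0;
}

/* Try to allocate; on failure evict the oldest buffer and retry. */
int toy_force_space(struct toy_vram *v, size_t pages)
{
    for (;;) {
        if (toy_get_node(v, pages))
            return 0;
        if (toy_evict_first(v) != 0)
            return -1;  /* nothing left to evict: this is what ends the loop */
    }
}
```

In this model the loop always terminates, because every successful eviction removes one LRU entry. It would spin forever only if eviction kept reporting success without the LRU shrinking or space becoming available.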

For some reason, after an app that uploads a lot of images through GL has been running for a while, these two functions keep returning 0 with mem->mm_node still NULL, so the "while (true)" loops indefinitely. As a result, the process is stuck in that ioctl forever.

A nasty workaround is to break out of the loop once the number of iterations exceeds a threshold. It looks like it very rarely goes over 200, so breaking after more than 200 iterations and returning -ENOMEM lets the application regain control instead of being stuck. This is quite helpful for the debugging phase but definitely not a proper fix.
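As an illustration of such an iteration cap, here is a small user-space sketch. It is not the actual patch: toy_force_space_capped, the callback types and the two helper callbacks are hypothetical, and the helpers exist only to reproduce the reported symptom (eviction returns 0 but the allocation never succeeds).

```c
#include <errno.h>
#include <stdbool.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
#define TOY_MAX_EVICT_ITERATIONS 200

typedef bool (*toy_alloc_fn)(void *ctx);   /* true when a node was obtained */
typedef int (*toy_evict_fn)(void *ctx);    /* 0 on success, negative on error */

/* Allocate-or-evict loop that gives up after a fixed number of iterations. */
int toy_force_space_capped(toy_alloc_fn try_alloc, toy_evict_fn evict_first,
                           void *ctx)
{
    for (unsigned int i = 0; i < TOY_MAX_EVICT_ITERATIONS; i++) {
        int ret;

        if (try_alloc(ctx))
            return 0;
        ret = evict_first(ctx);
        if (ret != 0)
            return ret;
    }
    return -ENOMEM;    /* debugging aid only: bail out instead of spinning */
}

/* Models the symptom: the allocation never succeeds. */
bool toy_alloc_never(void *ctx)
{
    (void)ctx;
    return false;
}

/* Models the symptom: eviction "succeeds" but frees nothing; counts calls. */
int toy_evict_noop(void *ctx)
{
    ++*(unsigned int *)ctx;
    return 0;
}
```

Calling toy_force_space_capped(toy_alloc_never, toy_evict_noop, &calls) with an unsigned int counter returns -ENOMEM after 200 eviction attempts instead of looping forever. The cap only hides the underlying problem.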

A colleague found that replacing ttm_bo_unreserve with __ttm_bo_unreserve here https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L751 fixes this softlock, because the latter does not re-add the evicted buffer to the LRU.
But we are unsure whether this is a proper fix or just a workaround, given that this line has existed since the first TTM commit in 2009. Any comments?

Also, it looks like there is a recursion in the following call chain:

radeon_cs_ioctl
radeon_cs_parser_relocs
radeon_bo_list_validate
ttm_bo_validate
ttm_bo_move_buffer
ttm_bo_mem_space  @
ttm_bo_mem_force_space
ttm_mem_evict_first
ttm_bo_evict
ttm_bo_mem_space @
ttm_mem_evict_first
...

It looks like it is meant to work this way, but this makes it complicated to follow. So any input would be much appreciated, especially about the eviction mechanism, the bo->evicted flag, and how TTM manages the LRU in corner cases such as when VRAM is full.

I tried kernel 4.4, 4.8 and git HEAD from last week.

Thx
Julien








_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel

