Hello,
While debugging a softlock that happens on an
ioctl(RADEON_CS), I found that it keeps looping indefinitely
in the following loop:
It would be great if someone could explain the logic
behind this loop. My understanding is that it tries to get
a free node for the current buffer object by calling
"ttm_bo_man_get_node". If that fails, leaving mem->mm_node
NULL (internally -ENOSPC), it tries to evict another
buffer from the LRU by calling "ttm_mem_evict_first", and then
tries again (until?).
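The retry logic described above can be modeled in userspace
roughly as follows. This is only a sketch of that understanding,
not the TTM code: every toy_* name is hypothetical, the slot
counter stands in for the real range manager, and returning
-EBUSY when nothing is left to evict is an assumption of this
model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a memory manager (hypothetical, not a TTM type). */
struct toy_man {
    int free_slots; /* slots that can be handed out right now */
    int evictable;  /* buffers on the toy LRU that can still be evicted */
};

/*
 * Models the "get node" step: it always returns 0, and signals
 * "no space" only by leaving *node NULL.
 */
static int toy_get_node(struct toy_man *man, void **node)
{
    static int slot_token; /* any non-NULL address will do */

    *node = NULL;
    if (man->free_slots > 0) {
        man->free_slots--;
        *node = &slot_token;
    }
    return 0;
}

/*
 * Models the "evict first" step: frees one slot, or returns -EBUSY
 * when the toy LRU is empty (assumption of this sketch).
 */
static int toy_evict_first(struct toy_man *man)
{
    if (man->evictable == 0)
        return -EBUSY;
    man->evictable--;
    man->free_slots++;
    return 0;
}

/* Allocate, and on "no space" evict one buffer and retry. */
static int toy_force_space(struct toy_man *man, void **node, int *evictions)
{
    int ret;

    assert(man != NULL && node != NULL && evictions != NULL);
    *evictions = 0;
    for (;;) {
        ret = toy_get_node(man, node);
        if (ret != 0)
            return ret;
        if (*node != NULL)
            return 0; /* got a node, done */
        ret = toy_evict_first(man);
        if (ret != 0)
            return ret; /* nothing left to evict */
        (*evictions)++;
    }
}
```

In this model the loop always terminates, because each pass
either returns or consumes one evictable buffer. It can only spin
forever if the eviction step reports success without actually
making a node available.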
For some reason, at some point while running an app
that uploads a lot of images through GL, these 2 functions
keep returning 0 with mem->mm_node still NULL, so the
"while (true)" loops indefinitely. As a result, the process
is stuck in that ioctl forever.
A nasty workaround is to break out of the loop after a
threshold number of iterations. The iteration count seems to
very rarely go over 200, so breaking after more than 200
iterations and returning -ENOMEM lets the application regain
control instead of being stuck. This is quite helpful during
debugging but is definitely not a proper fix.
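The iteration cap can be illustrated with a small userspace
sketch in C. This is not a kernel patch: all toy_* names and the
callback signatures are hypothetical, and the two stub functions
simply reproduce the reported failure mode (both steps return 0
while the node stays NULL). Only the threshold of 200 and the
-ENOMEM return value come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_EVICT_RETRIES 200 /* threshold quoted in this mail */

/* Hypothetical callback types standing in for the two TTM steps. */
typedef int (*toy_get_node_fn)(void *ctx, void **node);
typedef int (*toy_evict_fn)(void *ctx);

/*
 * Same allocate/evict/retry shape as the unbounded loop, but it
 * gives up with -ENOMEM once the retry budget is exhausted.
 */
static int toy_force_space_bounded(void *ctx,
                                   toy_get_node_fn get_node,
                                   toy_evict_fn evict_first,
                                   void **node,
                                   unsigned int *iterations)
{
    unsigned int i;
    int ret;

    assert(get_node != NULL && evict_first != NULL);
    assert(node != NULL && iterations != NULL);

    *node = NULL;
    *iterations = 0;
    for (i = 0; i < TOY_MAX_EVICT_RETRIES; i++) {
        *iterations = i + 1;
        ret = get_node(ctx, node);
        if (ret != 0)
            return ret;
        if (*node != NULL)
            return 0; /* allocation succeeded */
        ret = evict_first(ctx);
        if (ret != 0)
            return ret;
    }
    return -ENOMEM; /* budget exhausted, hand control back */
}

/* Stub: reports success but never provides a node. */
static int toy_stuck_get_node(void *ctx, void **node)
{
    (void)ctx;
    *node = NULL;
    return 0;
}

/* Stub: reports success but frees nothing. */
static int toy_stuck_evict(void *ctx)
{
    (void)ctx;
    return 0;
}
```

Calling toy_force_space_bounded(NULL, toy_stuck_get_node,
toy_stuck_evict, &node, &iterations) returns -ENOMEM after 200
passes instead of spinning. The cap only bounds the symptom; it
says nothing about why eviction reports success without freeing
space.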
But we are unsure whether this is a proper fix or just a
workaround, given that this line has existed since the first
TTM commit in 2009. Any comments?
Also it looks like there is a recursion from:
radeon_cs_ioctl
  radeon_cs_parser_relocs
    radeon_bo_list_validate
      ttm_bo_validate
        ttm_bo_move_buffer
          ttm_bo_mem_space @
            ttm_bo_mem_force_space
              ttm_mem_evict_first
                ttm_bo_evict
                  ttm_bo_mem_space @
                    ttm_mem_evict_first
                      ...
It looks like it is meant to work like this, but this makes
it complicated to follow. So any input would be much
appreciated, especially about the eviction mechanism, the
bo->evicted flag, and how TTM manages the LRU in corner cases
such as when VRAM is full.
I tried kernel 4.4, 4.8 and git HEAD from last week.
Thx
Julien