Re: [infinite loop] need some info about TTM's buffer eviction mechanism

Christian König <deathsimple@xxxxxxxxxxx> · Tue, 14 Mar 2017 17:40:26 +0100



      > And then try again (until ?).

      Until the LRU is empty.

      
      You have one LRU per domain, so when a buffer is evicted from
      VRAM it is moved to the GTT domain and also removed from the
      VRAM domain's LRU.

      
      When no other task is trying to do a CS, the LRU will sooner or
      later become empty.

      
      One possible explanation is that another process/thread is
      moving buffers back in while the first process is trying to
      evict them.
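
To make the per-domain LRU behaviour concrete, here is a minimal sketch in plain C. It is a toy model and not the real TTM code: `struct device_model`, `struct buffer`, `lru_add_tail`, `lru_pop_head`, `evict_first_from_vram` and `evict_all_from_vram` are hypothetical names invented for illustration. It only shows that each eviction takes a buffer off the VRAM LRU and puts it on the GTT LRU, so with nobody moving buffers back in, the VRAM LRU must eventually become empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model: these are NOT the real TTM structures. */
enum domain { DOMAIN_VRAM, DOMAIN_GTT, DOMAIN_COUNT };

struct buffer {
    int id;
    enum domain domain;
    struct buffer *next;            /* next entry on the owning domain's LRU */
};

struct lru {
    struct buffer *head;            /* least recently used entry */
    struct buffer *tail;            /* most recently used entry */
};

struct device_model {
    struct lru lru[DOMAIN_COUNT];   /* one LRU per domain */
};

/* Append a buffer at the most-recently-used end of an LRU. */
static void lru_add_tail(struct lru *l, struct buffer *b)
{
    b->next = NULL;
    if (l->tail)
        l->tail->next = b;
    else
        l->head = b;
    l->tail = b;
}

/* Detach and return the least recently used buffer, or NULL if empty. */
static struct buffer *lru_pop_head(struct lru *l)
{
    struct buffer *b = l->head;

    if (!b)
        return NULL;
    l->head = b->next;
    if (!l->head)
        l->tail = NULL;
    b->next = NULL;
    return b;
}

/*
 * Evict the least recently used VRAM buffer into GTT.
 * Returns 0 on success, -1 when the VRAM LRU is empty.
 */
static int evict_first_from_vram(struct device_model *dev)
{
    struct buffer *b = lru_pop_head(&dev->lru[DOMAIN_VRAM]);

    if (!b)
        return -1;
    assert(b->domain == DOMAIN_VRAM);
    b->domain = DOMAIN_GTT;
    lru_add_tail(&dev->lru[DOMAIN_GTT], b);
    return 0;
}

/* Keep evicting until the VRAM LRU is empty; returns the eviction count. */
static int evict_all_from_vram(struct device_model *dev)
{
    int evicted = 0;

    while (evict_first_from_vram(dev) == 0)
        evicted++;
    return evicted;
}
```

In this model the loop in `evict_all_from_vram` always terminates, because nothing ever re-adds a buffer to the VRAM LRU. A second thread calling `lru_add_tail` on the VRAM LRU at the same rate would be one way to keep it from ever draining.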

      
      Regards,

      Christian.

      
      On 14.03.2017 at 17:31, Julien Isorce wrote:

    
      Hello,
        

        While debugging a softlock that happens in an
          ioctl(RADEON_CS), I found that it keeps spinning indefinitely
          in the following loop: 
        https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
        

        It would be great if someone could explain the logic
          behind this loop. My understanding is that it tries to get a
          free node in which to place the current buffer object by
          calling "ttm_bo_man_get_node". If that fails, leaving
          mem->mm_node as NULL (internally -ENOSPC), it then tries to
          evict another buffer from the LRU by calling
          "ttm_mem_evict_first". And then try again (until ?).
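
The loop as I understand it can be sketched as follows. This is a toy model under stated assumptions, not the kernel code: `struct vram_model`, `model_get_node`, `model_evict_first`, `model_force_space` and `MODEL_ENOMEM` are hypothetical names, free space is a plain page counter (fragmentation is ignored), and the LRU is an array of buffer sizes with the oldest first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENOMEM 12     /* stand-in error code for this sketch */
#define MAX_RESIDENT 8

/* Hypothetical model of a VRAM manager; not the real TTM state. */
struct vram_model {
    size_t free_pages;              /* pages currently free */
    size_t resident[MAX_RESIDENT];  /* sizes of evictable buffers, oldest first */
    size_t first;                   /* index of the oldest evictable buffer */
    size_t count;                   /* number of evictable buffers left */
};

/* Try to reserve space; returns true when a "node" was obtained. */
static bool model_get_node(struct vram_model *m, size_t pages)
{
    if (pages > m->free_pages)
        return false;
    m->free_pages -= pages;
    return true;
}

/* Evict the oldest buffer; fails once the LRU is empty. */
static int model_evict_first(struct vram_model *m)
{
    if (m->count == 0)
        return -MODEL_ENOMEM;
    m->free_pages += m->resident[m->first];
    m->first++;
    m->count--;
    return 0;
}

/* Retry allocation, evicting one buffer per failed attempt. */
static int model_force_space(struct vram_model *m, size_t pages)
{
    while (true) {
        int ret;

        if (model_get_node(m, pages))
            return 0;               /* got space: leave the loop */
        ret = model_evict_first(m);
        if (ret)
            return ret;             /* LRU empty: the only other exit */
    }
}
```

In this model the loop has exactly two exits: the allocation succeeds, or the eviction step reports an error because nothing is left to evict. If the eviction step keeps returning 0 without any space becoming available, neither exit is ever taken.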
        

        For some reason, at some point while running an app that
          uploads a lot of images through GL, these 2 functions keep
          returning 0 with mem->mm_node still NULL, so the
          "while (true)" loop never exits. As a result, the process
          stays stuck in that ioctl forever.
        

        A nasty workaround is to break out of the loop once the
          number of iterations exceeds a threshold. It looks like the
          loop very rarely goes over 200 iterations, so breaking after
          more than 200 iterations and returning -ENOMEM lets the
          application regain control instead of being stuck. This is
          quite helpful for the debugging phase but definitely not a
          proper fix.
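
For reference, the debugging workaround has roughly the following shape. Again this is a toy model and not a kernel patch: `struct stuck_model`, `stuck_evict_first`, `bounded_force_space`, `SKETCH_ENOMEM` and `SKETCH_MAX_EVICT_RETRIES` are hypothetical names, and `stuck_evict_first` deliberately models the symptom described above (it reports success but frees nothing).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ENOMEM 12                /* stand-in error code for this sketch */
#define SKETCH_MAX_EVICT_RETRIES 200    /* threshold taken from the observation above */

/* Hypothetical model of a manager whose evictions never free space. */
struct stuck_model {
    size_t free_pages;
    unsigned long evict_calls;          /* how often eviction was attempted */
};

/* Models the reported symptom: returns 0 but no space becomes available. */
static int stuck_evict_first(struct stuck_model *m)
{
    m->evict_calls++;
    return 0;
}

/* Same retry loop as before, but giving up after a fixed number of evictions. */
static int bounded_force_space(struct stuck_model *m, size_t pages)
{
    unsigned int tries;

    for (tries = 0; ; tries++) {
        int ret;

        if (pages <= m->free_pages) {
            m->free_pages -= pages;
            return 0;
        }
        if (tries >= SKETCH_MAX_EVICT_RETRIES)
            return -SKETCH_ENOMEM;      /* bail out instead of spinning forever */
        ret = stuck_evict_first(m);
        if (ret)
            return ret;
    }
}
```

The cap only turns a hang into an error return; it does not address why eviction stops making progress.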
        

        A colleague found that replacing ttm_bo_unreserve with
          __ttm_bo_unreserve here https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L751
          fixes this softlock, because the latter does not re-add the
          evicted buffer to the LRU.
        But we are unsure whether this is a proper fix or just a
          workaround, given that this line has existed since the first
          TTM commit in 2009. Any comments?
        

        Also it looks like there is a recursion from:
        

          radeon_cs_ioctl
          radeon_cs_parser_relocs
          radeon_bo_list_validate
          ttm_bo_validate
          ttm_bo_move_buffer
          ttm_bo_mem_space  @
          ttm_bo_mem_force_space
          ttm_mem_evict_first
          ttm_bo_evict
          ttm_bo_mem_space @
        
        ttm_mem_evict_first
        ...
        

        It looks like it is meant to work this way, but that makes it
          complicated to follow. So any input would be much appreciated,
          especially about the eviction mechanism and the bo->evicted
          flag, and about how TTM manages the LRU in corner cases such
          as when the VRAM is full.
        

        I tried kernel 4.4, 4.8 and git HEAD from last week.
        

        Thx
        Julien
        

      _______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel

    