Re: [PATCH 1/3] drm/i915: audit bo->resource usage

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 31/08/2022 10:38, Christian König wrote:
Am 31.08.22 um 11:26 schrieb Matthew Auld:
On 31/08/2022 09:16, Christian König wrote:
Hi Matthew,

Am 30.08.22 um 12:45 schrieb Matthew Auld:
Hi,

On 30/08/2022 08:33, Christian König wrote:
Hi guys,

can we get an rb/acked-by for this i915 change?

Basically we are just making sure that the driver doesn't crash when bo->resource is NULL and a bo doesn't have any backing store assigned to it.

The Intel CI seems to be happy with this change, so I'm pretty sure it is correct.

It looks like DG2/DG1 (which happen to use TTM here) are no longer loading the module:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&reserved=0?

According to the logs the firmware is failing to load, so perhaps related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare users. See below.


Thanks,
Christian.

Am 24.08.22 um 16:23 schrieb Luben Tuikov:
From: Christian König <christian.koenig@xxxxxxx>

Make sure we can at least move and alloc TT objects without backing store.

Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
  drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
  drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
  2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index bc9c432edffe03..45ce2d1f754cc4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
  {
      struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
                               bdev);
-    struct ttm_resource_manager *man =
-        ttm_manager_type(bo->bdev, bo->resource->mem_type);
      struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
      unsigned long ccs_pages = 0;
      enum ttm_caching caching;
@@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
      if (!i915_tt)
          return NULL;
-    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
-        man->use_tt)
+    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
+        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
          page_flags |= TTM_TT_FLAG_ZERO_ALLOC;

AFAICT i915 was massively relying on everything starting out in a "dummy" system memory resource (ttm_tt), where it then later "transitions" to the real resource. And if we need to clear the memory we rely on ZERO_ALLOC being set before calling into the i915_ttm_move() callback (even when allocating local-memory).

For ttm_bo_type_device objects (userspace stuff) it looks like this was previously handled by ttm_bo_validate() always doing:

  ret = ttm_tt_create(bo, true); /* clear = true */

Which we would always hit since the resource was always "compatible" for the dummy case. But it looks like this is no longer even called, since we can now call into ttm_move with bo->resource == NULL, which still calls tt_create eventually, which not always with clear = true.

All other objects are then ttm_bo_type_kernel so we don't care about clearing, except in the rare case of ALLOC_CPU_CLEAR, which was handled as per above in i915_ttm_tt_create(). But I think here bo->resource is NULL at the start when first creating the object, which will skip setting ZERO_ALLOC, which might explain the CI failure.

The other possible concern (not sure since CI didn't get that far) is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = NULL. It looks like i915_ttm_shrink() was relying on that to unpopulate the ttm_tt. When later calling ttm_bo_validate(), i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , re-populate it and then potentially move it back to local-memory.

What are your thoughts here? Also sorry if i915 is making a bit of mess here.

First of all thanks a lot for taking a look. We previously got reports about strange crashes with this patch, but couldn't really reproduce them (even not by sending them out again). This explains that a bit.

The simplest solution would be to just change the && into a ||, e.g. when previously either no resource is allocated or the resource requires to use a tt we clear it.

That should give you the same behavior as before, but I agree that this is a bit messy.

Yeah, that should do the trick.

That hopefully just leaves i915_ttm_shrink(), which is swapping out shmem ttm_tt and is calling ttm_bo_validate() with empty placements to force the pipeline-gutting path, which importantly unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But AFAICT it looks like that will now also nuke the bo->resource, instead of just leaving it in system memory. My assumption is that when later calling ttm_bo_validate(), it will just do the bo_move_null() in i915_ttm_move(), instead of re-populating the ttm_tt and then potentially copying it back to local-memory?

Well you do ttm_bo_validate() with something like GTT domain, don't you? This should result in re-populating the tt object, but I'm not 100% sure if that really works as expected.

AFAIK for domains we either have system memory (which uses ttm_tt and might be shmem underneath) or local-memory. But perhaps i915 is doing something wrong here, or abusing TTM in some way. I'm not sure tbh.

Anyway, I think we have two cases here:

- We have some system memory only object. After doing i915_ttm_shrink(), bo->resource is now NULL. We then call ttm_bo_validate() at some later point, but here we don't need to copy anything, but it also looks like ttm_bo_handle_move_mem() won't populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of this, but now we just call ttm_bo_move_null().

- We have a local-memory only object, which was evicted to shmem, and then swapped out by the shrinker like above. The bo->resource is NULL. However this time when calling ttm_bo_validate() we need to actually do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. i915_ttm_move() was taking care of this, but now we just call ttm_bo_move_null().

Perhaps i915 is doing something wrong in the above two cases?


Thanks,
Christian.



I've been considering to replacing the ttm_bo_type with a bunch of behavior flags for a bo. I'm hoping that this will clean things up a bit.

Regards,
Christian.


      caching = i915_ttm_select_tt_caching(obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 9a7e50534b84bb..c420d1ab605b6f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
      bool clear;
      int ret;
-    if (GEM_WARN_ON(!obj)) {
+    if (GEM_WARN_ON(!obj) || !bo->resource) {
          ttm_bo_move_null(bo, dst_mem);
          return 0;
      }






[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux