Re: [PATCH v2] drm/i915/gt: update request engine before removing virtual GuC engine

John Harrison <john.c.harrison@xxxxxxxxx> · Mon, 24 Jul 2023 12:40:10 -0700

On 7/19/2023 05:43, Tvrtko Ursulin wrote:
On 19/07/2023 11:41, Andrzej Hajda wrote:
On 18.07.2023 17:48, Tvrtko Ursulin wrote:
On 17/07/2023 19:03, John Harrison wrote:
On 7/13/2023 05:11, Tvrtko Ursulin wrote:
On 13/07/2023 12:09, Andrzej Hajda wrote:
Hi,

On 13.07.2023 09:39, Tvrtko Ursulin wrote:
On 12/07/2023 19:54, John Harrison wrote:
On 7/12/2023 09:27, Andrzej Hajda wrote:
On 12.07.2023 14:35, Tvrtko Ursulin wrote:
On 12/07/2023 13:18, Andrzej Hajda wrote:
On 11.07.2023 17:27, Tvrtko Ursulin wrote:
On 11/07/2023 14:58, Andrzej Hajda wrote:
On 11.07.2023 13:34, Andi Shyti wrote:
Hi Andrzej,

drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 
+++++++++++
          1 file changed, 11 insertions(+)

         diff --git 
a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
         index a0e3ef1c65d246..2c877ea5eda6f0 100644
         --- 
a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
         +++ 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
         @@ -3461,6 +3461,8 @@ static void 
guc_prio_fini(struct i915_request *rq, struct 
intel_context *ce)
          static void remove_from_context(struct 
i915_request *rq)
          {
                 struct intel_context *ce = 
request_to_scheduling_context(rq);
         +       struct intel_engine_cs *engine;
         +       intel_engine_mask_t tmp;

GEM_BUG_ON(intel_context_is_child(ce));

         @@ -3478,6 +3480,15 @@ static void 
remove_from_context(struct i915_request *rq)

atomic_dec(&ce->guc_id.ref);
i915_request_notify_execute_cb_imm(rq);
         +
         +       /*
         +        * GuC virtual engine can disappear 
after this call, so let's assign
         +        * something valid, as driver expects 
this to be always valid pointer.
         +        */
         + for_each_engine_masked(engine, 
rq->engine->gt, rq->execution_mask, tmp) {
         +               rq->engine = engine;

     yes... here the context might lose the virtual 
engine... I wonder
     whether this is the rigth solution, though. Maybe 
we should set
     rq->engine = NULL; and check for NULL? Don't know.

Setting NULL causes occasional null page de-reference in

i915_request_wait_timeout:

mutex_release(&rq->engine->gt->reset.mutex.dep_map, 
_THIS_IP_)

rq->engine after removing rq from context is (IMHO) used 
as a set of aliases
for gt and i915 (despite rq itself contains the alias to 
i915).
without investigating further, but maybe that code is not 
even
supposed to be executed, at this point, if the request's 
assigned
virtual engine is removed.

Real tests show it is executed and the function 
i915_request_wait_timeout is quite generic
I guess it is quite typical use-case, the only question is 
about timings - what happens earlier -
finalization of i915_request_wait_timeout or context removal.

The other point rq->engine is accessed after context 
removal is i915_fence_release -
there is long comment there regarding virtual context and 
reuse retired rq.
Anyway calling there "intel_engine_is_virtual(rq->engine)" 
is risky without this patch and KASAN complains clearly 
about it:
http://gfx-ci.igk.intel.com/tree/drm-tip/kasan.html?testfilter=gem_exec_balancer 

Looks like a bug introduced in bcb9aa45d5a0 ("Revert 
"drm/i915: Hold reference to intel_context over life of 
i915_request""), which was a partial revert of 1e98d8c52ed5 
("drm/i915: Hold reference to intel_context over life of 
i915_request").

Ie. if 1e98d8c52ed5 recognised the problem with 
disappearing rq->engine, then I am confused how 
bcb9aa45d5a0 left the rq->engine dereference in there after 
removing the extra reference.

Could it be that the intel_engine_is_virtual check simply 
needs to be removed from i915_fence_release, restoring 
things to how they were before 1e98d8c52ed5? Could you try 
it out?

I have already tried something similar [1] and KASAN bugs 
disappeared, or more precisely gem_exec_balance tests 
passed. But I have been warned by Nirmoy guc virtual engines 
can be created for only one real engine (ie. 
is_power_of_2(rq->execution_mask) is true but rq->engine 
points to virtual engine).

[1]: https://patchwork.freedesktop.org/series/118879/

Ugh.. Try involving media umd folks to see if they need a 
single engine virtual engine? Or we could always just not 
create it in the driver, I mean just use the physical one.

In case there is single physical engine 
intel_engine_create_virtual falls back to intel_context_create 
(no virtual engine), but in case of parallel contexts there is 
special KMD flag FORCE_VIRTUAL which enforces virtual engine 
even for single physical engine. So it seems to be KMD concept.

Anyway is it worth investigating how to make 
"is_power_of_2(rq->execution_mask)" indication of dangling 
engine pointer? It will not help in 1st case:
mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_)

There seems to be a fundamental problem here. Object 1 (rq) is 
holding a pointer to a reference counted and transient object 2 
(engine) but without taking a reference count for itself. That 
is a Bad Thing(tm).

Engine is not ref counted (at least directly), hardware engines 
seems to be even persistent across whole life of i915.
I guess before introduction of virtual engines the assumption 
about validity of rq->engine was correct and developers used it 
to access rq->engine->gt, rq->engine->i915, etc
So the problems described here are probably leftovers of this 
change.
After virtual engines were introduced 
"is_power_of_2(rq->execution_mask)" was used to detect rq->engine 
is virtual.
And after adding parallel engines it does not seem to be valid 
check anymore.

I'm not following the description in the revert patch as to why 
rq can't ref count the context/engine. Is there actually a 
recursive counting problem? Or is it just a lifetime issue 
caused by requests hanging around indefinitely because they are 
locked by a user process?

Either way, jumping through convoluted hoops to ensure the code 
does not attempt to dereference a dangling pointer seems like 
the wrong fix. Removing the engine pointer when the request is 
completed and no longer dependent upon an engine (but before 
the engine can possibly be destroyed) seems like a much better 
solution. And then making the request handling code check for 
and cope with a null engine pointer. It sounds like the only 
problem there is the above mutex, but there is an alternate 
route to that? Although why a completed request would need 
access to a GT reset mutex seems confusing. If the request is 
done, then what connection does it still have to the GT?

Agreed in principle but the question is how invasive would it be 
to change the rules.

With the latest info that the issue is really just the GuC 
_parallel_ engine setup, and looking at the code, I wonder if we 
couldn't just flag the rq->flags with "kernel context request". 
The code in i915_fence_release claims the rq pool is only 
relevant for those so it sounds it would be safe to skip 
everything else based on that new flag.

For the mutex_release path, presumable the bad deref is only 
_after_ the wait, right? (Only once the request has been retired.)

In which case caching the gt pointer at the start of 
i915_request_wait_timeout would be sufficient.

That should be a few lines fixup overall and then the idea of 
allowing rq->engine to be reset to NULL can be explored more 
leisurely.

I guess:
- setting engine to NULL in remove_from_context,
- caching gt pointer,
- checking for null pointer in i915_fence_release

should be enough to solve current issue. However I am not sure if 
there are no more dragons hidden in other execution paths. I will 
try inspect the code.

What you have in this patch, cheat by replacing the engine, 
*might* work for the short term *if* you can make sure all 
parallel readers will see the updated rq->engine pointer in time, 
before the old one gets freed.

For the longer term solution - maybe making the engine reference 
counted?
That was the point of the original solution - having the request 
reference count the context. The context is what owns the virtual 
engine. So ensuring that the context can't be destroyed while there 
are requests outstanding on it ensured that rq->engine would always 
be valid. Nice simple and clean solution.So why did that get 
reverted? What is the problem with VM_BIND and having requests 
ensure that their resources stay alive for the lifetime of the 
request?

Don't ask me, but it perhaps it does read a bit vague from the 
commit message on its own:

commit bcb9aa45d5a0e11ef91245330c53cde214d15e8d (tag: 
intel/CI_DIGN_563)
Author: Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx>
Date:   Wed Jun 15 00:13:47 2022 +0530

    Revert "drm/i915: Hold reference to intel_context over life of 
i915_request"
        This reverts commit 1e98d8c52ed5dfbaf273c4423c636525c2ce59e7.
        The problem with this patch is that it makes i915_request to 
hold a
    reference to intel_context, which in turn holds a reference on 
the VM.
    This strong back referencing can lead to reference loops which 
leads
    to resource leak.
        An example is the upcoming VM_BIND work which requires VM to 
hold
    a reference to some shared VM specific BO. But this BO's dma-resv
    fences holds reference to the i915_request thus leading to 
reference
    loop.
        v2:
      Do not use reserved requests for virtual engines

So I don't know what was the leak or was it permanent leak (?!) or not.

Given VM_BIND has been abandoned maybe this could even be 
unreverted. I don't see a problem with holding a reference for the 
request lifetime right now but could be wrong..

unrevert or alternatively hold reference to context only in case of 
virtual engines, as in this case their lifetime is the same?

IMO it is simpler to go for always, especially if we have doubts 
execlists virtual engines might have the same issue just harder to hit 
due the RCU free path. (I have doubts at least.)

It also probably does not make sense to have both 
intel_engine_is_virtual and is_power_of_2 checks in 
i915_fence_release. Since intel_engine_is_virtual will be safe with a 
reference, then is_power_of_2 hack can go. So not a direct un-revert, 
but un-revert with edits/improvements.
Sounds good to me. The 'power of 2 means virtual' thing does sounds 
quite hacky.

First thing though would be to get hold of Niranjana and John to bless 
the whole plan, given they were involved in the original reference 
counting and the revert.
I believe it was actually Matthew Brost that wrote the original 
reference counting patch. I recall that I reviewed it and it looked good 
to me at the time. I was not involved in the revert. I only realised the 
revert had happened because I saw this thread and got confused as to why 
there was a problem at all!

John.

Regards,

Tvrtko

Regards
Andrzej

Regards,

Tvrtko

John.

Or if setting rq->engine to NULL then evaluating the paths which 
derefence it. AFAIR these request retirement races have been 
generally tricky/annoying.

For instance under the i915_request_wait_timeout chain.

1)
__i915_spin_request - could be racy if request gets retired 
between i915_request_wait_timeout/dma_fence_is_signaled check and 
__i915_spin_request dereferencing rq->engine.props?

2)
intel_rps_boost - claims to be safe by serialising via 
i915_request_retire/I915_FENCE_FLAG_BOOST but is it? What prevents 
retire tearing down the engine between:

    if (!test_and_set_bit(I915_FENCE_FLAG_BOOST, &rq->fence.flags)) {

---> HERE - if whole retirement happens here <----

        struct intel_rps *rps = &READ_ONCE(rq->engine)->gt->rps;

3)
__intel_engine_flush_submission - could be racy to? What if the 
whole process of consuming the request by the backend and 
retirement happens between these two lines:

    if (i915_request_is_ready(rq))

---> HERE <---
        __intel_engine_flush_submission(rq->engine, false);

Then "is ready" can be true, but by the 2nd line request retired 
and rq->engine freed/NULL already.

Lets hope I am just making unwarranted panic by being to away from 
this area of the driver for a while. :) But if I am not, then it 
could be that rq->engine is just overall too problematic and 
perhaps we should have a look into making it reference counted by 
the request.

Btw there is rq->i915 but code only uses "rq->engine->i915" which 
way shall we go? remove former or latter?

Simpler/shorter option should be better.

Regards,

Tvrtko