On 11.07.2023 13:34, Andi Shyti wrote:
Hi Andrzej,
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d246..2c877ea5eda6f0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3461,6 +3461,8 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
static void remove_from_context(struct i915_request *rq)
{
struct intel_context *ce = request_to_scheduling_context(rq);
+ struct intel_engine_cs *engine;
+ intel_engine_mask_t tmp;
GEM_BUG_ON(intel_context_is_child(ce));
@@ -3478,6 +3480,15 @@ static void remove_from_context(struct i915_request *rq)
atomic_dec(&ce->guc_id.ref);
i915_request_notify_execute_cb_imm(rq);
+
+ /*
+ * GuC virtual engine can disappear after this call, so let's assign
+ * something valid, as driver expects this to be always valid pointer.
+ */
+ for_each_engine_masked(engine, rq->engine->gt, rq->execution_mask, tmp) {
+ rq->engine = engine;
yes... here the context might lose the virtual engine... I wonder
whether this is the rigth solution, though. Maybe we should set
rq->engine = NULL; and check for NULL? Don't know.
Setting NULL causes occasional null page de-reference in
i915_request_wait_timeout:
mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_)
rq->engine after removing rq from context is (IMHO) used as a set of aliases
for gt and i915 (despite rq itself contains the alias to i915).
without investigating further, but maybe that code is not even
supposed to be executed, at this point, if the request's assigned
virtual engine is removed.
Real tests show it is executed and the function
i915_request_wait_timeout is quite generic
I guess it is quite typical use-case, the only question is about timings
- what happens earlier -
finalization of i915_request_wait_timeout or context removal.
The other point rq->engine is accessed after context removal is
i915_fence_release -
there is long comment there regarding virtual context and reuse retired rq.
Anyway calling there "intel_engine_is_virtual(rq->engine)" is risky
without this patch and KASAN complains clearly about it:
http://gfx-ci.igk.intel.com/tree/drm-tip/kasan.html?testfilter=gem_exec_balancer
Regards
Andrzej
Andi