On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
[snip]
+	/*
+	 * If the context gets closed while the execbuf is ongoing, the context
+	 * close code will race with the below code to cancel the delayed work.
+	 * If the context close wins the race and cancels the work, it will
+	 * immediately call the sched disable (see guc_context_close), so there
+	 * is a chance we can get past this check while the sched_disable code
+	 * is being executed. To make sure that code completes before we check
+	 * the status further down, we wait for the close process to complete.
+	 */
+	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
+		intel_context_sched_disable_unpin(ce);
+	else if (intel_context_is_closed(ce))
+		wait_for(context_close_done(ce), 1);
The comment makes it sound important to handle the race, although it
doesn't really explain the consequences. But most importantly, what if
the close doesn't complete in 1ms?
Will add the consequence (I believe the consequence is that we could
prematurely read the context flags bit indicating its guc-id is still
registered and, after skipping re-registration, find that the context
gets closed and the guc-id freed).
Yes, the 1 second is arbitrary and unnervingly short. Just spent
some time trying to
One millisecond even. :)
Normally 1ms is plenty of time for this. We can only hit the wait if
the context_close call has already called cancel_delayed_work, so the
only thing left to do in that function is to send the H2G, which is
relatively fast. However, I've just realized that if the H2G buffer is
full the H2G will stall, so in that case it can take longer (the maximum
stall time before a hang is declared is 1.5s).
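If that stall window is the worry, one option, sketched here purely for
illustration against the names in the hunk above (the 1500ms figure and
the drm_warn plumbing via ce->engine->i915 are assumptions, not the
actual patch), would be to wait past the CT stall limit and warn on
timeout, since wait_for() returns -ETIMEDOUT if the condition never
becomes true:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		intel_context_sched_disable_unpin(ce);
	else if (intel_context_is_closed(ce) &&
		 wait_for(context_close_done(ce), 1500))	/* > max CT stall */
		drm_warn(&ce->engine->i915->drm,
			 "timed out waiting on context sched close\n");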
What would be the consequence of prematurely reading the still
registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
Something else?
i915 thinks that a request is pending with the GuC, while the GuC thinks
we disabled it (because the completion of the delayed_disable happens
after the new request has been submitted). The heartbeat will still go
through, so no reset will be triggered; the request is just lost. A new
request submission on the same context should be able to recover it, but
we didn't test that.
And why can't this race with context close happen at any other point
than this particular spot in guc_request_alloc? Immediately after the
added checks? After atomic_add_unless?
The race is tied to the cancelling of the delayed work. context_close
only does something if it cancels the delayed work, so if the
cancel_delayed_work_sync here does the cancelling, then context_close is
a no-op.
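To make that mutual exclusion concrete, here is a condensed sketch of
the two racing callers (names taken from the hunk above and from
guc_context_close as traced further down, not new code): both cancel the
same delayed work, and cancel_delayed_work()/cancel_delayed_work_sync()
return true only for the caller that actually dequeued the pending work,
so at most one side performs the sched disable.

	/* execbuf side, guc_request_alloc() */
	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		intel_context_sched_disable_unpin(ce);	/* this path won the cancel */

	/* close side, guc_context_close(), condensed */
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);	/* close won */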
It's a bit involved* to follow so I'll give up and declare a "whatever" if that's okay with you.
*)

intel_context_close
    set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
    -> guc_context_close
        if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
            cancel_delayed_work(&ce->guc_state.sched_disable_delay))
                __delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
                ((which is:
                    spin_lock_irqsave(&ce->guc_state.lock, flags);
                    ..stuff..
                    unlock
                ))

        spin_lock_irqsave(&ce->guc_state.lock, flags);
        set_context_close_done(ce);
        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
It takes and releases the same lock twice, so I have no idea why twice, and even less whether it is safe and race free.
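Purely to illustrate the question, a hypothetical reshuffle (not
something in the driver, and ignoring whether the disable path can in
fact run under the lock): if the sched disable and the close_done
marking were done under a single hold of guc_state.lock, there would be
no window in which another thread could observe the disable already
issued but the close not yet marked done.

	/* hypothetical, for illustration only */
	static void guc_context_close_locked(struct intel_context *ce)
	{
		unsigned long flags;

		spin_lock_irqsave(&ce->guc_state.lock, flags);
		/* ..body of __delay_sched_disable(), minus its own lock/unlock.. */
		set_context_close_done(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
	}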
Regards,
Tvrtko