Re: [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:


On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:

On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:

[snip]


+    /*
+     * If the context gets closed while the execbuf is ongoing, the context
+     * close code will race with the below code to cancel the delayed work.
+     * If the context close wins the race and cancels the work, it will
+     * immediately call the sched disable (see guc_context_close), so there
+     * is a chance we can get past this check while the sched_disable code
+     * is being executed. To make sure that code completes before we check
+     * the status further down, we wait for the close process to complete.
+     */
+    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
+        intel_context_sched_disable_unpin(ce);
+    else if (intel_context_is_closed(ce))
+        wait_for(context_close_done(ce), 1);

The comment makes it sound important to handle the race, although it doesn't
really explain the consequences. But most importantly, what if close doesn't
complete in 1ms?

Will add the consequence (I believe the consequence is that we could
prematurely read the context flags bit indicating its guc_id is still
registered and, after skipping re-registration, find that the context gets
closed and the guc_id freed).

Yes, the 1 second is arbitrary and unnervingly short. Just spent some time trying to

One millisecond even. :)

Normally 1ms is enough time for this. We can only hit the wait if the context_close call has already cancelled the delayed work, so the only thing left to do in that function is to send the H2G, which is relatively fast. However, I've just realized that if the H2G buffer is full the H2G will stall, so in that case it can take longer (the maximum stall time before a hang is declared is 1.5s).
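
Purely as an illustrative sketch (not something proposed in this thread), the outcome of the bounded wait could be surfaced by checking wait_for()'s return value, which is non-zero on timeout; what to do in the timed-out branch is left as a placeholder:

	else if (intel_context_is_closed(ce)) {
		/*
		 * Sketch only: detect the case where the close path did not
		 * finish within the bounded wait (e.g. a stalled H2G).
		 */
		if (wait_for(context_close_done(ce), 1)) {
			/* timed out: decide whether to log, retry or bail */
		}
	}

This keeps the same 1ms bound; whether a longer timeout or an explicit error path is warranted is exactly the open question here.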


What would be the consequence of prematurely reading the still registered context flag? Execbuf fails? GuC hangs? Kernel crashes? Something else?

i915 thinks that a request is pending with the GuC, while the GuC thinks we disabled it (because the completion of the delayed_disable happens after the new request has been submitted). The heartbeat will still go through, so no reset will be triggered; the request is just lost. A new request submission on the same context should be able to recover it, but we didn't test that.



And why can't this race with context close happen at any other point than this particular spot in guc_request_alloc? Immediately after the added checks? After atomic_add_unless?

The race is tied to the canceling of the delayed work. context_close only does something if it cancels the delayed work, so if the cancel_delayed_work_sync here does the cancelling then context_close is a no-op.
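
To make the "whoever cancels the delayed work performs the disable" pattern explicit, here is a minimal, self-contained sketch of the same idea using made-up demo_* names (this is not the i915 code): cancel_delayed_work()/cancel_delayed_work_sync() return true for at most one caller while the work is pending, so exactly one of the two paths ends up doing the teardown.

	#include <linux/kernel.h>
	#include <linux/spinlock.h>
	#include <linux/workqueue.h>

	struct demo_ctx {
		struct delayed_work disable_work;	/* stands in for sched_disable_delay */
		spinlock_t lock;			/* stands in for guc_state.lock */
		bool close_done;
	};

	/* Deferred teardown, normally run from the workqueue. */
	static void demo_disable_work_fn(struct work_struct *w)
	{
		struct demo_ctx *ctx = container_of(to_delayed_work(w),
						    struct demo_ctx, disable_work);
		unsigned long flags;

		spin_lock_irqsave(&ctx->lock, flags);
		/* ... send the sched-disable message here ... */
		spin_unlock_irqrestore(&ctx->lock, flags);
	}

	/* Close path: only acts if it is the caller that cancels the work. */
	static void demo_close(struct demo_ctx *ctx)
	{
		unsigned long flags;

		if (cancel_delayed_work(&ctx->disable_work))
			demo_disable_work_fn(&ctx->disable_work.work);

		spin_lock_irqsave(&ctx->lock, flags);
		ctx->close_done = true;
		spin_unlock_irqrestore(&ctx->lock, flags);
	}

	/* Submission path: if this cancel wins, demo_close() no longer has
	 * anything to cancel and skips the disable entirely. */
	static void demo_request_alloc(struct demo_ctx *ctx)
	{
		if (cancel_delayed_work_sync(&ctx->disable_work)) {
			/* we cancelled it: do the disable/unpin here */
		} else {
			/* close may be mid-flight: wait for ctx->close_done
			 * before trusting the context state */
		}
	}

The key property is that the cancel functions only return true if the work was still queued, so the disable runs exactly once no matter which side gets there first; the close_done flag is what the submission path then waits on when it loses the cancel race.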

It's a bit involved* to follow so I'll give up and declare a "whatever" if that's okay with you.

*)
intel_context_close
  set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
  ->guc_context_close
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
		  ((which is:
			spin_lock_irqsave(&ce->guc_state.lock, flags);
			..stuff..
			unlock
		  ))
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	set_context_close_done(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

It takes and releases the same lock twice, so I have no idea why it is done that way, and even less so whether it is safe and race free.

Regards,

Tvrtko


