On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
[snip]
+	/*
+	 * If the context gets closed while the execbuf is ongoing, the context
+	 * close code will race with the below code to cancel the delayed work.
+	 * If the context close wins the race and cancels the work, it will
+	 * immediately call the sched disable (see guc_context_close), so there
+	 * is a chance we can get past this check while the sched_disable code
+	 * is being executed. To make sure that code completes before we check
+	 * the status further down, we wait for the close process to complete.
+	 */
+	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
+		intel_context_sched_disable_unpin(ce);
+	else if (intel_context_is_closed(ce))
+		wait_for(context_close_done(ce), 1);
The comment makes it sound important to handle the race, although it
doesn't really explain the consequences. But most importantly, what if
the close doesn't complete in 1ms?
Will add the consequence (I believe the consequence is that we could
prematurely read the context flags bit indicating its guc-id is still
registered and, after skipping re-registration, find that the context
gets closed and the guc-id freed).
Yes, the 1 second is arbitrary and unnervingly short. Just spent
some time trying to
One millisecond even. :)
Normally 1ms is plenty of time for this. We can only hit the wait if
the context_close call has already called cancel_delayed_work, so the
only thing left to do in that function is to send the H2G, which is
relatively fast. However, I've just realized that if the H2G buffer is
full the H2G will stall, so in that case it can take longer (the maximum
stall time before a hang is declared is 1.5s).
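If that stall window is the worry, one option, sketched here purely for
illustration against the names in the hunk above (the 1500ms figure and
the drm_warn plumbing via ce->engine->i915 are assumptions, not the
actual patch), would be to wait past the CT stall limit and warn on
timeout, since wait_for() returns -ETIMEDOUT if the condition never
becomes true:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		intel_context_sched_disable_unpin(ce);
	else if (intel_context_is_closed(ce) &&
		 wait_for(context_close_done(ce), 1500))	/* > max CT stall */
		drm_warn(&ce->engine->i915->drm,
			 "timed out waiting on context sched close\n");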
What would be the consequence of prematurely reading the still
registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
Something else?
i915 thinks that a request is pending with the GuC, while the GuC thinks
we disabled it (because the completion of the delayed_disable happens
after the new request has been submitted). The heartbeat will still go
through, so no reset will be triggered; the request is just lost. A new
request submission on the same context should be able to recover it, but
we didn't test that.
And why can't this race with context close happen at any other point
than this particular spot in guc_request_alloc? Immediately after the
added checks? After atomic_add_unless?
The race is tied to the cancelling of the delayed work. context_close
only does something if it cancels the delayed work, so if the
cancel_delayed_work_sync here does the cancelling, then context_close is
a no-op.
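To make that mutual exclusion concrete, here is a condensed sketch of
the two racing callers (names taken from the hunk above and from
guc_context_close as traced further down, not new code): both cancel the
same delayed work, and cancel_delayed_work()/cancel_delayed_work_sync()
return true only for the caller that actually dequeued the pending work,
so at most one side performs the sched disable.

	/* execbuf side, guc_request_alloc() */
	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
		intel_context_sched_disable_unpin(ce);	/* this path won the cancel */

	/* close side, guc_context_close(), condensed */
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);	/* close won */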
It's a bit involved* to follow so I'll give up and declare a "whatever" if that's okay with you.
*)

intel_context_close
    set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
    -> guc_context_close
        if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
            cancel_delayed_work(&ce->guc_state.sched_disable_delay))
                __delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
                ((which is:
                    spin_lock_irqsave(&ce->guc_state.lock, flags);
                    ..stuff..
                    unlock
                ))

        spin_lock_irqsave(&ce->guc_state.lock, flags);
        set_context_close_done(ce);
        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
It takes and releases the same lock twice, so I have no idea why twice, and even less whether it is safe and race free.
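Purely to illustrate the question, a hypothetical reshuffle (not
something in the driver, and ignoring whether the disable path can in
fact run under the lock): if the sched disable and the close_done
marking were done under a single hold of guc_state.lock, there would be
no window in which another thread could observe the disable already
issued but the close not yet marked done.

	/* hypothetical, for illustration only */
	static void guc_context_close_locked(struct intel_context *ce)
	{
		unsigned long flags;

		spin_lock_irqsave(&ce->guc_state.lock, flags);
		/* ..body of __delay_sched_disable(), minus its own lock/unlock.. */
		set_context_close_done(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
	}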
Regards,
Tvrtko