On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> >
alan: [snip]
> > > > > + /*
> > > > > +  * If the context gets closed while the execbuf is ongoing, the
> > > > > +  * context close code will race with the below code to cancel the
> > > > > +  * delayed work. If the context close wins the race and cancels the
> > > > > +  * work, it will immediately call the sched disable (see
> > > > > +  * guc_context_close), so there is a chance we can get past this
> > > > > +  * check while the sched_disable code is being executed. To make
> > > > > +  * sure that code completes before we check the status further down,
> > > > > +  * we wait for the close process to complete.
> > > > > +  */
> > > > > + if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > +         intel_context_sched_disable_unpin(ce);
> > > > > + else if (intel_context_is_closed(ce))
> > > > > +         wait_for(context_close_done(ce), 1);
> > > >
> > > > The comment makes it sound important to handle the race, although it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > >
> > > Will add the consequence (I believe the consequence is that we could
> > > prematurely read the context flags bit indicating its guc-id is still
> > > registered and, after skipping re-registration, find that the context
> > > gets closed and the guc-id freed).
> > >
> > > Yes, the 1 second is arbitrary and unnervingly short. Just spent some
> > > time trying to
> >
> > One millisecond even. :)
>
> Normally 1ms is not a slow time for this. We can only hit the wait if
> the context_close call has already called the cancel_delayed_work, so
> the only thing left to do in that function is to send the H2G, which is
> relatively fast. However, I've just realized that if the H2G buffer is
> full the H2G will stall, so in that case it can take longer (maximum
> stall time before a hang is declared is 1.5s).
>
alan: I'll increase it to 1.5 secs.

> > What would be the consequence of prematurely reading the still
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes?
> > Something else?
>
> i915 thinks that a request is pending with the GuC, while GuC thinks we
> disabled it (because the completion of the delayed_disable happens after
> the new request has been submitted). The heartbeat will still go
> through, so no reset will be triggered; the request is just lost. A new
> request submission on the same context should be able to recover it, but
> we didn't test that.
>
> > And why can't this race with context close happen at any other point
> > than this particular spot in guc_request_alloc? Immediately after the
> > added checks? After atomic_add_unless?
>
> The race is tied to the canceling of the delayed work. context_close
> only does something if it cancels the delayed work, so if the
> cancel_delayed_work_sync here does the cancelling then context_close is
> a no-op.
>
alan: Then I'll add a warn - especially if, after a 1.5 sec delay, we are still waiting for that close.
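
A minimal sketch of what that follow-up could look like (a 1.5s wait plus a warning on timeout), assuming the helpers named in the quoted patch (context_close_done(), intel_context_sched_disable_unpin(), the sched_disable_delay worker) and using drm_warn() through ce->engine->i915 purely as illustrative plumbing; the actual follow-up patch may wire the warning differently:

    /*
     * Sketch only: bump the close wait from 1ms to 1500ms (the maximum
     * H2G stall time before a hang is declared, per the discussion above)
     * and warn if the close has still not completed by then.
     */
    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay)) {
            intel_context_sched_disable_unpin(ce);
    } else if (intel_context_is_closed(ce) &&
               wait_for(context_close_done(ce), 1500)) {
            /* the warn target below is an assumption; any i915 warn helper works */
            drm_warn(&ce->engine->i915->drm,
                     "timed out waiting on context sched close before realloc\n");
            /* proceed anyway; worst case the request is lost, as described above */
    }

Folding the wait into the else-if condition keeps the warning on the timeout path only; the quoted patch leaves the wait as a bare statement, so this restructuring is likewise just a sketch rather than the authoritative fix.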