On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> > > > The flow of events is as below:
> > > >
> > > > 1. guc sends notification that an error capture was done and is ready to take.
> > > >    - at this point we copy the guc error-capture dump into an interim store
> > > >      (a larger buffer that can hold multiple captures).
> > > > 2. guc sends notification that a context was reset (after the prior)
> > > >    - this triggers a call to i915_gpu_coredump with the corresponding engine-mask
> > > >      from the context that was reset
> > > >    - i915_gpu_coredump proceeds to gather the entire gpu state including driver state,
> > > >      global gpu state, engine state, context vmas and also engine registers. For the
> > > >      engine registers we now call into the guc_capture code, which merely needs to
> > > >      verify that GuC has already done step 1 and we have data ready to be parsed.
> > >
> > > What about the time between the actual reset and receiving the context
> > > reset notification? The latter will contain intel_context->guc_id - can that
> > > be re-assigned or "retired" in between the two and so cause problems for
> > > matching the correct (or any) vmas?
> > >
> > No it cannot, because it is only after the context reset notification that i915
> > starts taking action against that context - and even that happens after the
> > i915_gpu_coredump (engine-mask-of-context) call. That's what I've observed in
> > the code flow.
>
> The fact it is "only after" is exactly why I asked.
>
> Reset notification is in a CT queue with other stuff, right? So it can be
> some unrelated time after the actual reset. Could the context be
> retired in the meantime and the guc_id released, is the question.
>
> Because i915 has no idea there was a reset until this delayed message
> comes over, but it could see the user interrupt signaling end of batch,
> after the reset has happened, unbeknown to i915, right?
>
> Perhaps the answer is guc_id cannot be released via the request retire
> flows. Or GuC signaling release of guc_id is a thing, which is then
> ordered via the same CT buffer.
>
> I don't know, just asking.
>

As long as the context is pinned, the guc-id won't be re-assigned. After a bit of
offline brain-dump from John Harrison: there are many factors that can keep a
context pinned (refcounts), including new or outstanding requests. So a guc-id
can't get re-assigned between a capture-notify and a context-reset, even if an
outstanding request is the only refcount left, since that request would still be
considered outstanding by the driver.

I also think we may be talking past each other, in the sense that the guc-id is
something the driver assigns to a context being pinned, and only the driver can
un-assign it (both assigning and un-assigning happen via H2G interactions). I get
the sense you are assuming the GuC can un-assign guc-ids on its own - which isn't
the case. Apologies if I mis-assumed.

> Regards,
>
> Tvrtko