After chatting offline with Matt, it became apparent that I somehow missed
the fact that the CTB processing handler was already running in a work
queue. That said, Matt is correct: I don't need to create a work queue to
extract that capture log into the interim-store. That would also eliminate
the race condition (since the context-reset notification comes in later via
the same queue). Thanks again Matt.

....alan

On Tue, 2021-12-07 at 21:15 -0800, Alan Previn Teres Alexis wrote:
> Thanks Matt for reviewing. Responses to the questions you had are below;
> will fix the rest in the next rev.
> 
> > > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> > >  		return -EPROTO;
> > >  	}
> > > 
> > > -	status = msg[0];
> > > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> > > 
> > > -	/* Add extraction of error capture dump */
> > > +	intel_guc_capture_store_snapshot(guc);
> > 
> > This is done in a different worker, right? How does this not race with an
> > engine reset notification that does an error capture (e.g. the error
> > capture is done before we read out the info from the GuC)?
> > 
> I believe the GuC error-state-capture notification event comes right before
> context resets, not engine resets. Also, the i915_gpu_coredump subsystem
> gathers hw state in response to a context hanging and does so prior to the
> hw reset. Therefore, the engine reset notification doesn't play a role here.
> However, since the context reset notification is expected to come right
> after the error-state-capture notification, your argument is valid in this
> case: you could argue a race condition can exist when the context reset
> event leads to calling the i915_gpu_coredump subsystem (which in turn gets
> a pointer to the intel_guc_capture module). However, even here no actual
> parsing is done yet - I am currently waiting on the actual debugfs call
> before parsing any of the data. As a fix, however, I could add a flush_work
> at the point where that parsing is eventually triggered, even though that
> happens much later?
> 
> > As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> > memory so I don't think we need a worker here.
> > 
> Yes, I am not doing any allocation, but the worker function does lock the
> capture_store's mutex (to ensure we don't trample on the circular-buffer
> pointers of the interim store, the one between the guc-log-buffer and the
> error-capture reporting). I am not sure spin_lock_irqsave would be safe:
> in the event we had back-to-back error captures, we wouldn't want to be
> spinning that long if, coincidentally, the reporting side is actively
> parsing the bytestream output of the same interim buffer.
> 
> After thinking a bit more, a lockless solution is possible: I could
> double-buffer the interim-store circular buffer, so that when the event
> comes in I flip to the next "free" interim-store (one that isn't filled
> with pending logs waiting to be read, or being read). If there is no
> available interim-store (i.e. we've already had two error state captures
> that have yet to be read/flushed), then we just drop the output on the
> floor. (This last part would be in line with the current execlist model...
> unless something changed there in the last two weeks.)
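
For illustration, the lockless flip described in the quoted paragraph above
could look roughly like the sketch below. Every identifier here is made up
for the example (nothing in it exists in the series); a real interim-store
would also carry its circular-buffer pointers and payload.

/* Hypothetical sketch of the double-buffered interim-store flip - none of
 * these names exist in the series. */
#include <linux/atomic.h>
#include <linux/types.h>

#define GUC_CAPTURE_NUM_INTERIM_STORES 2

struct guc_capture_interim_store {
	atomic_t busy;	/* 1 = holds capture data not yet read/flushed */
	/* circular-buffer head/tail pointers and payload would live here */
};

/*
 * Called from the G2H error-capture event path: atomically claim a free
 * interim-store without ever blocking. If both stores still hold unread
 * captures, return NULL and the caller drops the new capture, matching
 * the current execlist behaviour.
 */
static struct guc_capture_interim_store *
guc_capture_claim_store(struct guc_capture_interim_store *stores)
{
	int i;

	for (i = 0; i < GUC_CAPTURE_NUM_INTERIM_STORES; i++)
		if (atomic_cmpxchg(&stores[i].busy, 0, 1) == 0)
			return &stores[i];

	return NULL;
}

/* The reporting/debugfs side releases a store once it has parsed it. */
static void guc_capture_release_store(struct guc_capture_interim_store *store)
{
	atomic_set(&store->busy, 0);
}

The cmpxchg keeps the event path lock-free even if the reporting side is
mid-parse on the other store; the reporting side simply clears busy once it
has drained a store.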
> 
> However, the simplest solution with this series as it stands today would
> be to flush_work much later, at the time the debugfs call comes in to get
> the error-capture report output (since that would be the last chance to
> resolve the possible race condition). I could call the flush_work earlier,
> at the context-reset notification, but that too would be an irq handler
> path.
> 
> > Matt
> > 
> > > 
> > >  	return 0;
> > >  }
> > > --
> > > 2.25.1
> > > 
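
For reference, the "flush before parsing" fallback mentioned in the quoted
text would amount to something like the sketch below if the extraction
stayed in its own worker. The work item and the debugfs show function are
names invented for the example; only flush_work() itself is the stock
kernel API.

#include <linux/seq_file.h>
#include <linux/workqueue.h>

/*
 * Hypothetical sketch: assumes the snapshot extraction still runs from a
 * worker hanging off struct intel_guc (capture_store_work is a made-up
 * member name).
 */
static int guc_error_capture_show(struct seq_file *m, void *data)
{
	struct intel_guc *guc = m->private;

	/*
	 * Last chance to close the race: make sure any pending extraction
	 * has landed in the interim-store before parsing begins.
	 */
	flush_work(&guc->capture_store_work);

	/* ... parse the interim-store bytestream into the report here ... */

	return 0;
}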
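
And the direction settled on at the top of this mail - extracting inline
from the CTB handler, since it already runs from a work queue - would
roughly take the shape below. This reuses the names visible in the quoted
patch; the full function signature and the length check are guesses for the
sketch, not the actual next revision.

int intel_guc_error_capture_process_msg(struct intel_guc *guc,
					const u32 *msg, u32 len)
{
	u32 status;

	/* illustrative sanity check; the real one is in the patch context */
	if (unlikely(len != 1))
		return -EPROTO;

	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
		drm_warn(&guc_to_gt(guc)->i915->drm,
			 "G2H-Error capture no space\n");

	/*
	 * Pull the capture log into the interim-store right here rather than
	 * queueing a second worker: this handler already runs in process
	 * context on the CTB work queue, so the data is in place before the
	 * context-reset notification is handled later on the same queue.
	 */
	intel_guc_capture_store_snapshot(guc);

	return 0;
}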