Thanks Matt for reviewing. Responses to your questions below; I will fix the
rest in the next rev.

> > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> >  		return -EPROTO;
> >  	}
> >
> > -	status = msg[0];
> > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> >
> > -	/* Add extraction of error capture dump */
> > +	intel_guc_capture_store_snapshot(guc);
>
> This is done in different worker, right? How does this not race with an
> engine reset notification that does an error capture (e.g. the error
> capture is done before we read out the info from the GuC)?
>

I believe the GuC error state capture notification event comes right before a
context reset, not an engine reset. Also, the i915_gpu_coredump subsystem
gathers HW state in response to a context hang and does so prior to the HW
reset, so the engine reset notification doesn't play a role here.

However, since the context reset notification is expected to come right after
the error state capture notification, your argument is valid in that case: a
race can exist when the context reset event leads to calling into the
i915_gpu_coredump subsystem (which in turn gets a pointer into the
intel_guc_capture module). Even there, though, no actual parsing is done yet -
I currently wait for the debugfs call before parsing any of the data. As a
fix, how about I add a flush_work at the point where the parsing actually
happens, even later?

> As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> memory so I don't think we need a worker here.
>

Correct, I am not doing any allocation, but the worker function does take the
capture_store's mutex to ensure we don't trample on the circular-buffer
pointers of the interim store (the one between the guc-log-buffer and the
error-capture reporting). I am not sure spin_lock_irqsave would be safe here:
with back-to-back error captures we wouldn't want to spin that long if the
reporting side happens to be actively parsing the bytestream output of the
same interim buffer.

After thinking about it a bit more, a lockless solution is possible: I could
double-buffer the interim-store circular buffer, so that when the event comes
in I flip to the next "free" interim store (one that isn't filled with pending
logs waiting to be read or currently being read). If no interim store is
available (i.e. we have already had two error state captures that have yet to
be read/flushed), we just drop the output on the floor. (That last part would
be in line with the current execlist model, unless something changed there in
the last two weeks.) A rough sketch of what I mean is appended at the bottom
of this mail.

However, the simplest solution within this series today would be to flush_work
much later, at the time debugfs is read to get the error capture report (since
that is the last chance to resolve the possible race). I could call flush_work
earlier, at the context-reset notification, but that too is an IRQ handler
path.

> Matt
>
> >
> >  	return 0;
> >  }
> > --
> > 2.25.1
> >
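
To make the lockless idea above a bit more concrete, here is a very rough
sketch of the double-buffered flip I have in mind. All names below
(guc_capture_interim_store, guc_capture_claim_store, etc.) are made up for
illustration and are not part of this series; it only shows the event path
claiming a free store with an atomic cmpxchg and dropping the capture when
both stores are still pending a read.

#include <linux/atomic.h>
#include <linux/types.h>

#define GUC_CAPTURE_NUM_STORES	2

struct guc_capture_interim_store {
	/* bytestream copied out of the guc-log-buffer, waiting to be parsed */
	u8 *buf;
	size_t len;
	/* non-zero while this store holds unread capture data */
	atomic_t busy;
};

struct guc_capture_state {
	struct guc_capture_interim_store stores[GUC_CAPTURE_NUM_STORES];
	/* index of the store the event path will try to fill next */
	atomic_t next;
};

/* Event (G2H) path: claim a free interim store or drop the capture. */
static struct guc_capture_interim_store *
guc_capture_claim_store(struct guc_capture_state *cs)
{
	int i;

	for (i = 0; i < GUC_CAPTURE_NUM_STORES; i++) {
		int idx = (atomic_read(&cs->next) + i) % GUC_CAPTURE_NUM_STORES;
		struct guc_capture_interim_store *store = &cs->stores[idx];

		/* take ownership only if no unread data is pending in it */
		if (atomic_cmpxchg(&store->busy, 0, 1) == 0) {
			atomic_set(&cs->next, (idx + 1) % GUC_CAPTURE_NUM_STORES);
			return store;
		}
	}

	/* both stores still pending a read: drop this capture on the floor */
	return NULL;
}

/* Reporting (debugfs) path: release the store once it has been parsed. */
static void guc_capture_release_store(struct guc_capture_interim_store *store)
{
	store->len = 0;
	atomic_set(&store->busy, 0);
}

(A real implementation would still need a "data ready" indication and proper
memory ordering between the fill side and the read side, but hopefully this
shows the buffer flip and drop-on-overflow behaviour I was describing.)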