Thanks Matt for reviewing. Responses to your questions below; I will fix the
rest in the next rev.

> > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> >  		return -EPROTO;
> >  	}
> >
> > -	status = msg[0];
> > -	drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > +	status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > +	if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > +		drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> >
> > -	/* Add extraction of error capture dump */
> > +	intel_guc_capture_store_snapshot(guc);
>
> This is done in different worker, right? How does this not race with an
> engine reset notification that does an error capture (e.g. the error
> capture is done before we read out the info from the GuC)?
>

I believe the GuC error state capture notification event comes right before a
context reset, not an engine reset. Also, the i915_gpu_coredump subsystem
gathers HW state in response to a context hang and does so prior to the HW
reset, so the engine reset notification doesn't play a role here.

However, since the context reset notification is expected to come right after
the error state capture notification, your argument is valid in that case: a
race can exist when the context reset event leads to calling into the
i915_gpu_coredump subsystem (which in turn gets a pointer into the
intel_guc_capture module). Even there, though, no actual parsing is done yet -
I currently wait for the debugfs call before parsing any of the data. As a
fix, how about I add a flush_work at the point where the parsing actually
happens, even later?

> As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> memory so I don't think we need a worker here.
>

Correct, I am not doing any allocation, but the worker function does take the
capture_store's mutex to ensure we don't trample on the circular-buffer
pointers of the interim store (the one between the guc-log-buffer and the
error-capture reporting). I am not sure spin_lock_irqsave would be safe here:
with back-to-back error captures we wouldn't want to spin that long if the
reporting side happens to be actively parsing the bytestream output of the
same interim buffer.

After thinking about it a bit more, a lockless solution is possible: I could
double-buffer the interim-store circular buffer, so that when the event comes
in I flip to the next "free" interim store (one that isn't filled with pending
logs waiting to be read or currently being read). If no interim store is
available (i.e. we have already had two error state captures that have yet to
be read/flushed), we just drop the output on the floor. (That last part would
be in line with the current execlist model, unless something changed there in
the last two weeks.) A rough sketch of what I mean is appended at the bottom
of this mail.

However, the simplest solution within this series today would be to flush_work
much later, at the time debugfs is read to get the error capture report (since
that is the last chance to resolve the possible race). I could call flush_work
earlier, at the context-reset notification, but that too is an IRQ handler
path.

> Matt
>
> >
> >  	return 0;
> >  }
> > --
> > 2.25.1
> >
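
To make the lockless idea above a bit more concrete, here is a very rough
sketch of the double-buffered flip I have in mind. All names below
(guc_capture_interim_store, guc_capture_claim_store, etc.) are made up for
illustration and are not part of this series; it only shows the event path
claiming a free store with an atomic cmpxchg and dropping the capture when
both stores are still pending a read.

#include <linux/atomic.h>
#include <linux/types.h>

#define GUC_CAPTURE_NUM_STORES	2

struct guc_capture_interim_store {
	/* bytestream copied out of the guc-log-buffer, waiting to be parsed */
	u8 *buf;
	size_t len;
	/* non-zero while this store holds unread capture data */
	atomic_t busy;
};

struct guc_capture_state {
	struct guc_capture_interim_store stores[GUC_CAPTURE_NUM_STORES];
	/* index of the store the event path will try to fill next */
	atomic_t next;
};

/* Event (G2H) path: claim a free interim store or drop the capture. */
static struct guc_capture_interim_store *
guc_capture_claim_store(struct guc_capture_state *cs)
{
	int i;

	for (i = 0; i < GUC_CAPTURE_NUM_STORES; i++) {
		int idx = (atomic_read(&cs->next) + i) % GUC_CAPTURE_NUM_STORES;
		struct guc_capture_interim_store *store = &cs->stores[idx];

		/* take ownership only if no unread data is pending in it */
		if (atomic_cmpxchg(&store->busy, 0, 1) == 0) {
			atomic_set(&cs->next, (idx + 1) % GUC_CAPTURE_NUM_STORES);
			return store;
		}
	}

	/* both stores still pending a read: drop this capture on the floor */
	return NULL;
}

/* Reporting (debugfs) path: release the store once it has been parsed. */
static void guc_capture_release_store(struct guc_capture_interim_store *store)
{
	store->len = 0;
	atomic_set(&store->busy, 0);
}

(A real implementation would still need a "data ready" indication and proper
memory ordering between the fill side and the read side, but hopefully this
shows the buffer flip and drop-on-overflow behaviour I was describing.)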