Re: [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:

1. the delay between the context reset notification (when the vmas are copied
   and when we verify we had received a guc err capture dump) may be subjectively
   large enough and not tethered that the guc-id may have already been re-assigned.

2. I was really looking for some kind of unique context handle to print out that could
   be correlated (by user inspecting the dump) back to a unique app or process or
   context-id but cant find such a param in struct intel_context.

As part of further reviewing the end to end flows and possible error scenarios, there
also may potentially be a mismatch between "which context was reset by guc at time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time later.

(Recap: First, guc notifies capture event, second, guc notifies context reset during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
Third step is when user triggers the debugfs to dump which is when we parse the error
capture logs.)

As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
for the very first context that was reset. And with guc-submission, that would always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
logs are large enough to hold data for multiple dumps).

The changes (removal of below-hunk and adding of only-print-the-first-vma") is trivial
but i felt it warranted a good explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> Thanks again for the detailed review here.
> Will fix all the rest on next rev.
> One special response for this one:
> 
> 
> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
> > > +				if (eng) {
> > > +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
> > > +				} else {
> > > +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
> > > +				}
> > > +				ce = guc_context_lookup(guc, data.guc_ctx_id);
> > 
> > You are going to need to reference count the 'ce' here. See
> > intel_guc_context_reset_process_msg for an example. 
> > 
> 
> Oh crap - i missed this one - which you had explicitly mentioned offline when i was doing the
> development. Sorry about that i just totally missed it from my todo-notes.
> 
> ...alan





[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux