Re: [PATCH v5 07/10] drm/i915/guc: Extract GuC error capture lists on G2H notification.

"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@xxxxxxxxx> · Thu, 24 Feb 2022 23:21:19 -0800

Folks, a brief update: over the last few weeks of internal reviews,
testing and debug, another redesign has been implemented for this
patch (the extraction of error capture information). When
experiencing back to back error capture notifications (as part of
multiple dependent engine resets), if a forced full GT reset comes
in at the same time (from either the heartbeat or the user forced
reset debugfs or the igt_core cleanup) GT reset code takes the reset
lock and later calls guc_submission_reset_prepare. This function
flushes the ct processing worker queue for handling G2H events.
intel_guc_capture gets called to extract new error capture data from
the guc-log buffer but is actually in the midst of a reset ..
this causes lockdep issues (the memory shrinker vs reset locks).
Checking for uc->reset_in_progress is racy. That said, the
extraction code (this patch) needs to be modified to never allocate
memory for the output 'engine-instance-capture' node. Redesign is
complete where a pool of blank nodes are allocated up front and
re-used through the life of the driver. That will be part of the
next rev.

..alan

On 2/17/2022 11:21 AM, Umesh Nerlige Ramappa wrote:
On Sun, Feb 13, 2022 at 11:47:00AM -0800, Teres Alexis, Alan Previn 
wrote:
Thanks Umesh for reviewing the patch.
Am fixing all the rest but a couple of comments.
Responses to the latter and other questions below:

...alan

> +enum intel_guc_state_capture_event_status {
> +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_SUCCESS = 0x0,
> +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE = 0x1,
> +};
> +
> +#define INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK 0x1

MASK is not needed. See below

Alan: Oh wait, actually the mask for the capture status is 0x000000FF
(above is a typo). I'll fix above  mask and shall not change the
code below because the upper 24 bits of the first dword of this msg
is not defined.

...

> +static int guc_capture_buf_cnt(struct __guc_capture_bufstate *buf)
> +{
> +   if (buf->rd == buf->wr)
> +           return 0;
> +   if (buf->wr > buf->rd)
> +           return (buf->wr - buf->rd);
> +   return (buf->size - buf->rd) + buf->wr;
> +}

Is this a circular buffer shared between GuC and kmd? Since the size is
a power of 2, the above function is simply:

Alan: not this is not a circular buffer, so I'll keep the above
version.
static u32 guc_capture_buf_count(struct __guc_capture_bufstate *buf)
{
      return (buf->wr - buf->rd) & (buf->size - 1);
}

...

> +static int
> +guc_capture_log_remove_dw(struct intel_guc *guc, struct 
__guc_capture_bufstate *buf,
> +                     u32 *dw)
> +{
> +   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +   int tries = 2;
> +   int avail = 0;
> +   u32 *src_data;
> +
> +   if (!guc_capture_buf_cnt(buf))
> +           return 0;
> +
> +   while (tries--) {
> +           avail = guc_capture_buf_cnt_to_end(buf);

Shouldn't this be avail = guc_capture_buf_cnt(buf)?

Alan : The "guc_capture_log_get_[foo]" functions only call above
guc_capture_log_remove_dw when there isnt sufficient space to
copy out an entire structure from the space between the read pointer
and the end of the subregion (before the wrap-around). Those function
would populate the structure dword by dword by calling above func.
(NOTE the buffer and all error capture output structs are dword
aligned). Thats why above function tries twice and resets buf->rd = 0
if we find no space left at the end of the subregion (i.e. need to
wrap around) - which can only be done by calling
"guc_capture_buf_cnt_to_end".

...

> +
> +   /* Bookkeeping stuff */
> +   guc->log_state[GUC_CAPTURE_LOG_BUFFER].flush += 
log_buf_state_local.flush_to_file;
> +   new_overflow = intel_guc_check_log_buf_overflow(guc,
> + &guc->log_state[GUC_CAPTURE_LOG_BUFFER],
> + full_count);

I am not sure how the overflow logic works here and whether it is
applicable to the error capture buffer. Is the guc log buffer one big
buffer where the error capture is just a portion of that buffer? If so,
is the wrap around applicable to just the errorcapture buffer or to the
whole buffer?

Alan: Yes, the guc log buffer is one big log buffer but there are 3 
independent
subregions within that are populated with different content and are used
in different ways and timings. Each guc-log subregion (general-logs,
crash-dump and error-capture) has it's own read and write pointers.

got it. I would also put this one detail in the commit message since 
it's not quickly inferred.

Also what is the wrap_offset field in struct guc_log_buffer_state?

Alan: This is the byte offset of a location in the subregion that is 
the 1st byte
after the last valid guc entry written by Guc firmware before a 
wraparound
was done. This would generate a tiny hole at the end of the subregion 
for better
cacheline alignment when flushing entries into the subregion. However,
the error-capture subregion is dword aligned and all of the output 
structures
used for error-capture are also dword aligned so this can never 
happen for the
error-capture subregion.

Makes sense, thanks for clarifying.

Umesh

.