Re: [PATCH v5 07/10] drm/i915/guc: Extract GuC error capture lists on G2H notification.

Umesh Nerlige Ramappa <umesh.nerlige.ramappa@xxxxxxxxx> · Thu, 17 Feb 2022 11:21:39 -0800

On Sun, Feb 13, 2022 at 11:47:00AM -0800, Teres Alexis, Alan Previn wrote:
Thanks Umesh for reviewing the patch.
Am fixing all the rest but a couple of comments.
Responses to the latter and other questions below:

...alan

> +enum intel_guc_state_capture_event_status {
> +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_SUCCESS = 0x0,
> +   INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE = 0x1,
> +};
> +
> +#define INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK      0x1

MASK is not needed. See below

Alan: Oh wait, actually the mask for the capture status is 0x000000FF
(above is a typo). I'll fix above  mask and shall not change the
code below because the upper 24 bits of the first dword of this msg
is not defined.

...

> +static int guc_capture_buf_cnt(struct __guc_capture_bufstate *buf)
> +{
> +   if (buf->rd == buf->wr)
> +           return 0;
> +   if (buf->wr > buf->rd)
> +           return (buf->wr - buf->rd);
> +   return (buf->size - buf->rd) + buf->wr;
> +}

Is this a circular buffer shared between GuC and kmd? Since the size is
a power of 2, the above function is simply:

Alan: not this is not a circular buffer, so I'll keep the above
version.
static u32 guc_capture_buf_count(struct __guc_capture_bufstate *buf)
{
      return (buf->wr - buf->rd) & (buf->size - 1);
}

...

> +static int
> +guc_capture_log_remove_dw(struct intel_guc *guc, struct __guc_capture_bufstate *buf,
> +                     u32 *dw)
> +{
> +   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +   int tries = 2;
> +   int avail = 0;
> +   u32 *src_data;
> +
> +   if (!guc_capture_buf_cnt(buf))
> +           return 0;
> +
> +   while (tries--) {
> +           avail = guc_capture_buf_cnt_to_end(buf);

Shouldn't this be avail = guc_capture_buf_cnt(buf)?

Alan : The "guc_capture_log_get_[foo]" functions only call above
guc_capture_log_remove_dw when there isnt sufficient space to
copy out an entire structure from the space between the read pointer
and the end of the subregion (before the wrap-around). Those function
would populate the structure dword by dword by calling above func.
(NOTE the buffer and all error capture output structs are dword
aligned). Thats why above function tries twice and resets buf->rd = 0
if we find no space left at the end of the subregion (i.e. need to
wrap around) - which can only be done by calling
"guc_capture_buf_cnt_to_end".

...

> +
> +   /* Bookkeeping stuff */
> +   guc->log_state[GUC_CAPTURE_LOG_BUFFER].flush += log_buf_state_local.flush_to_file;
> +   new_overflow = intel_guc_check_log_buf_overflow(guc,
> +                                                   &guc->log_state[GUC_CAPTURE_LOG_BUFFER],
> +                                                   full_count);

I am not sure how the overflow logic works here and whether it is
applicable to the error capture buffer. Is the guc log buffer one big
buffer where the error capture is just a portion of that buffer? If so,
is the wrap around applicable to just the errorcapture buffer or to the
whole buffer?

Alan: Yes, the guc log buffer is one big log buffer but there are 3 independent
subregions within that are populated with different content and are used
in different ways and timings. Each guc-log subregion (general-logs,
crash-dump and error-capture) has it's own read and write pointers.

got it. I would also put this one detail in the commit message since 
it's not quickly inferred.

Also what is the wrap_offset field in struct guc_log_buffer_state?

Alan: This is the byte offset of a location in the subregion that is the 1st byte
after the last valid guc entry written by Guc firmware before a wraparound
was done. This would generate a tiny hole at the end of the subregion for better
cacheline alignment when flushing entries into the subregion. However,
the error-capture subregion is dword aligned and all of the output structures
used for error-capture are also dword aligned so this can never happen for the
error-capture subregion.

Makes sense, thanks for clarifying.

Umesh