Re: [PATCH] i915/perf: Avoid reading OA reports before they land

Umesh Nerlige Ramappa <umesh.nerlige.ramappa@xxxxxxxxx> · Mon, 22 May 2023 15:08:42 -0700

On Mon, May 22, 2023 at 01:20:12PM -0700, Dixit, Ashutosh wrote:
On Fri, 19 May 2023 15:56:42 -0700, Umesh Nerlige Ramappa wrote:


Hi Umesh,

On DG2, capturing OA reports while running heavy render workloads
sometimes results in invalid OA reports where 64-byte chunks inside
reports have stale values. Under memory pressure, high OA sampling rates
(13.3 us) and heavy render workload, occassionally, the OA HW TAIL
pointer does not progress as fast as the sampling rate. When these
glitches occur, the TAIL pointer takes approx. 200us to progress.  While
this is expected behavior from the HW perspective, invalid reports are
not expected.

In oa_buffer_check_unlocked(), when we execute the if condition, we are
updating the oa_buffer.tail to the aging tail and then setting pollin
based on this tail value, however, we do not have a chance to rewind and
validate the reports prior to setting pollin. The validation happens
in a subsequent call to oa_buffer_check_unlocked(). If a read occurs
before this validation, then we end up reading reports up until this
oa_buffer.tail value which includes invalid reports. Though found on
DG2, this affects all platforms.

Set the pollin only in the else condition in oa_buffer_check_unlocked.

Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7484
Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7757
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@xxxxxxxxx>
---
 drivers/gpu/drm/i915/i915_perf.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 19d5652300ee..61536e3c4ac9 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -545,7 +545,7 @@ static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
	u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
	int report_size = stream->oa_buffer.format->size;
	unsigned long flags;
-	bool pollin;
+	bool pollin = false;
	u32 hw_tail;
	u64 now;
	u32 partial_report_size;
@@ -620,10 +620,10 @@ static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
		stream->oa_buffer.tail = gtt_offset + tail;
		stream->oa_buffer.aging_tail = gtt_offset + hw_tail;
		stream->oa_buffer.aging_timestamp = now;
-	}

-	pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset,
-			  stream->oa_buffer.head - gtt_offset) >= report_size;
+		pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset,
+				  stream->oa_buffer.head - gtt_offset) >= report_size;
+	}

The issue has been correctly identified above. But seems that the real
cause for the issue is not that pollin statement above is misplaced but
that updating the tail via aging is unreliable (at least with the present
timeout as you mention above). Also, it is not clear why we have tail aging
at all, since it seems we can detect when reports land (by checking
report_id and timestamp). So rather than move the pollin into the else, we
should just eliminate the if () part. And if we are eliminating the if ()
we can just eliminate the concept of tail aging from the code (and
comments) and rely solely on explicit detection of reports landing.

I thought so too, it would be much simpler code. Looks like Lionel 
agrees with removing this code as well. 

I do have a couple concerns though.

- In the blocking case, i915_perf_read() path waits on a queue with the
condition being oa_buffer_check_unlocked(). If sampling rate is high, 
oa_buffer_check_unlocked will almost always return true. If we remove 
the if block, we may run the rewind logic too often to detect reports 
that landed. The aging logic is just giving a 100us buffer to avoid 
repeated checks here if tail hasn't moved. (although tbh, 100 us is very 
small).

- The other concern - by dropping all this aging logic, are we changing 
  underlying behavior?

- Is there a significant ROI on current patch vs. dropping all the aging 
  logic?


Separately, there seems to be another related bug in the code, I have sent
a patch for that here:

https://patchwork.freedesktop.org/series/118151/

That's a valid new issue and different from this one, but related to the 
rewind logic. lgtm.

Thanks,
Umesh

Thanks.
--
Ashutosh