Re: [PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps

Umesh Nerlige Ramappa <umesh.nerlige.ramappa@xxxxxxxxx> · Fri, 20 Dec 2024 16:32:09 -0800

On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
queue file lock usage."). While it's desired to have the mutex to
protect only the reference to the exec queue, getting and dropping each
mutex and then later getting the GPU timestamp, doesn't produce a
correct result: it introduces multiple opportunities for the task to be
scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the
GPU, disable preemption so they can be updated without allowing the task
to be scheduled out. We leave interrupts enabled as that shouldn't be
enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would 
hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the
correlated counters are sampled together... And that is already very
good per my tests. Also, it'd not only be good from a testing
perspective, but for any userspace trying to make sense of the 2
counters.

Note that this is not much different from how e.g. perf samples group
events:

	The unit of scheduling in perf is not an individual event, but rather an
	event group, which may contain one or more events (potentially on
	different PMUs). The notion of an event group is useful for ensuring
	that a set of mathematically related events are all simultaneously
	measured for the same period of time. For example, the number of L1
	cache misses should not be larger than the number of L2 cache accesses.
	Otherwise, it may happen that the events get multiplexed and their
	measurements would no longer be comparable, making the analysis more
	difficult.

See __perf_event_read() that will call pmu->read() on all sibling events
while disabling preemption:

	perf_event_read()
	{
		...
		preempt_disable();
		event_cpu = __perf_event_read_cpu(event, event_cpu);
		...
		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
		preempt_enable();
		...
	}

so... at least there's prior art for that... for the same reason that
userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest 
(i915), but was not happy that there were still some rare failures. If 
reducing error rates is the intention, then it's fine. In my mind, the 
issue still exists and once in a while we would end up assessing such a 
failure. Maybe, in addition, fixing up the IGTs like you suggest below 
is a worthwhile option.



For IGT, why not just take 4 samples for the measurement (separate 
out the 2 counters)

1. get gt timestamp in the first sample
2. get run ticks in the second sample
3. get run ticks in the third sample
4. get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks 
delta.

this won't fix it for the general case: you get rid of the > 100% case,
you make the < 100% much worse.

yeah, that's quite possible.


For a testing perspective I think the non-flaky solution is to stop
calculating percentages and rather check that the execution timestamp
recorded by the GPU very closely matches (minus gpu scheduling delays)
the one we got via fdinfo once the fence signals and we wait for the job
completion.

Agree, we should change how we validate the counters in IGT.

Thanks,
Umesh

Lucas De Marchi


A user can always sample them together, but rather than focus on few 
values, they should just normalize the utilization over a longer 
period of time to get smoother stats.

Thanks,
Umesh


Test scenario:

	* IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`
	* Platform: LNL, where CI occasionally reports failures
	* `stress -c $(nproc)` running in parallel to disturb the
	  system

This brings a first failure from "after ~150 executions" to "never
occurs after 1000 attempts".

Cc: stable@xxxxxxxxxxxxxxx # v6.11+
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512
Signed-off-by: Lucas De Marchi <lucas.demarchi@xxxxxxxxx>
---
drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 298a587da7f17..e307b4d6bab5a 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

	/* Accumulate all the exec queues from this client */
	mutex_lock(&xef->exec_queue.lock);
-	xa_for_each(&xef->exec_queue.xa, i, q) {
-		xe_exec_queue_get(q);
-		mutex_unlock(&xef->exec_queue.lock);
+	preempt_disable();

+	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);

-		mutex_lock(&xef->exec_queue.lock);
-		xe_exec_queue_put(q);
-	}
+	preempt_enable();
	mutex_unlock(&xef->exec_queue.lock);

	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
--
2.47.0