Re: [PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 08, 2025 at 04:57:40PM -0600, Lucas De Marchi wrote:
On Mon, Jan 06, 2025 at 02:44:27PM -0800, Umesh Nerlige Ramappa wrote:
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
queue file lock usage."). While it's desired to have the mutex to
protect only the reference to the exec queue, getting and dropping each
mutex and then later getting the GPU timestamp, doesn't produce a
correct result: it introduces multiple opportunities for the task to be
scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the
GPU, disable preemption so they can be updated without allowing the task
to be scheduled out. We leave interrupts enabled as that shouldn't be
enough disturbance for the deltas to matter to userspace.

Assuming

- timestamp from exec queues = xe_exec_queue_update_run_ticks()
- GPU timestamp = xe_hw_engine_read_timestamp()

shouldn't you also move the xe_hw_engine_read_timestamp() within the preempt_disable/preempt_enable section?

this is what I thought I had done, but it seems I messed up.


Something like this ..

	preempt_disable();

	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);

	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

I'd move the gpu_timestamp to be done before the exec queue, so we don't
call it with the exec queue mutex still taken. AFAIR this is what I was
doing in the test machine, but sent the wrong version of the patch. Let
me double check locally here and resend.

just re-tested that and submitted a v2. With and without the pending igt
patch series I can run the test for 1000 times while running cpu stress
and not have any failure.

Lucas De Marchi


thanks
Lucas De Marchi


	preempt_enable();

Thanks,
Umesh


Test scenario:

	* IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`
	* Platform: LNL, where CI occasionally reports failures
	* `stress -c $(nproc)` running in parallel to disturb the
	  system

This brings a first failure from "after ~150 executions" to "never
occurs after 1000 attempts".

Cc: stable@xxxxxxxxxxxxxxx # v6.11+
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512
Signed-off-by: Lucas De Marchi <lucas.demarchi@xxxxxxxxx>
---
drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 298a587da7f17..e307b4d6bab5a 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

	/* Accumulate all the exec queues from this client */
	mutex_lock(&xef->exec_queue.lock);
-	xa_for_each(&xef->exec_queue.xa, i, q) {
-		xe_exec_queue_get(q);
-		mutex_unlock(&xef->exec_queue.lock);
+	preempt_disable();

+	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);

-		mutex_lock(&xef->exec_queue.lock);
-		xe_exec_queue_put(q);
-	}
+	preempt_enable();
	mutex_unlock(&xef->exec_queue.lock);

	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
--
2.47.0





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux