Re: [PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps

On Wed, Jan 08, 2025 at 11:19:52PM -0600, Lucas De Marchi wrote:
> On Wed, Jan 08, 2025 at 08:50:43PM -0800, Matthew Brost wrote:
> > On Sat, Jan 04, 2025 at 01:19:59AM -0600, Lucas De Marchi wrote:
> > > On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:
> > > > On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:
> > > > > On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:
> > > > > > On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:
> > > > > > > On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
> > > > > > > > This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
> > > > > > > > queue file lock usage."). While it's desired to have the mutex to
> > > > > > > > protect only the reference to the exec queue, getting and dropping each
> > > > > > > > mutex and then later getting the GPU timestamp doesn't produce a
> > > > > > > > correct result: it introduces multiple opportunities for the task to be
> > > > > > > > scheduled out, thus wreaking havoc on the deltas reported to userspace.
> > > > > > > >
> > > > > > > > Also, to better correlate the timestamps from the exec queues with the
> > > > > > > > GPU, disable preemption so they can be updated without allowing the task
> > > > > > > > to be scheduled out. We leave interrupts enabled since they shouldn't
> > > > > > > > introduce enough disturbance in the deltas to matter to userspace.
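
For illustration, a minimal sketch of the sampling pattern the commit message
describes; the struct fields and helpers (update_run_ticks(),
read_gt_timestamp(), report_to_fdinfo()) are placeholders, not the actual xe
API:

	/*
	 * Sketch only: sample all exec queue run ticks and the GT
	 * timestamp back to back so they stay correlated. Helper and
	 * field names are placeholders for the real xe code.
	 */
	static void sample_fdinfo_counters(struct xe_file *xef)
	{
		struct xe_exec_queue *q;
		unsigned long i;
		u64 gt_ts;

		/* Hold the lock for the whole walk instead of taking and
		 * dropping it per queue. */
		mutex_lock(&xef->exec_queue.lock);

		/* No preemption between updating the queue timestamps and
		 * reading the GT timestamp; interrupts stay enabled. */
		preempt_disable();

		xa_for_each(&xef->exec_queue.xa, i, q)
			update_run_ticks(q);		/* placeholder */

		gt_ts = read_gt_timestamp(xef);		/* placeholder */

		preempt_enable();
		mutex_unlock(&xef->exec_queue.lock);

		report_to_fdinfo(xef, gt_ts);		/* placeholder */
	}
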
> > > > > > >
> > > > > > > Like I said in the past, this is not trivial to solve and I
> > > > > > > would hate to add anything in the KMD to do so.
> > > > > >
> > > > > > I think the best we can do on the kernel side is to try to guarantee that
> > > > > > the correlated counters are sampled together... And that is already very
> > > > > > good per my tests. Also, it'd be good not only from a testing
> > > > > > perspective, but also for any userspace trying to make sense of the 2
> > > > > > counters.
> > > > > >
> > > > > > Note that this is not much different from how e.g. perf samples group
> > > > > > events:
> > > > > >
> > > > > > 	The unit of scheduling in perf is not an individual event, but rather an
> > > > > > 	event group, which may contain one or more events (potentially on
> > > > > > 	different PMUs). The notion of an event group is useful for ensuring
> > > > > > 	that a set of mathematically related events are all simultaneously
> > > > > > 	measured for the same period of time. For example, the number of L1
> > > > > > 	cache misses should not be larger than the number of L2 cache accesses.
> > > > > > 	Otherwise, it may happen that the events get multiplexed and their
> > > > > > 	measurements would no longer be comparable, making the analysis more
> > > > > > 	difficult.
> > > > > >
> > > > > > See __perf_event_read() that will call pmu->read() on all sibling events
> > > > > > while disabling preemption:
> > > > > >
> > > > > > 	perf_event_read()
> > > > > > 	{
> > > > > > 		...
> > > > > > 		preempt_disable();
> > > > > > 		event_cpu = __perf_event_read_cpu(event, event_cpu);
> > > > > > 		...
> > > > > > 		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> > > > > > 		preempt_enable();
> > > > > > 		...
> > > > > > 	}
> > > > > >
> > > > > > so... at least there's prior art for that, done for the same reason:
> > > > > > userspace should see the values sampled together.
> > > > >
> > > > > Well, I have used the preempt_disable/enable when fixing some
> > > > > selftest (i915), but was not happy that there were still some rare
> > > > > failures. If reducing error rates is the intention, then it's fine.
> > > > > In my mind, the issue still exists and once in a while we would end
> > > > > up assessing such a failure. Maybe, in addition, fixing up the IGTs
> > > > > like you suggest below is a worthwhile option.
> > 
> > IMO, we should strive to avoid using low-level calls like
> > preempt_disable and preempt_enable, as they lead to unmaintainable
> > nonsense, as seen in the i915.
> > 
> > Sure, in Umesh's example, this is pretty clear and not an unreasonable
> > usage. However, I’m more concerned that this sets a precedent in Xe that
> > doing this is acceptable.
> 
> each such usage needs to be carefully reviewed, but there are cases

+1 to carefully reviewing each usage. That goes for any low-level kernel
calls which typically shouldn't be used by drivers.

> in which it's useful and we shouldn't ban it. In my earlier reply on
> wdcrw3du2ssykmsrda3mvwjhreengeovwasikmixdiowqsfnwj@lsputsgtmha4 I even
> showed how the same construct is used by perf when reading counters
> that should be sampled together.

Yea, this usage looks fine. AMD seems to do something similar when
programming / reading HW registers which are tightly coupled.

Matt

> 
> I will post a new version, but I will delay merging it by a week or so.
> That's because I want to check the result of the other patch series that
> changes the test in igt and afaics should give all-green results all the
> time: https://patchwork.freedesktop.org/series/143204/
> 
> Lucas De Marchi
> 
> > 
> > Not a blocker, just expressing concerns.
> > 
> > Matt
> > 
> > > >
> > > > for me this fix is not targeted at tests, even if it improves them a
> > > > lot. It's more for consistent userspace behavior.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > For IGT, why not just take 4 samples for the measurement
> > > > > > > (separating out the 2 counters):
> > > > > > >
> > > > > > > 1. get gt timestamp in the first sample
> > > > > > > 2. get run ticks in the second sample
> > > > > > > 3. get run ticks in the third sample
> > > > > > > 4. get gt timestamp in the fourth sample
> > > > > > >
> > > > > > > Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for
> > > > > > > run ticks delta.
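
For illustration, in test pseudo-code the 4-sample suggestion amounts to
something like the following; read_gt_timestamp() and read_run_ticks() are
hypothetical helpers standing in for however the test parses the fdinfo
values:

	uint64_t ts0, ts1, ticks0, ticks1;

	ts0    = read_gt_timestamp(fd);		/* sample 1 */
	ticks0 = read_run_ticks(fd);		/* sample 2 */

	/* ... let the workload run ... */

	ticks1 = read_run_ticks(fd);		/* sample 3 */
	ts1    = read_gt_timestamp(fd);		/* sample 4 */

	/*
	 * The gt timestamp delta (samples 1 and 4) brackets the run ticks
	 * delta (samples 2 and 3), so the computed busyness can only err
	 * low, never above 100%.
	 */
	double busyness = 100.0 * (ticks1 - ticks0) / (double)(ts1 - ts0);
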
> > > > > >
> > > > > > this won't fix it for the general case: you get rid of the > 100% case,
> > > > > > but you make the < 100% case much worse.
> > > > >
> > > > > yeah, that's quite possible.
> > > > >
> > > > > >
> > > > > > From a testing perspective, I think the non-flaky solution is to stop
> > > > > > calculating percentages and instead check that the execution timestamp
> > > > > > recorded by the GPU very closely matches (minus GPU scheduling delays)
> > > > > > the one we get via fdinfo once the fence signals and we wait for the job
> > > > > > to complete.
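
For illustration, a rough sketch of what such a check could look like;
submit_job_and_wait(), job_gpu_duration(), read_run_ticks() and
SCHED_SLACK_TICKS are hypothetical stand-ins, not existing IGT helpers:

	uint64_t ticks_before, ticks_after, fdinfo_delta, gpu_recorded;

	ticks_before = read_run_ticks(fd);	/* parse fdinfo run ticks */
	submit_job_and_wait(fd);		/* returns once the fence signals */
	ticks_after  = read_run_ticks(fd);

	fdinfo_delta = ticks_after - ticks_before;
	/* e.g. derived from timestamps written by the batch itself */
	gpu_recorded = job_gpu_duration(fd);

	/*
	 * fdinfo must account at least as much time as the GPU actually
	 * executed, plus only a small scheduling overhead.
	 */
	igt_assert(fdinfo_delta >= gpu_recorded);
	igt_assert(fdinfo_delta <= gpu_recorded + SCHED_SLACK_TICKS);
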
> > > > >
> > > > > Agree, we should change how we validate the counters in IGT.
> > > >
> > > > I have a WIP patch to clean up and submit to IGT. I will submit it soon.
> > > 
> > > Just submitted that as the last patch in the series:
> > > https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@xxxxxxxxx/T/#u
> > > 
> > > but I'd also like to apply this one in the kernel and I'm still looking
> > > for a review.
> > > 
> > > thanks
> > > Lucas De Marchi



