On Sat, Jan 04, 2025 at 01:19:59AM -0600, Lucas De Marchi wrote:
> On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:
> > On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:
> > > On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:
> > > > On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:
> > > > > On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
> > > > > > This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
> > > > > > queue file lock usage."). While it's desired to have the mutex
> > > > > > protect only the reference to the exec queue, getting and dropping each
> > > > > > mutex and then later getting the GPU timestamp doesn't produce a
> > > > > > correct result: it introduces multiple opportunities for the task to be
> > > > > > scheduled out, wreaking havoc on the deltas reported to userspace.
> > > > > >
> > > > > > Also, to better correlate the timestamp from the exec queues with the
> > > > > > GPU, disable preemption so they can be updated without allowing the task
> > > > > > to be scheduled out. We leave interrupts enabled as they shouldn't cause
> > > > > > enough disturbance in the deltas to matter to userspace.
> > > > >
> > > > > Like I said in the past, this is not trivial to solve and I
> > > > > would hate to add anything in the KMD to do so.
> > > >
> > > > I think the best we can do on the kernel side is to try to guarantee that
> > > > the correlated counters are sampled together... And that is already very
> > > > good per my tests. Also, it'd not only be good from a testing
> > > > perspective, but for any userspace trying to make sense of the 2
> > > > counters.
> > > >
> > > > Note that this is not much different from how e.g. perf samples group
> > > > events:
> > > >
> > > > 	The unit of scheduling in perf is not an individual event, but rather an
> > > > 	event group, which may contain one or more events (potentially on
> > > > 	different PMUs). The notion of an event group is useful for ensuring
> > > > 	that a set of mathematically related events are all simultaneously
> > > > 	measured for the same period of time. For example, the number of L1
> > > > 	cache misses should not be larger than the number of L2 cache accesses.
> > > > 	Otherwise, it may happen that the events get multiplexed and their
> > > > 	measurements would no longer be comparable, making the analysis more
> > > > 	difficult.
> > > >
> > > > See __perf_event_read() that will call pmu->read() on all sibling events
> > > > while disabling preemption:
> > > >
> > > > 	perf_event_read()
> > > > 	{
> > > > 		...
> > > > 		preempt_disable();
> > > > 		event_cpu = __perf_event_read_cpu(event, event_cpu);
> > > > 		...
> > > > 		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> > > > 		preempt_enable();
> > > > 		...
> > > > 	}
> > > >
> > > > so... at least there's prior art for that... for the same reason that
> > > > userspace should see the values sampled together.
> > >
> > > Well, I have used preempt_disable/enable when fixing some
> > > selftests (i915), but was not happy that there were still some rare
> > > failures. If reducing error rates is the intention, then it's fine.
> > > In my mind, the issue still exists and once in a while we would end
> > > up assessing such a failure. Maybe, in addition, fixing up the IGTs
> > > like you suggest below is a worthwhile option.
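
For context, the kernel change under discussion boils down to sampling the
two correlated counters back to back inside a single non-preemptible
section, roughly like the sketch below. The structures and helpers here
(struct client, struct exec_queue, read_gpu_timestamp()) are illustrative
placeholders, not the actual xe code:

#include <linux/list.h>
#include <linux/preempt.h>
#include <linux/types.h>

struct exec_queue {
	struct list_head link;
	u64 run_ticks;			/* accumulated GPU run ticks */
};

struct client {
	struct list_head queues;	/* exec queues owned by this file */
};

/* hypothetical helper reading the free-running GPU/GT timestamp */
u64 read_gpu_timestamp(struct client *c);

static void sample_busyness(struct client *c, u64 *run_ticks, u64 *gpu_ts)
{
	struct exec_queue *q;

	*run_ticks = 0;

	/*
	 * Keep the task on the CPU while both values are read so the
	 * run-ticks sum and the GPU timestamp describe the same instant.
	 */
	preempt_disable();
	list_for_each_entry(q, &c->queues, link)
		*run_ticks += q->run_ticks;
	*gpu_ts = read_gpu_timestamp(c);
	preempt_enable();
}

With interrupts left enabled, the non-preemptible window is only a handful
of reads long, which is the "shouldn't matter to userspace" disturbance the
commit message refers to.
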
IMO, we should strive to avoid using low-level calls like preempt_disable
and preempt_enable, as they lead to unmaintainable nonsense, as seen in the
i915 driver. Sure, in Umesh's example, this is pretty clear and not an
unreasonable usage. However, I'm more concerned that this sets a precedent
in Xe that doing this is acceptable.

Not a blocker, just expressing concerns.

Matt

> > for me this fix is not targeted at tests, even if it improves them a
> > lot. It's more for consistent userspace behavior.
> >
> > > > > For IGT, why not just take 4 samples for the measurement
> > > > > (separate out the 2 counters)
> > > > >
> > > > > 1. get gt timestamp in the first sample
> > > > > 2. get run ticks in the second sample
> > > > > 3. get run ticks in the third sample
> > > > > 4. get gt timestamp in the fourth sample
> > > > >
> > > > > Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for
> > > > > run ticks delta.
> > > >
> > > > this won't fix it for the general case: you get rid of the > 100% case,
> > > > you make the < 100% much worse.
> > >
> > > yeah, that's quite possible.
> > >
> > > > From a testing perspective I think the non-flaky solution is to stop
> > > > calculating percentages and rather check that the execution timestamp
> > > > recorded by the GPU very closely matches (minus gpu scheduling delays)
> > > > the one we got via fdinfo once the fence signals and we wait for the job
> > > > completion.
> > >
> > > Agree, we should change how we validate the counters in IGT.
> >
> > I have a wip patch to clean up and submit to igt. I will submit it soon.
>
> Just submitted that as the last patch in the series:
> https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@xxxxxxxxx/T/#u
>
> but I'd also like to apply this one in the kernel and am still looking
> for a review.
>
> thanks
> Lucas De Marchi
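
As a rough illustration of the validation approach described above (compare
the fdinfo run-ticks delta against the duration the GPU itself recorded for
the job, instead of computing a busyness percentage), here is a sketch. The
helpers are hypothetical placeholders, not the actual IGT patch linked in
the thread:

#include <assert.h>
#include <stdint.h>

/* read the client's accumulated drm-cycles value from fdinfo */
uint64_t read_fdinfo_run_ticks(int drm_fd);
/*
 * submit a job that records its own GPU start/end timestamps, wait for
 * the fence, and return the GPU-measured duration in ticks
 */
uint64_t run_job_and_read_gpu_delta(int drm_fd);

#define SLACK_TICKS 1000	/* arbitrary allowance for scheduling delays */

static void check_run_ticks(int drm_fd)
{
	uint64_t before, after, fdinfo_delta, gpu_delta, diff;

	before = read_fdinfo_run_ticks(drm_fd);
	gpu_delta = run_job_and_read_gpu_delta(drm_fd);
	after = read_fdinfo_run_ticks(drm_fd);

	fdinfo_delta = after - before;
	diff = fdinfo_delta > gpu_delta ? fdinfo_delta - gpu_delta
					: gpu_delta - fdinfo_delta;

	/* the two measurements must agree within the scheduling slack */
	assert(diff <= SLACK_TICKS);
}

Compared with a percentage check, this compares two GPU-derived quantities,
so CPU scheduling of the test process itself no longer skews the result.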