Re: [PATCH 2/2] drm/i915/tracepoints: Remove DRM_I915_LOW_LEVEL_TRACEPOINTS Kconfig option

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Wed, 8 Aug 2018 13:13:08 +0100

On 26/06/2018 12:48, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-06-26 12:24:51)

On 26/06/2018 11:55, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-06-26 11:46:51)

On 25/06/2018 21:02, Chris Wilson wrote:
If we know what is wanted can we define that better in terms of
dma_fence and leave lowlevel for debugging (or think of how we achieve
the same with generic bpf? kprobes)? Hmm, I wonder how far we can push
that.

What is wanted is for instance take trace.pl on any kernel anywhere and
it is able to deduce/draw the exact metrics/timeline of command
submission for a workload.

At the moment, without low level tracepoints, and without the
intel_engine_notify tweak, it is workload dependent on how close it
could get.

Interjecting what dma-fence already has (or we could use), not sure how
well userspace can actually map it to their timelines.

So a set of tracepoints to allow drawing the timeline:

1. request_queue (or _add)
dma_fence_init

2. request_submit

3. intel_engine_notify
For obvious reasons, no match in dma_fence.

4. request_in
dma_fence_emit

5. request_out
dma_fence_signal (similar, not quite, we would have to force irq
signaling).

Yes, not quite the same due to potential time shift between user interrupt
and dma_fence_signal call via different paths.

With this set the above is possible and we don't need a lot of work to
get there.
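
As a rough illustration of what a timeline needs from the events listed above, the sketch below derives per-request intervals from them. The record layout (timestamp, event name, request id) and the function name are assumptions made for illustration only; this is not the format trace.pl or perf script actually emits.

```python
from collections import defaultdict

# Event names follow the list in this thread; the record layout is hypothetical.
ORDER = ["request_queue", "request_submit", "request_in", "request_out"]

# Each interval is the time between two of the events above.
SPANS = {
    "queued": ("request_queue", "request_submit"),
    "runnable": ("request_submit", "request_in"),
    "executing": ("request_in", "request_out"),
}


def request_intervals(events):
    """Return {request_id: {"queued": dt, "runnable": dt, "executing": dt}}.

    events: iterable of (timestamp, event_name, request_id) tuples.
    An interval is omitted when one of its two endpoints was never seen.
    """
    seen = defaultdict(dict)
    for ts, name, req in events:
        if name in ORDER:
            # Keep only the first occurrence per request, so a request that
            # is preempted and resubmitted is not modelled accurately here.
            seen[req].setdefault(name, ts)

    result = {}
    for req, stamps in seen.items():
        result[req] = {
            label: stamps[end] - stamps[start]
            for label, (start, end) in SPANS.items()
            if start in stamps and end in stamps
        }
    return result
```

A request that was queued but never submitted simply yields an empty set of intervals, which is roughly the situation when some of the tracepoints are not available.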

From a brief glance we are missing a dma_fence_queue for request_submit
replacement.

So next question is what information do we get from our tracepoints (or
more precisely do you use) that we lack in dma_fence?

Port=%u and preemption (completed=%u) come immediately to mind. A way to
tie them to engines would be nice, otherwise it is all abstract timelines.

Going this direction sounds like a long detour to get where we almost
are. I suspect you are valuing the benefit of it being generic and hence
any parsing tool could be cross-driver. But you can also just punt the
"abstractising" into the parsing tool.
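
To make the "abstractising in the tool" idea concrete, here is a minimal sketch, assuming the tool keeps one small table per driver that maps driver tracepoint names onto its own abstract event names. The abstract names ("queued", "submitted", ...) and the tuple layout are hypothetical; only the i915 names come from this thread.

```python
# Abstract event names are hypothetical; the i915 names follow this thread.
I915_EVENTS = {
    "request_queue": "queued",
    "request_submit": "submitted",
    "request_in": "started",
    "request_out": "completed",
}

# One table per driver; only i915 is filled in here.
DRIVER_TABLES = {"i915": I915_EVENTS}


def normalise(driver, events):
    """Translate (timestamp, name, request_id) tuples to abstract names.

    Events the driver table does not know about are dropped.
    """
    table = DRIVER_TABLES[driver]
    return [(ts, table[name], req) for ts, name, req in events if name in table]
```

Everything downstream of normalise() then only sees the abstract names, so supporting another driver would mean adding a table rather than changing the kernel side.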

It's just that this is about the third time this has been raised in the
last couple of weeks with the other two requests being from a generic
tooling pov (Eric Anholt for gnome-shell tweaking, and someone
else looking for a gpuvis-like tool). So it seems like there is
interest, even if I doubt that it'll help answer any questions beyond
what you can just extract from looking at userspace. (Imo, the only
people these tracepoints are useful for are people writing patches for
the driver. For everyone else, you can just observe system behaviour and
optimise your code for your workload. Otoh, can one trust a black
box, argh.)

Some of the things might be obtainable purely from userspace via heavily 
instrumented builds, which may be in the realm of the possible during 
development, but I don't think it is feasible in general both because it 
is too involved, and because it would preclude existence of tools which 
can trace any random client.

To have a second set of nearly equivalent tracepoints, we need to have
strong justification why we couldn't just use or extend the generic set.

I was hoping that the conversation so far established that nearly 
equivalent is not close enough for intended use cases. And that it is not 
possible to make the generic ones so.

Plus I feel a lot more comfortable exporting a set of generic
tracepoints, than those where we may be leaking more knowledge of the HW
than we can reasonably expect to support for the indefinite future.

I think it is accepted we cannot guarantee low level tracepoints will be 
supportable in the future world of GuC scheduling. (How and what we will 
do there is yet unresolved.) But at least we get much better usability 
for platforms up to there, and for very small effort. The idea is not to 
mark these as ABI but just improve user experience.

You are, I suppose, worried that if these tracepoints disappeared due to
being un-implementable someone will complain?

I just want that anyone can run trace.pl and see how virtual engine 
behaves, without having to recompile the kernel. And VTune people want 
the same for their enterprise-level customers. Both tools are ready to 
adapt should it be required. It is, I repeat, just usability and user 
experience out of the box.

And with the Virtual Engine it will become more interesting to have
this. So if we had a bug report saying load balancing is not working
well, we could just say "please run it via trace.pl --trace and attach
perf script output". That way we could easily see whether or not it is a
problem in userspace behaviour or else.
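
For example, a first look at such a report could be to sum the request_in to request_out time per engine and compare. The sketch below assumes each parsed record also carries an engine name, which is an assumption about the parsed data and not something established in this thread; the function name is likewise hypothetical.

```python
from collections import defaultdict


def executing_time_per_engine(events):
    """Sum request_in -> request_out time per engine.

    events: iterable of (timestamp, event_name, request_id, engine) tuples,
    assumed to be sorted by timestamp.

    Each request is counted on its own, so requests that overlap on one
    engine (for example in different ports) are counted individually and
    the result is not wall-clock busy time.
    """
    started = {}
    total = defaultdict(float)
    for ts, name, req, engine in events:
        key = (engine, req)
        if name == "request_in":
            started[key] = ts
        elif name == "request_out" and key in started:
            total[engine] += ts - started.pop(key)
    return dict(total)
```

A large imbalance between engines in the result would be a hint to look closer, not a diagnosis by itself.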

And there I was wanting a script to capture the workload so that we
could replay it and dissect it. :-p

Depends on what level you want that. Perf script output from the above
tracepoints would do on one level. If you wanted a higher level to
re-exercise load balancing then it wouldn't completely be enough, or at
least a lot of guesswork would be needed.

It all depends on what level you want to optimise, is the way I look at
it. Userspace driver, you capture the client->driver userspace API (e.g.
cairo-trace, apitrace). But for optimising scheduling layout, we just
need a workload descriptor like wsim -- with perhaps the only tweak
being able to define latency/throughput metrics relevant to that
workload, and being able to integrate with a pseudo display server. The
challenge as I see it is being able to convince the user that it is a
useful diagnosis step and being able to generate a reasonable wsim
automatically.

To derive wsim's from apitraces sounds much more challenging but also I 
think is orthogonal. Tracing could be always there on the low level 
whether the client is real or simulated.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx