On Tue, 2017-08-22 at 20:17 +0200, Peter Zijlstra wrote: > On Sat, Aug 12, 2017 at 02:15:13AM +0000, Rogozhkin, Dmitry V wrote: > > $ perf stat -e instructions,i915/rcs0-busy/ workload.sh > > <... wrokload.sh output...> > > > > Performance counter stats for 'workload.sh': > > 1,204,616,268 instructions > > 0 i915/rcs0-busy/ > > > > 1.869169153 seconds time elapsed > > > > As you can see instructions event works pretty well, i915/rcs0-busy/ > > doesn't. > > > > I afraid that our current understanding of how PMU should work is not > > fully correct. > > Can we start off by explaining to me how this i915 stuff works. Because > all I have is ~750 lines of patch without comments. Which sort of leaves > me confused. i915 PMU tries to expose a number of SW metrics to characterize i915 performance: i915 interrupt counts, time (counted in microseconds) GPU spent in specific states like RC6, time (counted in microseconds) GPU engines were busy executing some tasks or were idle, etc. As for now these metrics are i915 global and can't be attached to some particular task, however, we may consider such metrics in the future. I think that global PMU is the nice first step. I had a talk with Andi Kleen who pointed that this is similar to how uncore PMU behaves and suggested I should use 'perf stat -e <event> -a -C0' and probably adjust the code to register i915 PMU with the perf_invalid_context. Which I did in the patch which probably should be squashed to the patch from Tvrtko (I did not want to intersect with his job). This change can be found here: https://patchwork.freedesktop.org/patch/171953/. This patch makes 'perf stat -e i915/rcs0-busy/' to error out and supports 'perf stat -e i915/rcs0-busy/ -a -C0'. I still think I miss something since 'perf stat -e i915/rcs0-busy/ -a' will give metrics multiplied by number of active CPUs, but that's look closer to what is needed. Anyhow, returning to the metrics i915 exposes. Some metrics are just exposure of some counters supported already inside i915 PMU which do not require any special sampling: at any given moment you can request the counter value (these are interrupts counts, i915 power consumption). Other metrics are similar to the ever-existing which I just described, but they require activation for i915 to start to count them - this is done on the event initialization (these are engine busy stats). Finally, there is a third group which require sampling counting: they are needed to be initialized and i915 pmu starts an internal timer to count these values (these are some engines characteristics referenced in the code as QUEUED, SEMA, WAIT). I hope this clarifies. As well I hope I correctly covered high level description of i915 PMU. I hope Chris will correct me if I was wrong somewhere. Mind that another author of i915 PMU Tvrtko is for one more week or so on vacation. > > The above command tries to add an event 'i915/rcs0-busy/' to a task. How > are i915 resource associated to any one particular task? Currently in no way, they are global. > > Is there a unique i915 resource for each task? If not, I don't see how > per-task event can ever work as expected. This depends on what you mean under "expected"? I see 2 variants: 1. Either this command line should return correct metric values 2. This command line should error out Right now i915 PMU produces global metric, thus, I think 2nd variant is true. Which I think is achievable if PMU is registered with perf_invalid_context. > > > I think so, because the way PMU entry points init(), > > add(), del(), start(), stop(), read() are implemented do not correlate > > with how many times they are called. I have counted them and here is the > > result: > > init()=19, add()=44310, del()=43900, start()=44534, stop()=0, read()=0 > > > > Which means that we are regularly attempt to start/stop timer and/or > > busy stats calculations. Another thing which pay attention is that > > read() was not called at all. How perf supposes to get counter value? > > Both stop() and del() are supposed to update event->count. Only if we do > sys_read() while the event is active (something perf-stat never does > IIRC) will it issue pmu::read() to get an up-to-date number. > > > Yet another thing, where we are supposed to initialize our internal > > staff: numbers above are from single run and even init is called > > multiple times? Where we are supposed to de-init our staff: each time on > > del() - this hardly makes sense? > > init happens in pmu::event_init(), that can set an optional > event->destroy() function for de-init. > > init() is called once for each event created, the above creates an > inherited per-task event (I think, I lost track of what perf tool does) > and 19 seems to suggest you did some 18 fork()/clone() calls after that, > resulting in your 1 parent event with 18 children. This behavior changed significantly once I have registered i915 PMU with the perf_invalid_context notifying perf subsystem that out PMU produces global metrics. Right now what I have is: 1. If I call "perf stat -e <event> -a -C0", then event is initialized once 2. If I call "perf stat -e <event> -a", then event is initialized as many times as there are logical CPUs. On my system there were 8. And there is another problem associated with the "perf stat -e <event> -a": since 8 events are initialized, I get metric values multiplied by 8. How I am supposed to fix that? I suspect that this is somehow related to cpumask attribute exposed by some other PMUs, for example uncore. They are doing something like: static cpumask_t uncore_cpu_mask; static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL); static struct attribute *uncore_pmu_attrs[] = { &dev_attr_cpumask.attr, NULL, }; static struct attribute_group uncore_pmu_attr_group = { .attrs = uncore_pmu_attrs, }; Should i915 follow this way as well? > > > I should note that if perf will be issued with -I 10 option, then read() > > is being called: init_c()=265, add_c()=132726, del_c()=131482, > > start_c()=133412, stop()=0, read()=71. However, i915 counter is still 0. > > I have tried to print counter values from within read() and these values > > are non 0. Actually read() returns sequence of <non_zero>, 0, 0, 0, ..., > > <no_zero> because with our add(), del() code we regularly start/stop our > > counter and execution in read() follows different branches. > > > > Thus, I think that right now we do not implement PMU correctly and do > > not meet perf expectations from the PMU. Unfortunately, right now I have > > no idea what are these expectations. > > Please as to clarify how i915 works, I have no idea where to go. _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx