Re: [RFC v3 00/11] i915 PMU and engine busy stats

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 12/09/2017 03:03, Rogozhkin, Dmitry V wrote:
Hi,

Just tried v3 series. perf-stat works fine. From the IGT tests which I wrote for i915 PMU
(https://patchwork.freedesktop.org/series/29313/) all pass (assuming pmu.enabled will be exposed
in debugfs) except cpu_online subtest. And this is pretty interesting - see details below.

Ok, be prepared for the long mail:)...

So, cpu_online subtest loads RCS0 engine 100% and starts to put CPUs offline one by one starting
from CPU0 (don't forget to have CONFIG_BOOTPARAM_HOTPLUG_CPU0=y in .config). Test expectation is
that i915 PMU will report almost 100% (with 2% tolerance) busyness of RCS0. Test requests metric
just twice: before running on RCS0 and right after. This fails as follows:

Executed on rcs0 for 32004262184us
   i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed(&pm.metrics[0]) > (1-USAGE_TOLERANCE) * elapsed_ns(&start, &now)
Stack trace:
   #0 [__igt_fail_assert+0xf1]
   #1 [__real_main773+0xff1]
   #2 [main+0x35]
   #3 [__libc_start_main+0xf5]
   #4 [_start+0x29]
   #5 [<unknown>+0x29]
Subtest cpu_online failed.
**** DEBUG ****
(perf_pmu:6325) DEBUG: Test requirement passed: is_hotplug_cpu0()
(perf_pmu:6325) INFO: perf_init: enabled 1 metrics from 1 requested
(perf_pmu:6325) ioctl-wrappers-DEBUG: Test requirement passed: gem_has_ring(fd, ring)
(perf_pmu:6325) INFO: Executed on rcs0 for 32004262184us
(perf_pmu:6325) INFO:   i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed(&pm.metrics[0]) > (1-USAGE_TOLERANCE) * elapsed_ns(&start, &now)

Now. Looks like that by itself PMU context migration works. For example, if you will comment out
"perf_pmu_migrate_context(&pmu->base, cpu, target)" you will get:

     Executed on rcs0 for 32004434918us
       i915/rcs0-busy/:     76623707us

Compare with previous:
     Executed on rcs0 for 32004262184us
       i915/rcs0-busy/:    2225723999us

This test passed on the previous set of patches, I mean Tvrtko's v2 series + my patches.

So, it seems we are loosing counter values somehow. I saw in the patches that this place really was modified - you
have added subtraction from initial counter value:
static void i915_pmu_event_read(struct perf_event *event)
{

	local64_set(&event->count,
		    __i915_pmu_event_read(event) -
		    local64_read(&event->hw.prev_count));
}

But looks like the problem is that with the PMU context migration we get sequence of events start/stop (or maybe
add/del) which eventually call our i915_pmu_enable/disable. Here is the dmesg log with the obvious printk:

[  153.971096] [IGT] perf_pmu: starting subtest cpu_online
[  153.971151] i915_pmu_enable: event->hw.prev_count=0
[  154.036015] i915_pmu_disable: event->hw.prev_count=0
[  154.048027] i915_pmu_enable: event->hw.prev_count=0
[  154.049343] smpboot: CPU 0 is now offline
[  155.059028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  155.155078] smpboot: CPU 1 is now offline
[  156.161026] x86: Booting SMP configuration:
[  156.161027] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  156.197065] IRQ 122: no longer affine to CPU2
[  156.198087] smpboot: CPU 2 is now offline
[  157.208028] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  157.263093] smpboot: CPU 3 is now offline
[  158.273026] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  158.310026] i915_pmu_disable: event->hw.prev_count=76648307
[  158.319020] i915_pmu_enable: event->hw.prev_count=76648307
[  158.319098] IRQ 124: no longer affine to CPU4
[  158.320368] smpboot: CPU 4 is now offline
[  159.326030] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  159.365306] smpboot: CPU 5 is now offline
[  160.371030] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  160.421077] IRQ 125: no longer affine to CPU6
[  160.422093] smpboot: CPU 6 is now offline
[  161.429030] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  161.467091] smpboot: CPU 7 is now offline
[  162.473027] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  162.527019] i915_pmu_disable: event->hw.prev_count=4347548222
[  162.546017] i915_pmu_enable: event->hw.prev_count=4347548222
[  162.547317] smpboot: CPU 0 is now offline
[  163.553028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  163.621089] smpboot: CPU 1 is now offline
[  164.627028] x86: Booting SMP configuration:
[  164.627029] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  164.669308] smpboot: CPU 2 is now offline
[  165.679025] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  165.717089] smpboot: CPU 3 is now offline
[  166.723025] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  166.775016] i915_pmu_disable: event->hw.prev_count=8574197312
[  166.787016] i915_pmu_enable: event->hw.prev_count=8574197312
[  166.788309] smpboot: CPU 4 is now offline
[  167.794025] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  167.837114] smpboot: CPU 5 is now offline
[  168.847025] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  168.889312] smpboot: CPU 6 is now offline
[  169.899030] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  169.944104] smpboot: CPU 7 is now offline
[  170.954032] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  171.000016] i915_pmu_disable: event->hw.prev_count=12815138319
[  171.008017] i915_pmu_enable: event->hw.prev_count=12815138319
[  171.009304] smpboot: CPU 0 is now offline
[  172.017028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  172.096104] smpboot: CPU 1 is now offline
[  173.106025] x86: Booting SMP configuration:
[  173.106026] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  173.147078] smpboot: CPU 2 is now offline
[  174.153025] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  174.192093] smpboot: CPU 3 is now offline
[  175.198028] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  175.229042] i915_pmu_disable: event->hw.prev_count=17035889079
[  175.242030] i915_pmu_enable: event->hw.prev_count=17035889079
[  175.242163] IRQ fixup: irq 120 move in progress, old vector 131
[  175.242165] IRQ fixup: irq 121 move in progress, old vector 147
[  175.242171] IRQ 124: no longer affine to CPU4
[  175.243435] smpboot: CPU 4 is now offline
[  176.248040] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  176.285328] smpboot: CPU 5 is now offline
[  177.296039] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  177.325067] IRQ 125: no longer affine to CPU6
[  177.326087] smpboot: CPU 6 is now offline
[  178.335036] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  178.377063] IRQ 122: no longer affine to CPU7
[  178.378086] smpboot: CPU 7 is now offline
[  179.388028] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  179.454030] i915_pmu_disable: event->hw.prev_count=21269856967
[  179.470026] i915_pmu_enable: event->hw.prev_count=21269856967
[  179.471110] smpboot: CPU 0 is now offline
[  180.481028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  180.551075] smpboot: CPU 1 is now offline
[  181.558029] x86: Booting SMP configuration:
[  181.558030] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  181.595096] smpboot: CPU 2 is now offline
[  182.605029] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  182.657084] smpboot: CPU 3 is now offline
[  183.668030] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  183.709017] i915_pmu_disable: event->hw.prev_count=25497358644
[  183.727016] i915_pmu_enable: event->hw.prev_count=25497358644
[  183.728305] smpboot: CPU 4 is now offline
[  184.734027] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  184.767090] smpboot: CPU 5 is now offline
[  185.777036] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  185.823096] smpboot: CPU 6 is now offline
[  186.829051] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  186.856350] smpboot: CPU 7 is now offline
[  187.862051] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  187.871216] [IGT] perf_pmu: exiting, ret=99
[  187.889199] Console: switching to colour frame buffer device 240x67
[  187.889583] i915_pmu_disable: event->hw.prev_count=29754080941

And the result which I got in userspace for this run were
     Executed on rcs0 for 32003587981us
       i915/rcs0-busy/: 2247436461us

After that I decided to roll back the change with counting values which I mentioned before, i.e.:
static void i915_pmu_event_read(struct perf_event *event)
{

	local64_set(&event->count,
		    __i915_pmu_event_read(event) /*-
		    local64_read(&event->hw.prev_count)*/);
}

And I got test PASSED :):
     Executed on rcs0 for 32002282603us
       i915/rcs0-busy/: 31998855052us
     Subtest cpu_online: SUCCESS (33.950s)

At this point I need to go home :). Maybe you will have time to look into this issue? If not, I will continue
tomorrow.

I forgot to run this test since I did not have the kernel feature enabled. But yes, now that I tried it, it is failing.

What is happening is that event del (so counter stop as well) is getting called when the CPU goes offline, followed by add->start, and the initial counter value then gets reloaded.

I don't see a way for i915 to distinguish between userspace starting/stopping the event, and perf core doing the same in the CPU migration process. Perhaps Peter could help here?

I am storing the initial counter value when the counter is started so that I can report it's relative value. In other words the change from event start to stop. Perhaps that is not correct and should be left to userspace to handle?

Otherwise we have counters like energy use, and even engine busyness, which will/can already be at some large value before PMU monitoring starts. Which makes things like "perf stat -a -I <command>", or even just normal "perf stat <command>", attribute all previous usage (from before the command profiling started) to the reported stats.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux