Re: [RFC] drm/i915: Move execlists irq handler to a bottom half

On 23/03/16 09:14, Chris Wilson wrote:
On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote:
On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Doing a lot of work in the interrupt handler introduces huge
latencies to the system as a whole.

The most dramatic effect can be seen by running an all-engine
stress test like igt/gem_exec_nop/all where, when the kernel
config is lean enough, the whole system can be brought into
multi-second periods of complete non-interactivity. That can
look, for example, like this:

  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
  Modules linked in: [redacted for brevity]
  CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
  Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
  Workqueue: i915 gen6_pm_rps_work [i915]
  task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
  RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
  RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
  RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
  RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
  RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
  R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
  R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
  FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Stack:
   042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
   0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
   0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
  Call Trace:
   <IRQ>
   [<ffffffff8104a716>] irq_exit+0x86/0x90
   [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
   [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
   <EOI>
   [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
   [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
   [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
   [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
   [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
   [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
   [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
   [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
   [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
   [<ffffffff8105ab29>] process_one_work+0x139/0x350
   [<ffffffff8105b186>] worker_thread+0x126/0x490
   [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
   [<ffffffff8105fa64>] kthread+0xc4/0xe0
   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
   [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170

I could not explain, or find a code path which would explain,
a +20 second lockup, but from some instrumentation it was
apparent that the interrupts-off proportion of time was between
10-25% under heavy load, which is quite bad.

By moving the GT interrupt handling to a tasklet in the
simplest way, the problem above disappears completely.
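
To illustrate the shape of the change, here is a minimal sketch of
the tasklet approach. The structure and function names below are
hypothetical stand-ins, not the actual diff; intel_lrc_irq_handler
stands in for the work previously done directly in the hard irq
handler.

/*
 * Sketch only: the hard irq handler just acks the hardware and
 * defers the execlists processing to a tasklet (softirq context,
 * hard irqs enabled).
 */
#include <linux/interrupt.h>

struct intel_engine {
	struct tasklet_struct irq_tasklet;
	/* ... */
};

/* Hypothetical prototype for the deferred work. */
void intel_lrc_irq_handler(struct intel_engine *engine);

/* Bottom half: long CSB processing here no longer runs with hard
 * interrupts disabled, so it cannot stall the whole system. */
static void execlists_irq_tasklet(unsigned long data)
{
	struct intel_engine *engine = (struct intel_engine *)data;

	intel_lrc_irq_handler(engine);
}

/* Hard irq handler: only schedule the bottom half and return. */
static irqreturn_t gt_irq_handler(int irq, void *arg)
{
	struct intel_engine *engine = arg;

	tasklet_hi_schedule(&engine->irq_tasklet);
	return IRQ_HANDLED;
}

static void engine_init_irq(struct intel_engine *engine)
{
	tasklet_init(&engine->irq_tasklet, execlists_irq_tasklet,
		     (unsigned long)engine);
}

Because the tasklet runs with hard interrupts enabled, the irq
handler itself returns almost immediately and the heavy processing
can be interrupted by the timer tick and other devices.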

Also, gem_latency -n 100 shows 25% better throughput and CPU
usage, and 14% better latencies.

Forgot gem_syslatency, since that reflects UX under load
startlingly well.

gem_syslatency, before:

gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us

with tasklet:

gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
gem_syslatency: cycles=841683, latency mean=56.914us max=1667us

Is this showing smaller throughput but better latency?

gem_syslatency -n, before:

gem_syslatency: cycles=0, latency mean=2.446us max=18us
gem_syslatency: cycles=0, latency mean=7.220us max=37us
gem_syslatency: cycles=0, latency mean=6.949us max=36us
gem_syslatency: cycles=0, latency mean=5.931us max=39us

with tasklet:

gem_syslatency: cycles=0, latency mean=2.477us max=5us
gem_syslatency: cycles=0, latency mean=2.471us max=6us
gem_syslatency: cycles=0, latency mean=2.696us max=24us
gem_syslatency: cycles=0, latency mean=6.414us max=39us

This looks potentially the same or very similar. May need more runs to get a definitive picture.

gem_exec_nop/all also shows a huge improvement, if we ignore the fact that it locks up the system with the current irq handler at full tilt. When I limit the max GPU frequency a bit it can avoid that problem, but tasklets make it twice as fast here.

I did not find any gains or regressions with Synmark2 or
GLbench under light testing. More benchmarking is certainly
required.


Bugzilla?

Do you think it is OK to keep sharing yours, https://bugs.freedesktop.org/show_bug.cgi?id=93467?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

I thought tasklets were considered unpopular nowadays? They still steal CPU

I did not know that; the last (and first) time I used them was ~15 years ago. :) Have you got any links to read about it? Because (see below) I am not sure they "steal" CPU time.

time, just have the benefit of not also disabling hard interrupts.
There should be mitigation, though, to offload these soft interrupts
to threads. Have you tried to create a threaded interrupt handler
just for these pins instead? A bit of boilerplate, but not much,
using the genirq stuff IIRC.
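
For reference, the genirq route would look roughly like this; a
hedged sketch, where the handler names and the primary/thread split
are assumptions and only request_threaded_irq itself is the real
genirq API:

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Primary handler, hard irq context: ack the hardware and ask
 * genirq to wake the handler thread for the heavy lifting. */
static irqreturn_t i915_irq_primary(int irq, void *arg)
{
	/* read and ack IIR here ... */
	return IRQ_WAKE_THREAD;
}

/* Thread handler: runs in a dedicated "irq/<nr>-i915" kthread
 * created by genirq, schedulable and visible in top like any
 * other task. */
static irqreturn_t i915_irq_thread(int irq, void *arg)
{
	/* process the context status buffer, submit, etc. */
	return IRQ_HANDLED;
}

static int i915_setup_irq(struct pci_dev *pdev, void *priv)
{
	return request_threaded_irq(pdev->irq, i915_irq_primary,
				    i915_irq_thread, IRQF_SHARED,
				    "i915", priv);
}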

Ah, you haven't been reading the patches. Yes, there's been a patch to fix
the hard lockup using kthreads for a few months. Tvrtko is trying to move
this forward, since he too has found a way of locking up his machine
using execlists under load.

Correct.

So far kthreads seem to have a slight edge in the benchmarks, or rather,
using tasklets I have some very wild results on Braswell. Using tasklets
the CPU time is accounted to the process (i.e. whoever was running at
the time of the irq, typically the benchmark); using kthreads we have

I thought they run from ksoftirqd, so the CPU time is accounted against it. And looking at top, that even seems to be what actually happens.

independent entries in the process table/top (which makes it quite nice
to see just how much time is being eaten up by the context switches).
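
For comparison, the kthread variant is roughly this shape (a sketch
with hypothetical names; process_csb stands in for the deferred
work). The point about accounting is that the worker is an ordinary
task, so it shows up as its own entry in top:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/wait.h>

struct intel_engine {
	struct task_struct *worker;
	wait_queue_head_t irq_wq;
	unsigned long irq_pending;
	/* ... */
};

/* Hypothetical stand-in for the deferred work. */
static void process_csb(struct intel_engine *engine);

static int execlists_worker(void *arg)
{
	struct intel_engine *engine = arg;

	for (;;) {
		wait_event_interruptible(engine->irq_wq,
				test_bit(0, &engine->irq_pending) ||
				kthread_should_stop());
		if (kthread_should_stop())
			break;
		if (test_and_clear_bit(0, &engine->irq_pending))
			process_csb(engine);
	}
	return 0;
}

/* Called from the hard irq handler instead of doing the work inline. */
static void engine_irq_kick(struct intel_engine *engine)
{
	set_bit(0, &engine->irq_pending);
	wake_up(&engine->irq_wq);
}

static int engine_start_worker(struct intel_engine *engine)
{
	init_waitqueue_head(&engine->irq_wq);
	engine->worker = kthread_run(execlists_worker, engine, "i915/cs");
	return PTR_ERR_OR_ZERO(engine->worker);
}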

Benchmarks are still progressing; I haven't yet got on to the latency
measurements....

My tasklet hack required surprisingly little code change, at least if there are no missed corner cases to handle, but I don't mind your threads either.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



