Re: crypto: caam from tasklet to threadirq

On Tue, Sep 20, 2016 at 10:10:20PM +0200, Thomas Gleixner wrote:
> On Tue, 20 Sep 2016, Russell King - ARM Linux wrote:
> > which corresponds to an 8% slowdown for the threaded IRQ case.  So,
> > tasklets are indeed faster than threaded IRQs.
> 
> Fair enough.
> 
> > I've tried to perf it, but...
> >  ....
>  
> > So, sorry, I'm not going to bother trying to get any further with this.
> > If the job was not made harder by stupid hardware design and kernel
> > politics, then I might be more inclined to do deeper investigation, but
> > right now I'm finding that I'm not interested in trying to jump through
> > these stupid hoops.
> 
> I'd be very interested in a sched_switch + irq + softirq trace which does
> not involve PMU hardware for both irqthreads and tasklets, but I can
> understand if you can't be bothered to gather it.

That's involved a rebuild of perf to get it to see the trace events.
What I see probably indicates why the crypto AF_ALG way of doing
things sucks big time - the interface to the crypto backends is
entirely synchronous.

This means that we're guaranteed to get one interrupt per msghdr
entry:

crypto/algif_hash.c:
        lock_sock(sk);
        if (!ctx->more) {
                err = af_alg_wait_for_completion(crypto_ahash_init(&ctx->req),
                                                &ctx->completion);
                if (err)
                        goto unlock;
        }
...
        while (msg_data_left(msg)) {
...
                ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, NULL, len);

                err = af_alg_wait_for_completion(crypto_ahash_update(&ctx->req),
                                                 &ctx->completion);
...
        }
...
        ctx->more = msg->msg_flags & MSG_MORE;
        if (!ctx->more) {
                ahash_request_set_crypt(&ctx->req, NULL, ctx->result, 0);
                err = af_alg_wait_for_completion(crypto_ahash_final(&ctx->req),
                                                 &ctx->completion);
        }

At each of the places where af_alg_wait_for_completion() is called, we
end up submitting a bunch of data and then immediately waiting for
the operation to complete - and this can be seen in the perf trace
logs.

So, unless I'm mistaken, there's no way for a crypto backend to run
asynchronously, and there's no way for a crypto backend to batch up
the "job" - in order to do that, I think it would have to store quite
a lot of state.
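
To illustrate what "asynchronously" would mean here: the underlying
ahash API does have a completion callback, so in principle a consumer
could drive the whole stream from the driver's completion path instead
of sleeping after every chunk - something like the sketch below, where
the struct and function names are made up purely for illustration.
That is the kind of per-stream state (scatterlist cursor, chunk index,
final completion) that algif_hash would have to carry to do this, and
it doesn't.

#include <crypto/hash.h>
#include <linux/completion.h>
#include <linux/scatterlist.h>

/* Illustrative per-stream state - not anything that exists today. */
struct my_hash_stream {
        struct ahash_request *req;
        struct scatterlist *sg;         /* one entry per chunk */
        unsigned int nchunks, next, chunk_len;
        struct completion all_done;
        int err;
};

static void my_hash_done(struct crypto_async_request *req, int err);

/* Submit chunks until one goes asynchronous or we run out. */
static void my_hash_submit_next(struct my_hash_stream *s)
{
        int err;

        while (s->next < s->nchunks) {
                ahash_request_set_callback(s->req, 0, my_hash_done, s);
                ahash_request_set_crypt(s->req, &s->sg[s->next++], NULL,
                                        s->chunk_len);

                err = crypto_ahash_update(s->req);
                if (err == -EINPROGRESS)
                        return;         /* callback continues the chain */
                if (err) {
                        s->err = err;
                        break;
                }
                /* completed synchronously - carry on with the next chunk */
        }
        complete(&s->all_done);
}

/* Runs from the driver's completion path (tasklet or irq thread). */
static void my_hash_done(struct crypto_async_request *req, int err)
{
        struct my_hash_stream *s = req->data;

        if (err) {
                s->err = err;
                complete(&s->all_done);
                return;
        }
        my_hash_submit_next(s);
}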

Now, I'm not entirely sure that asking perf to record irq:* and
sched:* events was what we were after - there appear to be no trace
events recorded for entering a threaded IRQ handler.

swapper     0 [000]  3600.260199:             irq:irq_handler_entry: irq=311 name=2101000.jr0
swapper     0 [000]  3600.260209:              irq:irq_handler_exit: irq=311 ret=handled
swapper     0 [000]  3600.260219:                sched:sched_waking: comm=irq/311-2101000 pid=2426 prio=49 target_cpu=000
swapper     0 [000]  3600.260236:                sched:sched_wakeup: comm=irq/311-2101000 pid=2426 prio=49 target_cpu=000
swapper     0 [000]  3600.260258:                sched:sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=irq/311-2101000 next_pid=2426 next_prio=49
irq/311-2101000  2426 [000]  3600.260278:                sched:sched_waking: comm=openssl pid=8005 prio=120 target_cpu=000
irq/311-2101000  2426 [000]  3600.260296:                sched:sched_wakeup: comm=openssl pid=8005 prio=120 target_cpu=000
irq/311-2101000  2426 [000]  3600.260322:                sched:sched_switch: prev_comm=irq/311-2101000 prev_pid=2426 prev_prio=49 prev_state=S ==> next_comm=openssl next_pid=8005 next_prio=120
openssl  8005 [000]  3600.260369:                sched:sched_waking: comm=dd pid=8002 prio=120 target_cpu=000
openssl  8005 [000]  3600.260388:          sched:sched_stat_runtime: comm=openssl pid=8005 runtime=71000 [ns] vruntime=419661176835 [ns]
openssl  8005 [000]  3600.260401:                sched:sched_wakeup: comm=dd pid=8002 prio=120 target_cpu=000
openssl  8005 [000]  3600.260421:                sched:sched_switch: prev_comm=openssl prev_pid=8005 prev_prio=120 prev_state=R ==> next_comm=dd next_pid=8002 next_prio=120
dd  8002 [000]  3600.260459:          sched:sched_stat_runtime: comm=dd pid=8002 runtime=71667 [ns] vruntime=419655248502 [ns]
dd  8002 [000]  3600.260473:                sched:sched_switch: prev_comm=dd prev_pid=8002 prev_prio=120 prev_state=S ==> next_comm=openssl next_pid=8005 next_prio=120
openssl  8005 [000]  3600.260572:          sched:sched_stat_runtime: comm=openssl pid=8005 runtime=112666 [ns] vruntime=419661289501 [ns]
openssl  8005 [000]  3600.260587:                sched:sched_switch: prev_comm=openssl prev_pid=8005 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
swapper     0 [000]  3600.260638:             irq:irq_handler_entry: irq=311 name=2101000.jr0
...

So 123us (260322 - 260199) to the switch to openssl via the threaded IRQ.

tasklet case:

swapper     0 [000]  5082.667101:             irq:irq_handler_entry: irq=311 name=2101000.jr0
swapper     0 [000]  5082.667111:                 irq:softirq_raise: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667119:              irq:irq_handler_exit: irq=311 ret=handled
swapper     0 [000]  5082.667134:                 irq:softirq_entry: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667151:                sched:sched_waking: comm=openssl pid=8251 prio=120 target_cpu=000
swapper     0 [000]  5082.667169:                sched:sched_wakeup: comm=openssl pid=8251 prio=120 target_cpu=000
swapper     0 [000]  5082.667183:                  irq:softirq_exit: vec=6 [action=TASKLET]
swapper     0 [000]  5082.667202:                sched:sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=openssl next_pid=8251 next_prio=120
openssl  8251 [000]  5082.667248:                sched:sched_waking: comm=dd pid=8248 prio=120 target_cpu=000
openssl  8251 [000]  5082.667267:          sched:sched_stat_runtime: comm=openssl pid=8251 runtime=39668 [ns] vruntime=444714027428 [ns]
openssl  8251 [000]  5082.667280:                sched:sched_wakeup: comm=dd pid=8248 prio=120 target_cpu=000
openssl  8251 [000]  5082.667301:                sched:sched_switch: prev_comm=openssl prev_pid=8251 prev_prio=120 prev_state=R ==> next_comm=dd next_pid=8248 next_prio=120
dd  8248 [000]  5082.667339:          sched:sched_stat_runtime: comm=dd pid=8248 runtime=70000 [ns] vruntime=444708097428 [ns]
dd  8248 [000]  5082.667354:                sched:sched_switch: prev_comm=dd prev_pid=8248 prev_prio=120 prev_state=S ==> next_comm=openssl next_pid=8251 next_prio=120
openssl  8251 [000]  5082.667451:          sched:sched_stat_runtime: comm=openssl pid=8251 runtime=113666 [ns] vruntime=444714141094 [ns]
openssl  8251 [000]  5082.667466:                sched:sched_switch: prev_comm=openssl prev_pid=8251 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
swapper     0 [000]  5082.667517:             irq:irq_handler_entry: irq=311 name=2101000.jr0

So 101us (667202 - 667101) between the same two events, which is 22us
faster than the threaded IRQ case above.

In both cases, I picked out an irq_handler_entry event whose timestamp
was close to a whole multiple of 100us.  I haven't looked any deeper
to see how much the hard IRQ->openssl schedule latency varies - I
expect that needs some scripts written.
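
Something along these lines would do the trick - a throwaway sketch
that reads "perf script" output on stdin and prints the hard
IRQ->openssl switch latency in microseconds for each cycle (the field
layout and the hard-coded names are lifted from the excerpts above):

#include <stdio.h>
#include <string.h>

/* Pull out the "<secs>.<usecs>" timestamp that follows the [cpu] field. */
static int get_timestamp(const char *line, double *ts)
{
        const char *p = strchr(line, ']');

        if (!p || sscanf(p + 1, " %lf", ts) != 1)
                return -1;
        return 0;
}

int main(void)
{
        char line[1024];
        double irq_ts = -1.0;

        while (fgets(line, sizeof(line), stdin)) {
                double ts;

                if (get_timestamp(line, &ts))
                        continue;

                if (strstr(line, "irq:irq_handler_entry") &&
                    strstr(line, "name=2101000.jr0")) {
                        irq_ts = ts;    /* start of a new cycle */
                } else if (irq_ts >= 0.0 &&
                           strstr(line, "sched:sched_switch") &&
                           strstr(line, "next_comm=openssl")) {
                        printf("%.0f\n", (ts - irq_ts) * 1e6);
                        irq_ts = -1.0;
                }
        }

        return 0;
}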

Attached are compressed files of the perf script -G output.  If you
want the perf.data files, I have them (I'm not sure how useful they
are without the binaries though.)

> Vs. the PMU interrupt thing. What's the politics about that? Do you have
> any pointers?

I just remember there being a discussion about how stupid FSL have been
and "we're not going to support that" - the perf code wants the per-CPU
performance unit interrupts delivered _on_ the CPU to which the perf
unit is attached.  FSL decided in their stupidity to OR all the perf
unit interrupts together and route them to a single common interrupt.

This means that we end up with one CPU taking the perf interrupt for
any perf unit - and the CPUs can only access their local perf unit.
So, if (eg) CPU1's perf unit fires an interrupt, but the common
interrupt is routed to CPU0, CPU0 checks its perf unit, finds no
interrupt, and returns with IRQ_NONE.

There's no mechanism in perf (or anywhere else) to hand the interrupt
over to another CPU.

The result is that trying to run perf on the multi-core iMX SoCs ends
up with the perf interrupt disabled, at which point perf collapses in
a sad pile.
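
Schematically, the combined handler ends up doing something like this
(illustrative only - not the actual ARM PMU driver code, and the two
helpers are made up):

#include <linux/interrupt.h>
#include <linux/types.h>

static bool local_pmu_has_overflow(void);                 /* made up */
static irqreturn_t handle_local_pmu_overflow(void *dev);  /* made up */

static irqreturn_t pmu_combined_irq(int irq, void *dev)
{
        /*
         * Each CPU can only read its own PMU's overflow flags, so if
         * the overflow actually happened on another CPU's PMU, this
         * CPU sees nothing pending...
         */
        if (!local_pmu_has_overflow())
                /*
                 * ...and the genirq core counts that as a spurious
                 * interrupt; enough of those and the line gets
                 * disabled, taking perf down with it.
                 */
                return IRQ_NONE;

        return handle_local_pmu_overflow(dev);
}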

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

Attachment: threaded.txt.bz2
Description: BZip2 compressed data

Attachment: tasklet.txt.bz2
Description: BZip2 compressed data

