On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
> > On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> > >
> > ...
> > >
> > > In the future, I would like to extend these interfaces to use static calls,
> > > so that the accelerated implementations can be [un]plugged at runtime. For
> > > the time being, we rely on weak aliases and conditional exports so that the
> > > users of the library interfaces link directly to the accelerated versions,
> > > but without the ability to unplug them.
> > >
> > As it turns out, we don't actually need static calls for this.
> > Instead, we can simply permit weak symbol references to go unresolved
> > between modules (as we already do in the kernel itself, due to the
> > fact that ELF permits it), and have the accelerated code live in
> > separate modules that may not be loadable on certain systems, or be
> > blacklisted by the user.
>
> You're saying that at module insertion time, the kernel will override
> weak symbols with those provided by the module itself? At runtime?
>

Yes.

> Do you know offhand how this patching works? Is there a PLT that gets
> patched, and so the calls all go through a layer of function pointer
> indirection? Or are all call sites fixed up at insertion time and the
> call instructions rewritten with some runtime patching magic?
>

No magic. Take curve25519 for example, when built for ARM:

00000000 <curve25519>:
   0:   f240 0300   movw    r3, #0
                    0: R_ARM_THM_MOVW_ABS_NC    curve25519_arch
   4:   f2c0 0300   movt    r3, #0
                    4: R_ARM_THM_MOVT_ABS       curve25519_arch
   8:   b570        push    {r4, r5, r6, lr}
   a:   4604        mov     r4, r0
   c:   460d        mov     r5, r1
   e:   4616        mov     r6, r2
  10:   b173        cbz     r3, 30 <curve25519+0x30>
  12:   f7ff fffe   bl      0 <curve25519_arch>
                    12: R_ARM_THM_CALL          curve25519_arch
  16:   b158        cbz     r0, 30 <curve25519+0x30>
  18:   4620        mov     r0, r4
  1a:   2220        movs    r2, #32
  1c:   f240 0100   movw    r1, #0
                    1c: R_ARM_THM_MOVW_ABS_NC   .LANCHOR0
  20:   f2c0 0100   movt    r1, #0
                    20: R_ARM_THM_MOVT_ABS      .LANCHOR0
  24:   f7ff fffe   bl      0 <__crypto_memneq>
                    24: R_ARM_THM_CALL          __crypto_memneq
  28:   3000        adds    r0, #0
  2a:   bf18        it      ne
  2c:   2001        movne   r0, #1
  2e:   bd70        pop     {r4, r5, r6, pc}
  30:   4632        mov     r2, r6
  32:   4629        mov     r1, r5
  34:   4620        mov     r0, r4
  36:   f7ff fffe   bl      0 <curve25519_generic>
                    36: R_ARM_THM_CALL          curve25519_generic
  3a:   e7ed        b.n     18 <curve25519+0x18>

curve25519_arch is a weak reference. It either gets satisfied at module
load time, or it doesn't.

If it does get satisfied, the relocations covering the movw/movt pair
and the one covering the bl instruction get updated so that they point
to the arch routine.

If it does not get satisfied, the relocations are disregarded, in which
case the cbz instruction at offset 0x10 jumps over the bl call.

Note that this does not involve any memory accesses. It does involve
some code patching, but only of the kind the module loader already
does.

> Unless the method is the latter, I would assume that static calls are
> much faster in general? Or the approach already in this series is
> perhaps fine enough, and we don't need to break this apart into
> individual modules complicating everything?

The big advantage this approach has over Zinc is that the accelerated
code does not have to stay resident in memory if the cpu is incapable
of executing it.
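
For reference, a rough C-level sketch of the pattern the disassembly
above was compiled from (my paraphrase, not the literal patch: the
prototypes, the bool return of the arch hook inferred from the cbz at
offset 0x16, and the curve25519_null_point name are illustrative):

        /*
         * curve25519_arch() is only a weak *reference* here, so its
         * address stays NULL unless an accelerated module supplied the
         * symbol when this module was loaded. That address check is the
         * movw/movt + cbz sequence at offsets 0x0-0x10 above.
         */
        extern bool __weak curve25519_arch(u8 out[CURVE25519_KEY_SIZE],
                                           const u8 scalar[CURVE25519_KEY_SIZE],
                                           const u8 point[CURVE25519_KEY_SIZE]);

        bool curve25519(u8 out[CURVE25519_KEY_SIZE],
                        const u8 scalar[CURVE25519_KEY_SIZE],
                        const u8 point[CURVE25519_KEY_SIZE])
        {
                /* fall back to the generic code if the arch routine is
                 * absent, or if it declines to handle the request */
                if (!curve25519_arch || !curve25519_arch(out, scalar, point))
                        curve25519_generic(out, scalar, point);

                /* reject an all-zero result; crypto_memneq() is an
                 * inline wrapper around the __crypto_memneq() call
                 * visible at offset 0x24 */
                return crypto_memneq(out, curve25519_null_point,
                                     CURVE25519_KEY_SIZE);
        }

The module providing the accelerated implementation then just defines
and exports curve25519_arch(); if that module never loads, the weak
reference stays NULL and the generic path is taken, as described above.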