Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Sun, 6 Oct 2019 21:44:15 -0700

> On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> 
> On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>> 
>>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>>>>> 
>>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
>>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>> ...
>>>>>>> 
>>>>>>> In the future, I would like to extend these interfaces to use static calls,
>>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For
>>>>>>> the time being, we rely on weak aliases and conditional exports so that the
>>>>>>> users of the library interfaces link directly to the accelerated versions,
>>>>>>> but without the ability to unplug them.
>>>>>>> 
>>>>>> 
>>>>>> As it turns out, we don't actually need static calls for this.
>>>>>> Instead, we can simply permit weak symbol references to go unresolved
>>>>>> between modules (as we already do in the kernel itself, due to the
>>>>>> fact that ELF permits it), and have the accelerated code live in
>>>>>> separate modules that may not be loadable on certain systems, or be
>>>>>> blacklisted by the user.
>>>>> 
>>>>> You're saying that at module insertion time, the kernel will override
>>>>> weak symbols with those provided by the module itself? At runtime?
>>>>> 
>>>> 
>>>> Yes.
>>>> 
>>>>> Do you know offhand how this patching works? Is there a PLT that gets
>>>>> patched, and so the calls all go through a layer of function pointer
>>>>> indirection? Or are all call sites fixed up at insertion time and the
>>>>> call instructions rewritten with some runtime patching magic?
>>>>> 
>>>> 
>>>> No magic. Take curve25519 for example, when built for ARM:
>>>> 
>>>> 00000000 <curve25519>:
>>>>  0:   f240 0300       movw    r3, #0
>>>>                       0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch
>>>>  4:   f2c0 0300       movt    r3, #0
>>>>                       4: R_ARM_THM_MOVT_ABS   curve25519_arch
>>>>  8:   b570            push    {r4, r5, r6, lr}
>>>>  a:   4604            mov     r4, r0
>>>>  c:   460d            mov     r5, r1
>>>>  e:   4616            mov     r6, r2
>>>> 10:   b173            cbz     r3, 30 <curve25519+0x30>
>>>> 12:   f7ff fffe       bl      0 <curve25519_arch>
>>>>                       12: R_ARM_THM_CALL      curve25519_arch
>>>> 16:   b158            cbz     r0, 30 <curve25519+0x30>
>>>> 18:   4620            mov     r0, r4
>>>> 1a:   2220            movs    r2, #32
>>>> 1c:   f240 0100       movw    r1, #0
>>>>                       1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0
>>>> 20:   f2c0 0100       movt    r1, #0
>>>>                       20: R_ARM_THM_MOVT_ABS  .LANCHOR0
>>>> 24:   f7ff fffe       bl      0 <__crypto_memneq>
>>>>                       24: R_ARM_THM_CALL      __crypto_memneq
>>>> 28:   3000            adds    r0, #0
>>>> 2a:   bf18            it      ne
>>>> 2c:   2001            movne   r0, #1
>>>> 2e:   bd70            pop     {r4, r5, r6, pc}
>>>> 30:   4632            mov     r2, r6
>>>> 32:   4629            mov     r1, r5
>>>> 34:   4620            mov     r0, r4
>>>> 36:   f7ff fffe       bl      0 <curve25519_generic>
>>>>                       36: R_ARM_THM_CALL      curve25519_generic
>>>> 3a:   e7ed            b.n     18 <curve25519+0x18>
>>>> 
>>>> curve25519_arch is a weak reference. It either gets satisfied at
>>>> module load time, or it doesn't.
>>>> 
>>>> If it does get satisfied, the relocations covering the movw/movt pair
>>>> and the one covering the bl instruction get updated so that they point
>>>> to the arch routine.
>>>> 
>>>> If it does not get satisfied, the relocations are disregarded, in
>>>> which case the cbz instruction at offset 0x10 jumps over the bl call.
>>>> 
>>>> Note that this does not involve any memory accesses. It does involve
>>>> some code patching, but only of the kind the module loader already
>>>> does.
>>> 
>>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?
>>> 
>> 
>> Indeed, the arch module needs to be loaded first
>> 
> 
> Actually, this can be addressed by retaining the module dependencies
> as before, but permitting the arch module to be omitted at load time.

I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

> 
>>> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded.
>> 
>> That is what I am doing for chacha and poly
>> 
>>> Or use static calls.
> 
> Given that static calls don't actually exist yet, I propose to proceed
> with the approach above, and switch to static calls once all
> architectures where it matters have an implementation that does not
> use function pointers (which is how static calls will be implemented
> generically)