> On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > > On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: >> >>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >>> >>> >>> >>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: >>>> >>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: >>>>> >>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: >>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: >>>>>>> >>>>>> ... >>>>>>> >>>>>>> In the future, I would like to extend these interfaces to use static calls, >>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For >>>>>>> the time being, we rely on weak aliases and conditional exports so that the >>>>>>> users of the library interfaces link directly to the accelerated versions, >>>>>>> but without the ability to unplug them. >>>>>>> >>>>>> >>>>>> As it turns out, we don't actually need static calls for this. >>>>>> Instead, we can simply permit weak symbol references to go unresolved >>>>>> between modules (as we already do in the kernel itself, due to the >>>>>> fact that ELF permits it), and have the accelerated code live in >>>>>> separate modules that may not be loadable on certain systems, or be >>>>>> blacklisted by the user. >>>>> >>>>> You're saying that at module insertion time, the kernel will override >>>>> weak symbols with those provided by the module itself? At runtime? >>>>> >>>> >>>> Yes. >>>> >>>>> Do you know offhand how this patching works? Is there a PLT that gets >>>>> patched, and so the calls all go through a layer of function pointer >>>>> indirection? Or are all call sites fixed up at insertion time and the >>>>> call instructions rewritten with some runtime patching magic? >>>>> >>>> >>>> No magic. Take curve25519 for example, when built for ARM: >>>> >>>> 00000000 <curve25519>: >>>> 0: f240 0300 movw r3, #0 >>>> 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch >>>> 4: f2c0 0300 movt r3, #0 >>>> 4: R_ARM_THM_MOVT_ABS curve25519_arch >>>> 8: b570 push {r4, r5, r6, lr} >>>> a: 4604 mov r4, r0 >>>> c: 460d mov r5, r1 >>>> e: 4616 mov r6, r2 >>>> 10: b173 cbz r3, 30 <curve25519+0x30> >>>> 12: f7ff fffe bl 0 <curve25519_arch> >>>> 12: R_ARM_THM_CALL curve25519_arch >>>> 16: b158 cbz r0, 30 <curve25519+0x30> >>>> 18: 4620 mov r0, r4 >>>> 1a: 2220 movs r2, #32 >>>> 1c: f240 0100 movw r1, #0 >>>> 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 >>>> 20: f2c0 0100 movt r1, #0 >>>> 20: R_ARM_THM_MOVT_ABS .LANCHOR0 >>>> 24: f7ff fffe bl 0 <__crypto_memneq> >>>> 24: R_ARM_THM_CALL __crypto_memneq >>>> 28: 3000 adds r0, #0 >>>> 2a: bf18 it ne >>>> 2c: 2001 movne r0, #1 >>>> 2e: bd70 pop {r4, r5, r6, pc} >>>> 30: 4632 mov r2, r6 >>>> 32: 4629 mov r1, r5 >>>> 34: 4620 mov r0, r4 >>>> 36: f7ff fffe bl 0 <curve25519_generic> >>>> 36: R_ARM_THM_CALL curve25519_generic >>>> 3a: e7ed b.n 18 <curve25519+0x18> >>>> >>>> curve25519_arch is a weak reference. It either gets satisfied at >>>> module load time, or it doesn't. >>>> >>>> If it does get satisfied, the relocations covering the movw/movt pair >>>> and the one covering the bl instruction get updated so that they point >>>> to the arch routine. >>>> >>>> If it does not get satisfied, the relocations are disregarded, in >>>> which case the cbz instruction at offset 0x10 jumps over the bl call. >>>> >>>> Note that this does not involve any memory accesses. It does involve >>>> some code patching, but only of the kind the module loader already >>>> does. >>> >>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? >>> >> >> Indeed, the arch module needs to be loaded first >> > > Actually, this can be addressed by retaining the module dependencies > as before, but permitting the arch module to be omitted at load time. I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > >>> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. >> >> That is what I am doing for chacha and poly >> >>> Or use static calls. > > Given that static calls don't actually exist yet, I propose to proceed > with the approach above, and switch to static calls once all > architectures where it matters have an implementation that does not > use function pointers (which is how static calls will be implemented > generically)