On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > > On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: > > > > > > > > > On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > > > > > > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > >> > > >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > > >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > > >>>> > > >>> ... > > >>>> > > >>>> In the future, I would like to extend these interfaces to use static calls, > > >>>> so that the accelerated implementations can be [un]plugged at runtime. For > > >>>> the time being, we rely on weak aliases and conditional exports so that the > > >>>> users of the library interfaces link directly to the accelerated versions, > > >>>> but without the ability to unplug them. > > >>>> > > >>> > > >>> As it turns out, we don't actually need static calls for this. > > >>> Instead, we can simply permit weak symbol references to go unresolved > > >>> between modules (as we already do in the kernel itself, due to the > > >>> fact that ELF permits it), and have the accelerated code live in > > >>> separate modules that may not be loadable on certain systems, or be > > >>> blacklisted by the user. > > >> > > >> You're saying that at module insertion time, the kernel will override > > >> weak symbols with those provided by the module itself? At runtime? > > >> > > > > > > Yes. > > > > > >> Do you know offhand how this patching works? Is there a PLT that gets > > >> patched, and so the calls all go through a layer of function pointer > > >> indirection? Or are all call sites fixed up at insertion time and the > > >> call instructions rewritten with some runtime patching magic? > > >> > > > > > > No magic. Take curve25519 for example, when built for ARM: > > > > > > 00000000 <curve25519>: > > > 0: f240 0300 movw r3, #0 > > > 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch > > > 4: f2c0 0300 movt r3, #0 > > > 4: R_ARM_THM_MOVT_ABS curve25519_arch > > > 8: b570 push {r4, r5, r6, lr} > > > a: 4604 mov r4, r0 > > > c: 460d mov r5, r1 > > > e: 4616 mov r6, r2 > > > 10: b173 cbz r3, 30 <curve25519+0x30> > > > 12: f7ff fffe bl 0 <curve25519_arch> > > > 12: R_ARM_THM_CALL curve25519_arch > > > 16: b158 cbz r0, 30 <curve25519+0x30> > > > 18: 4620 mov r0, r4 > > > 1a: 2220 movs r2, #32 > > > 1c: f240 0100 movw r1, #0 > > > 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 > > > 20: f2c0 0100 movt r1, #0 > > > 20: R_ARM_THM_MOVT_ABS .LANCHOR0 > > > 24: f7ff fffe bl 0 <__crypto_memneq> > > > 24: R_ARM_THM_CALL __crypto_memneq > > > 28: 3000 adds r0, #0 > > > 2a: bf18 it ne > > > 2c: 2001 movne r0, #1 > > > 2e: bd70 pop {r4, r5, r6, pc} > > > 30: 4632 mov r2, r6 > > > 32: 4629 mov r1, r5 > > > 34: 4620 mov r0, r4 > > > 36: f7ff fffe bl 0 <curve25519_generic> > > > 36: R_ARM_THM_CALL curve25519_generic > > > 3a: e7ed b.n 18 <curve25519+0x18> > > > > > > curve25519_arch is a weak reference. It either gets satisfied at > > > module load time, or it doesn't. > > > > > > If it does get satisfied, the relocations covering the movw/movt pair > > > and the one covering the bl instruction get updated so that they point > > > to the arch routine. > > > > > > If it does not get satisfied, the relocations are disregarded, in > > > which case the cbz instruction at offset 0x10 jumps over the bl call. > > > > > > Note that this does not involve any memory accesses. It does involve > > > some code patching, but only of the kind the module loader already > > > does. > > > > Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? > > > > Indeed, the arch module needs to be loaded first > Actually, this can be addressed by retaining the module dependencies as before, but permitting the arch module to be omitted at load time. > > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. > > That is what I am doing for chacha and poly > > > Or use static calls. Given that static calls don't actually exist yet, I propose to proceed with the approach above, and switch to static calls once all architectures where it matters have an implementation that does not use function pointers (which is how static calls will be implemented generically)