On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
<pvanleeuwen@xxxxxxxxxxxxxx> wrote:
>
> But even the CPU only thing may have several implementations, of which
> you want to select the fastest one supported by the _detected_ CPU
> features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> Do you think this would still be efficient if that would be some
> large if-else tree? Also, such a fixed implementation wouldn't scale.

Just a note on this part.

Yes, with retpoline a large if-else tree is actually *way* better for
performance these days than even just one single indirect call. I
think the cross-over point is somewhere around 20 if-statements.

But those kinds of things also are things that we already handle well
with instruction rewriting, so they can actually have even less of an
overhead than a conditional branch. Using code like

    if (static_cpu_has(X86_FEATURE_AVX2))

actually ends up patching the code at run-time, so you end up having
just an unconditional branch. Exactly because CPU feature choices
often end up being in critical code-paths where you have
one-or-the-other kind of setup.

And yes, one of the big users of this is very much the crypto library
code.

The code to do the above is disgusting, and when you look at the
generated code you see odd unreachable jumps and what looks like a
slow "bts" instruction that does the testing dynamically. And then the
kernel instruction stream gets rewritten fairly early during the boot
depending on the actual CPU capabilities, and the dynamic tests get
overwritten by a direct jump.

Admittedly I don't think the arm64 people go to quite those lengths,
but it certainly wouldn't be impossible there either. It just takes a
bit of architecture knowledge and a strong stomach ;)

              Linus
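
For readers who want to see the two dispatch styles from the message
above side by side, here is a minimal user-space sketch. It is not
kernel code: the FEAT_* bits, detected_features, and the sum_*
functions are all hypothetical names made up for illustration, and the
"optimized" variants are just stand-ins that call the generic one. It
only shows the source-level shape of an indirect call versus an if-else
tree of direct calls; it does not do the run-time instruction patching
that static_cpu_has() relies on, and it makes no performance claim of
its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, standing in for things like X86_FEATURE_AVX2. */
enum cpu_feature { FEAT_SSE2 = 1u << 0, FEAT_AVX2 = 1u << 1 };

/* Assumed to be set once at startup by some detection routine. */
static unsigned int detected_features;

/* Stand-ins for differently optimized versions of the same operation. */
static uint32_t sum_generic(const uint8_t *p, size_t n)
{
	uint32_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += p[i];
	return s;
}

static uint32_t sum_sse2(const uint8_t *p, size_t n) { return sum_generic(p, n); }
static uint32_t sum_avx2(const uint8_t *p, size_t n) { return sum_generic(p, n); }

/* Style 1: pick a function pointer once, then call through it. */
typedef uint32_t (*sum_fn)(const uint8_t *, size_t);
static sum_fn sum_impl = sum_generic;

static void sum_select(unsigned int features)
{
	detected_features = features;
	if (features & FEAT_AVX2)
		sum_impl = sum_avx2;
	else if (features & FEAT_SSE2)
		sum_impl = sum_sse2;
	else
		sum_impl = sum_generic;
}

static uint32_t sum_indirect(const uint8_t *p, size_t n)
{
	assert(sum_impl);
	/* Indirect call in the source; with retpolines this goes via a thunk. */
	return sum_impl(p, n);
}

/* Style 2: if-else tree where every call target is known at compile time. */
static uint32_t sum_direct(const uint8_t *p, size_t n)
{
	if (detected_features & FEAT_AVX2)
		return sum_avx2(p, n);
	if (detected_features & FEAT_SSE2)
		return sum_sse2(p, n);
	return sum_generic(p, n);
}
```

A caller would run sum_select() once with whatever feature mask it
detected and then use either sum_indirect() or sum_direct(). Both
return the same result. The difference is that the second form contains
only conditional branches and direct calls, which is the shape that the
kernel's instruction rewriting can later collapse into an unconditional
jump.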