Hi everyone,

I mostly use Julia (an LLVM-based language), but have been toying around a little with gcc for numerical code, because:

1) Fun/learning.
2) Julia's JIT is basically a lazy static compiler, but the statically compiled code is not saved between sessions (i.e., exiting and restarting a Julia REPL). If some code takes 10 seconds to compile, you may prefer to do that only once.
3) When using pointers or references in Julia, it will refuse to use vector instructions, and numerical code slows down dramatically. There is no "restrict". (My current hack, for when I can guarantee that pointers don't actually alias, is to call the function "code_llvm", which returns the LLVM IR generated for a given set of input types, pass it types that cannot alias, and then use llvmcall to reuse that highly vectorized code for the types that theoretically can alias.)

Anyway, I am testing with the following (my results are similar with gcc-8):

$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/9.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk/configure --program-suffix=-trunk : (reconfigured) ../gcc-trunk/configure --program-suffix=-trunk --enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion
Thread model: posix
gcc version 9.0.0 20180524 (experimental) (GCC)

For a simple dot product function, gcc gives me output like:

        vfmadd231pd %xmm7, %xmm3, %xmm1
        vmulpd %xmm10, %xmm8, %xmm3
        vfmadd231pd %xmm9, %xmm4, %xmm3
        vaddpd %xmm1, %xmm3, %xmm1
        vaddpd %xmm1, %xmm0, %xmm0

while with Julia + LLVM (3.9), the assembly looks more like:

        vmovupd -96(%rdx), %ymm5
        vmovupd -64(%rdx), %ymm6
        vmovupd -32(%rdx), %ymm7
        vmovupd (%rdx), %ymm8
        vmulpd -96(%rdi), %ymm5, %ymm5
        vmulpd -64(%rdi), %ymm6, %ymm6
        vmulpd -32(%rdi), %ymm7, %ymm7
        vmulpd (%rdi), %ymm8, %ymm8
        vaddpd %ymm5, %ymm0, %ymm0
        vaddpd %ymm6, %ymm2, %ymm2
        vaddpd %ymm7, %ymm3, %ymm3
        vaddpd %ymm8, %ymm4, %ymm4

I also tested unrolled matrix operations, since I wanted to try creating some kernels for a linear algebra library. For an 8x8 matrix multiplication, the unrolled expressions take Fortran about 120 ns, while the builtin matmul comes in at 90 ns, and the assembly looks much cleaner -- but it still uses only xmm registers. Julia/LLVM (6.0), however, happily uses chiefly ymm registers:

        vmovupd %ymm9, -128(%rsp)
        vmovupd -64(%rsp), %ymm9
        vfmadd231pd %ymm0, %ymm2, %ymm14
        vfmadd231pd %ymm15, %ymm2, %ymm5

with LLVM IR:

; Function add_fast; {
; Location: fastmath.jl:163
  %403 = fadd fast <4 x double> %402, %400
  %404 = fadd fast <4 x double> %403, %401
;}}}
; Function mul_fast; {
; Location: fastmath.jl:165
  %405 = fmul fast <4 x double> %383, %30
  %406 = fmul fast <4 x double> %386, %34
  %407 = fmul fast <4 x double> %389, %38
  %408 = fmul fast <4 x double> %392, %42
;}

and it clocks in at about 58 ns median on my machine -- again, a substantial improvement.

I'm happy to share any code (add as attachments?). Is there any way I can encourage gcc to use AVX instructions / ymm registers? I'd been using:

-march=native -Ofast -shared -fPIC

and tried adding -fvect-cost-model=unlimited, as well as a mix of other random options, in hopes of encouraging it to produce faster code.

Any ideas, suggestions, something obvious I missed? Or is there any reason why gcc prefers not to generate AVX instructions, even when -march=native is given? For reference, native == znver1.

Thanks,
Chris
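To make the dot product comparison concrete, here is a minimal sketch of the kind of kernel I mean (not my exact benchmark; names are illustrative). The `restrict` qualifiers promise gcc that the arrays don't alias, which is exactly the guarantee I can't express in Julia:

```c
#include <stddef.h>

/* Minimal dot-product kernel. `restrict` tells the compiler the two
   arrays cannot alias, leaving the vectorizer free to load wide
   vectors and fuse multiply-adds. */
double dotprod(const double *restrict a, const double *restrict b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```

Compiled with the flags below, this is the function whose assembly uses only xmm registers for me.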
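The 8x8 kernels I'm timing are fully unrolled, but they are essentially unrolled versions of a loop nest like this (sketch only; column-major storage to match Fortran/Julia):

```c
/* 8x8 matrix multiply C = A*B, column-major storage as in
   Fortran/Julia. The benchmarked kernels are fully unrolled
   versions of this triple loop. */
enum { DIM = 8 };

void mul8x8(const double *restrict A, const double *restrict B,
            double *restrict C)
{
    for (int j = 0; j < DIM; ++j)
        for (int i = 0; i < DIM; ++i) {
            double s = 0.0;
            for (int k = 0; k < DIM; ++k)
                s += A[i + k * DIM] * B[k + j * DIM];
            C[i + j * DIM] = s;
        }
}
```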
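For reference, this is roughly how I build and inspect the output (file names are placeholders; the flags are the ones listed above):

```shell
# Build a shared library, as I do for calling the kernel from Julia:
gcc-trunk -march=native -Ofast -shared -fPIC dotprod.c -o libdotprod.so

# Or dump the assembly to stdout to see which registers get used:
gcc-trunk -march=native -Ofast -S -o - dotprod.c | grep -E 'mm[0-9]'
```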