Hi everyone,

I mostly use Julia (an LLVM-based language), but have been toying around a little with gcc for numerical code, because:

1) Fun/learning.
2) Julia's JIT is basically a lazy static compiler, but the statically compiled code is not saved between sessions (i.e., exiting and restarting a Julia REPL). If some code takes 10 seconds to compile, you may prefer to do that only once.
3) When using pointers or references in Julia, it will refuse to use vector instructions, and numerical code slows down dramatically. There is no "restrict". (My current hack, for when I can guarantee that pointers don't actually alias, is to call the function "code_llvm", which returns the LLVM IR generated for a given set of input types, pass it types that cannot alias, and then use llvmcall to reuse that highly vectorized code for the types that theoretically can alias.)

Anyway, I am testing with the following (my results are similar with gcc-8):

$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/9.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk/configure --program-suffix=-trunk : (reconfigured) ../gcc-trunk/configure --program-suffix=-trunk --enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion
Thread model: posix
gcc version 9.0.0 20180524 (experimental) (GCC)

For a simple dot product function, gcc gives me output like:

        vfmadd231pd %xmm7, %xmm3, %xmm1
        vmulpd %xmm10, %xmm8, %xmm3
        vfmadd231pd %xmm9, %xmm4, %xmm3
        vaddpd %xmm1, %xmm3, %xmm1
        vaddpd %xmm1, %xmm0, %xmm0

while with Julia + LLVM (3.9), the assembly looks more like:

        vmovupd -96(%rdx), %ymm5
        vmovupd -64(%rdx), %ymm6
        vmovupd -32(%rdx), %ymm7
        vmovupd (%rdx), %ymm8
        vmulpd -96(%rdi), %ymm5, %ymm5
        vmulpd -64(%rdi), %ymm6, %ymm6
        vmulpd -32(%rdi), %ymm7, %ymm7
        vmulpd (%rdi), %ymm8, %ymm8
        vaddpd %ymm5, %ymm0, %ymm0
        vaddpd %ymm6, %ymm2, %ymm2
        vaddpd %ymm7, %ymm3, %ymm3
        vaddpd %ymm8, %ymm4, %ymm4

I also tested unrolled matrix operations, since I wanted to try creating some kernels for a linear algebra library. For an 8x8 matrix multiplication, the unrolled expressions take Fortran about 120 ns, while the builtin matmul comes in at 90 ns, and the assembly looks much cleaner -- but it still uses only xmm registers. Julia/LLVM (6.0), however, happily uses chiefly ymm registers:

        vmovupd %ymm9, -128(%rsp)
        vmovupd -64(%rsp), %ymm9
        vfmadd231pd %ymm0, %ymm2, %ymm14
        vfmadd231pd %ymm15, %ymm2, %ymm5

with LLVM IR:

; Function add_fast; {
; Location: fastmath.jl:163
  %403 = fadd fast <4 x double> %402, %400
  %404 = fadd fast <4 x double> %403, %401
;}}}
; Function mul_fast; {
; Location: fastmath.jl:165
  %405 = fmul fast <4 x double> %383, %30
  %406 = fmul fast <4 x double> %386, %34
  %407 = fmul fast <4 x double> %389, %38
  %408 = fmul fast <4 x double> %392, %42
;}

and it clocks in at about 58 ns median on my machine -- again, a substantial improvement.

I'm happy to share any code (add as attachments?). Is there any way I can encourage gcc to use AVX instructions / ymm registers? I'd been using:

-march=native -Ofast -shared -fPIC

and tried adding -fvect-cost-model=unlimited, as well as a mix of other random options, in hopes of encouraging it to produce faster code.

Any ideas, suggestions, something obvious I missed? Or is there any reason why gcc prefers not to generate AVX instructions, even when -march=native is given? For reference, native == znver1.

Thanks,
Chris
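To make the dot product comparison concrete, here is a minimal sketch of the kind of kernel I mean (not my exact benchmark; names are illustrative). The `restrict` qualifiers promise gcc that the arrays don't alias, which is exactly the guarantee I can't express in Julia:

```c
#include <stddef.h>

/* Minimal dot-product kernel. `restrict` tells the compiler the two
   arrays cannot alias, leaving the vectorizer free to load wide
   vectors and fuse multiply-adds. */
double dotprod(const double *restrict a, const double *restrict b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```

Compiled with the flags below, this is the function whose assembly uses only xmm registers for me.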
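The 8x8 kernels I'm timing are fully unrolled, but they are essentially unrolled versions of a loop nest like this (sketch only; column-major storage to match Fortran/Julia):

```c
/* 8x8 matrix multiply C = A*B, column-major storage as in
   Fortran/Julia. The benchmarked kernels are fully unrolled
   versions of this triple loop. */
enum { DIM = 8 };

void mul8x8(const double *restrict A, const double *restrict B,
            double *restrict C)
{
    for (int j = 0; j < DIM; ++j)
        for (int i = 0; i < DIM; ++i) {
            double s = 0.0;
            for (int k = 0; k < DIM; ++k)
                s += A[i + k * DIM] * B[k + j * DIM];
            C[i + j * DIM] = s;
        }
}
```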
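For reference, this is roughly how I build and inspect the output (file names are placeholders; the flags are the ones listed above):

```shell
# Build a shared library, as I do for calling the kernel from Julia:
gcc-trunk -march=native -Ofast -shared -fPIC dotprod.c -o libdotprod.so

# Or dump the assembly to stdout to see which registers get used:
gcc-trunk -march=native -Ofast -S -o - dotprod.c | grep -E 'mm[0-9]'
```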