Searching the mailing list, I found out about -mprefer-vector-width=256 (someone was asking for help trying to use zmm instead of ymm, ha ha). I just tried it and checked the assembly. It works with the intrinsic matmul! Sample:

vmulpd -120(%rsp), %ymm15, %ymm2
vmulpd -24(%rsp), %ymm15, %ymm4
vmovupd -64(%rdx), %ymm0
vmulpd 40(%rsp), %ymm15, %ymm3
vmulpd 104(%rsp), %ymm15, %ymm5
vmulpd -88(%rsp), %ymm15, %ymm1
vmulpd 8(%rsp), %ymm15, %ymm7
vmulpd 72(%rsp), %ymm15, %ymm6
vmulpd 136(%rsp), %ymm15, %ymm15
vfmadd231pd %ymm14, %ymm0, %ymm2
vfmadd231pd %ymm13, %ymm0, %ymm4
vfmadd231pd %ymm12, %ymm0, %ymm3
vfmadd231pd %ymm11, %ymm0, %ymm5
vfmadd231pd %ymm10, %ymm0, %ymm1
vfmadd231pd %ymm9, %ymm0, %ymm7

However -- and I guess this is why gcc so strongly preferred xmm -- this code is much slower. Run time is now over 170 ns, almost 3 times slower than the Julia code using ymm registers.

The prefer-vector-width option, however, had no effect on the unrolled code, which looks like this:

real(8), dimension(64), intent(in) :: A, B
real(8), dimension(64), intent(out) :: C
C(1) = A(1) * B(1) + A(9) * B(2) + A(17) * B(3) + A(25) * B(4)
C(1) = C(1) + A(33) * B(5) + A(41) * B(6) + A(49) * B(7) + A(57) * B(8)
C(2) = A(2) * B(1) + A(10) * B(2) + A(18) * B(3) + A(26) * B(4)
C(2) = C(2) + A(34) * B(5) + A(42) * B(6) + A(50) * B(7) + A(58) * B(8)
C(3) = A(3) * B(1) + A(11) * B(2) + A(19) * B(3) + A(27) * B(4)
C(3) = C(3) + A(35) * B(5) + A(43) * B(6) + A(51) * B(7) + A(59) * B(8)

Now, if gcc was doing the right (faster) thing by ignoring my wishes, why is it suddenly slower when using 256-bit registers instead of 128-bit ones? How can LLVM instead be so much faster, especially when the assembly now appears to use similar instructions? Is there anything that could close this performance gap? Comparing the two, I see a lot of instructions like vpermpd and vunpckhpd in the Fortran assembly that are absent from the Julia version.
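For concreteness (typing these from memory, so treat them as a sketch rather than the exact files): the compile line is just my earlier flags plus the new option, something like

gfortran-trunk -march=native -Ofast -mprefer-vector-width=256 -shared -fPIC ...

and the intrinsic-matmul version is a thin wrapper along these lines, where the name and the fixed 8x8 declarations are placeholders:

! minimal sketch of the "intrinsic matmul" kernel referred to above
subroutine mul8x8(A, B, C)
    real(8), dimension(8,8), intent(in)  :: A, B
    real(8), dimension(8,8), intent(out) :: C
    C = matmul(A, B)   ! C = A*B via the Fortran intrinsic
end subroutine mul8x8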
I tried sending an email with attachments to show the assembly outputs, but got a MAILER-DAEMON.

Chris

On Thu, Jun 7, 2018 at 10:34 AM, Chris Elrod <elrodc@xxxxxxxxx> wrote:
> Hi everyone,
>
> I mostly use Julia (an LLVM language), but have been toying around a
> little with gcc for numerical code, because
> 1) Fun/learning
> 2) Julia's JIT is basically a lazy static compiler, but the statically
> compiled code is not saved between sessions (i.e., exiting and restarting
> a Julia REPL). If some code takes 10 seconds to compile, you may prefer
> to do that only once.
> 3) When using pointers or references in Julia, it will refuse to use
> vector instructions, and numerical code slows down dramatically. There is
> no "restrict". (My hack, for when I can guarantee that arguments don't
> actually alias, is to use the function "code_llvm", which returns the
> LLVM code generated for a given set of input types, pass it types that
> cannot alias, and then use llvmcall to call that highly vectorized code
> for types that theoretically can alias.)
>
> Anyway, trying with (my results are similar with gcc-8):
>
> $ gcc-trunk -v
> Using built-in specs.
> COLLECT_GCC=gcc-trunk
> COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/9.0.0/lto-wrapper
> Target: x86_64-pc-linux-gnu
> Configured with: ../gcc-trunk/configure --program-suffix=-trunk :
> (reconfigured) ../gcc-trunk/configure --program-suffix=-trunk
> --enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion
> Thread model: posix
> gcc version 9.0.0 20180524 (experimental) (GCC)
>
> For a simple dot product function, gcc gives me output like:
>
> vfmadd231pd %xmm7, %xmm3, %xmm1
> vmulpd %xmm10, %xmm8, %xmm3
> vfmadd231pd %xmm9, %xmm4, %xmm3
> vaddpd %xmm1, %xmm3, %xmm1
> vaddpd %xmm1, %xmm0, %xmm0
>
> while with Julia + LLVM (3.9), the assembly looks more like:
>
> vmovupd -96(%rdx), %ymm5
> vmovupd -64(%rdx), %ymm6
> vmovupd -32(%rdx), %ymm7
> vmovupd (%rdx), %ymm8
> vmulpd -96(%rdi), %ymm5, %ymm5
> vmulpd -64(%rdi), %ymm6, %ymm6
> vmulpd -32(%rdi), %ymm7, %ymm7
> vmulpd (%rdi), %ymm8, %ymm8
> vaddpd %ymm5, %ymm0, %ymm0
> vaddpd %ymm6, %ymm2, %ymm2
> vaddpd %ymm7, %ymm3, %ymm3
> vaddpd %ymm8, %ymm4, %ymm4
>
> I also tested unrolled matrix operations, since I wanted to try creating
> some kernels for a linear algebra library. For an 8x8 matrix
> multiplication, the unrolled expressions take Fortran about 120 ns, while
> the builtin matmul comes in at 90 ns, and the assembly looks much cleaner
> -- but it still uses only xmm registers.
>
> Julia/LLVM (6.0), however, happily uses chiefly ymm registers:
>
> vmovupd %ymm9, -128(%rsp)
> vmovupd -64(%rsp), %ymm9
> vfmadd231pd %ymm0, %ymm2, %ymm14
> vfmadd231pd %ymm15, %ymm2, %ymm5
>
> with LLVM IR:
>
> ; Function add_fast; {
> ; Location: fastmath.jl:163
>   %403 = fadd fast <4 x double> %402, %400
>   %404 = fadd fast <4 x double> %403, %401
> ;}}}
> ; Function mul_fast; {
> ; Location: fastmath.jl:165
>   %405 = fmul fast <4 x double> %383, %30
>   %406 = fmul fast <4 x double> %386, %34
>   %407 = fmul fast <4 x double> %389, %38
>   %408 = fmul fast <4 x double> %392, %42
> ;}
>
> and it clocks in at about 58 ns median on my machine -- again, a
> substantial improvement.
>
> I'm happy to share any code (add as attachments?).
> Is there any way I can encourage gcc to use AVX instructions / ymm
> registers?
>
> I'd been using:
> -march=native -Ofast -shared -fPIC
>
> and tried adding `-fvect-cost-model=unlimited` as well as a mix of other
> random options in hopes of encouraging it to produce faster code.
> Any ideas, suggestions, or something obvious I missed?
>
> Or any reason why gcc prefers not to generate AVX instructions, even when
> -march=native is given? For reference, native == znver1.
>
> Thanks,
> Chris
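P.S. To make the dot product example in the quoted message above concrete without attachments, here is a minimal sketch of that kind of function (the name and declarations are placeholders, not the exact routine I benchmarked):

function dotprod(a, b, n) result(s)
    ! straightforward accumulation loop over two length-n vectors
    integer, intent(in)               :: n
    real(8), dimension(n), intent(in) :: a, b
    real(8)                           :: s
    integer                           :: i
    s = 0.0d0
    do i = 1, n
        s = s + a(i) * b(i)
    end do
end function dotprod

With -Ofast the floating-point reduction may be reassociated, so the loop can be vectorized; the question is only how wide the vector registers end up being.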