Searching the mailing list, I found out about -mprefer-vector-width=256 (someone was asking for help trying to use zmm instead of ymm, ha ha). I just tried it and checked the assembly. It works with the intrinsic matmul! Sample:

vmulpd -120(%rsp), %ymm15, %ymm2
vmulpd -24(%rsp), %ymm15, %ymm4
vmovupd -64(%rdx), %ymm0
vmulpd 40(%rsp), %ymm15, %ymm3
vmulpd 104(%rsp), %ymm15, %ymm5
vmulpd -88(%rsp), %ymm15, %ymm1
vmulpd 8(%rsp), %ymm15, %ymm7
vmulpd 72(%rsp), %ymm15, %ymm6
vmulpd 136(%rsp), %ymm15, %ymm15
vfmadd231pd %ymm14, %ymm0, %ymm2
vfmadd231pd %ymm13, %ymm0, %ymm4
vfmadd231pd %ymm12, %ymm0, %ymm3
vfmadd231pd %ymm11, %ymm0, %ymm5
vfmadd231pd %ymm10, %ymm0, %ymm1
vfmadd231pd %ymm9, %ymm0, %ymm7

However -- and I guess this is why gcc so strongly preferred xmm -- this code is much slower. Run time is now over 170 ns, almost 3 times slower than the Julia code using ymm registers.

The prefer-vector-width option, however, had no effect on the unrolled code, which looks like this:

real(8), dimension(64), intent(in) :: A, B
real(8), dimension(64), intent(out) :: C
C(1) = A(1) * B(1) + A(9) * B(2) + A(17) * B(3) + A(25) * B(4)
C(1) = C(1) + A(33) * B(5) + A(41) * B(6) + A(49) * B(7) + A(57) * B(8)
C(2) = A(2) * B(1) + A(10) * B(2) + A(18) * B(3) + A(26) * B(4)
C(2) = C(2) + A(34) * B(5) + A(42) * B(6) + A(50) * B(7) + A(58) * B(8)
C(3) = A(3) * B(1) + A(11) * B(2) + A(19) * B(3) + A(27) * B(4)
C(3) = C(3) + A(35) * B(5) + A(43) * B(6) + A(51) * B(7) + A(59) * B(8)

Now, if gcc was doing the right (faster) thing by ignoring my wishes, why is it suddenly slower when using 256-bit registers instead of 128-bit ones? How can LLVM instead be so much faster, especially when the assembly now appears to use similar instructions? Is there anything that could close this performance gap? Comparing the two, I see a lot of instructions like vpermpd and vunpckhpd in the Fortran assembly that are absent from the Julia version.
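For concreteness (typing these from memory, so treat them as a sketch rather than the exact files): the compile line is just my earlier flags plus the new option, something like

gfortran-trunk -march=native -Ofast -mprefer-vector-width=256 -shared -fPIC ...

and the intrinsic-matmul version is a thin wrapper along these lines, where the name and the fixed 8x8 declarations are placeholders:

! minimal sketch of the "intrinsic matmul" kernel referred to above
subroutine mul8x8(A, B, C)
    real(8), dimension(8,8), intent(in)  :: A, B
    real(8), dimension(8,8), intent(out) :: C
    C = matmul(A, B)   ! C = A*B via the Fortran intrinsic
end subroutine mul8x8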
I tried sending an email with attachments to show the assembly outputs, but got a MAILER-DAEMON.

Chris

On Thu, Jun 7, 2018 at 10:34 AM, Chris Elrod <elrodc@xxxxxxxxx> wrote:
> Hi everyone,
>
> I mostly use Julia (an LLVM language), but have been toying around a
> little with gcc for numerical code, because
> 1) Fun/learning
> 2) Julia's JIT is basically a lazy static compiler, but the statically
> compiled code is not saved between sessions (i.e., exiting and restarting
> a Julia REPL). If some code takes 10 seconds to compile, you may prefer
> to do that only once.
> 3) When using pointers or references in Julia, it will refuse to use
> vector instructions, and numerical code slows down dramatically. There is
> no "restrict". (My hack, for when I can guarantee that arguments don't
> actually alias, is to use the function "code_llvm", which returns the
> LLVM code generated for a given set of input types, pass it types that
> cannot alias, and then use llvmcall to call that highly vectorized code
> for types that theoretically can alias.)
>
> Anyway, trying with (my results are similar with gcc-8):
>
> $ gcc-trunk -v
> Using built-in specs.
> COLLECT_GCC=gcc-trunk
> COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/9.0.0/lto-wrapper
> Target: x86_64-pc-linux-gnu
> Configured with: ../gcc-trunk/configure --program-suffix=-trunk :
> (reconfigured) ../gcc-trunk/configure --program-suffix=-trunk
> --enable-languages=c,c++,fortran,lto,objc --no-create --no-recursion
> Thread model: posix
> gcc version 9.0.0 20180524 (experimental) (GCC)
>
> For a simple dot product function, gcc gives me output like:
>
> vfmadd231pd %xmm7, %xmm3, %xmm1
> vmulpd %xmm10, %xmm8, %xmm3
> vfmadd231pd %xmm9, %xmm4, %xmm3
> vaddpd %xmm1, %xmm3, %xmm1
> vaddpd %xmm1, %xmm0, %xmm0
>
> while with Julia + LLVM (3.9), the assembly looks more like:
>
> vmovupd -96(%rdx), %ymm5
> vmovupd -64(%rdx), %ymm6
> vmovupd -32(%rdx), %ymm7
> vmovupd (%rdx), %ymm8
> vmulpd -96(%rdi), %ymm5, %ymm5
> vmulpd -64(%rdi), %ymm6, %ymm6
> vmulpd -32(%rdi), %ymm7, %ymm7
> vmulpd (%rdi), %ymm8, %ymm8
> vaddpd %ymm5, %ymm0, %ymm0
> vaddpd %ymm6, %ymm2, %ymm2
> vaddpd %ymm7, %ymm3, %ymm3
> vaddpd %ymm8, %ymm4, %ymm4
>
> I also tested unrolled matrix operations, since I wanted to try creating
> some kernels for a linear algebra library. For an 8x8 matrix
> multiplication, the unrolled expressions take Fortran about 120 ns, while
> the builtin matmul comes in at 90 ns, and the assembly looks much cleaner
> -- but it still uses only xmm registers.
>
> Julia/LLVM (6.0), however, happily uses chiefly ymm registers:
>
> vmovupd %ymm9, -128(%rsp)
> vmovupd -64(%rsp), %ymm9
> vfmadd231pd %ymm0, %ymm2, %ymm14
> vfmadd231pd %ymm15, %ymm2, %ymm5
>
> with LLVM IR:
>
> ; Function add_fast; {
> ; Location: fastmath.jl:163
>   %403 = fadd fast <4 x double> %402, %400
>   %404 = fadd fast <4 x double> %403, %401
> ;}}}
> ; Function mul_fast; {
> ; Location: fastmath.jl:165
>   %405 = fmul fast <4 x double> %383, %30
>   %406 = fmul fast <4 x double> %386, %34
>   %407 = fmul fast <4 x double> %389, %38
>   %408 = fmul fast <4 x double> %392, %42
> ;}
>
> and it clocks in at about 58 ns median on my machine -- again, a
> substantial improvement.
>
> I'm happy to share any code (add as attachments?).
> Is there any way I can encourage gcc to use AVX instructions / ymm
> registers?
>
> I'd been using:
> -march=native -Ofast -shared -fPIC
>
> and tried adding `-fvect-cost-model=unlimited` as well as a mix of other
> random options in hopes of encouraging it to produce faster code.
> Any ideas, suggestions, or something obvious I missed?
>
> Or any reason why gcc prefers not to generate AVX instructions, even when
> -march=native is given? For reference, native == znver1.
>
> Thanks,
> Chris
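P.S. To make the dot product example in the quoted message above concrete without attachments, here is a minimal sketch of that kind of function (the name and declarations are placeholders, not the exact routine I benchmarked):

function dotprod(a, b, n) result(s)
    ! straightforward accumulation loop over two length-n vectors
    integer, intent(in)               :: n
    real(8), dimension(n), intent(in) :: a, b
    real(8)                           :: s
    integer                           :: i
    s = 0.0d0
    do i = 1, n
        s = s + a(i) * b(i)
    end do
end function dotprod

With -Ofast the floating-point reduction may be reassociated, so the loop can be vectorized; the question is only how wide the vector registers end up being.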