Just for reference: for the inner loop in "foo<Tvsimple, 2>" I get the
following assembler output:

.L4:
        vbroadcastsd (%rax), %ymm2
        addq $2, %rcx
        addq $32, %rax
        addq $16, %r8
        vmovapd 200(%rsp), %ymm15
        vbroadcastsd -8(%r8), %ymm11
        vfmadd231pd %ymm2, %ymm1, %ymm4
        vfmadd231pd %ymm2, %ymm0, %ymm3
        vbroadcastsd -24(%rax), %ymm2
        vfmadd231pd %ymm2, %ymm1, %ymm6
        vfmadd231pd %ymm2, %ymm0, %ymm5
        vbroadcastsd -16(%rax), %ymm2
        vfmadd231pd %ymm1, %ymm2, %ymm8
        vfmadd231pd %ymm0, %ymm2, %ymm7
        vbroadcastsd -8(%rax), %ymm2
        vmovapd %ymm4, -120(%rsp)
        vfmadd231pd %ymm1, %ymm2, %ymm10
        vfmadd231pd %ymm0, %ymm2, %ymm9
        vbroadcastsd -16(%r8), %ymm2
        vmovapd %ymm3, -88(%rsp)
        vmovapd %ymm6, -56(%rsp)
        vmovapd %ymm2, %ymm14
        vfmadd132pd %ymm12, %ymm11, %ymm2
        vmovapd %ymm5, -24(%rsp)
        vfmadd132pd %ymm13, %ymm11, %ymm14
        vmovapd %ymm8, 8(%rsp)
        vfmadd213pd 168(%rsp), %ymm2, %ymm0
        vmovapd 232(%rsp), %ymm2
        vfmadd213pd 136(%rsp), %ymm14, %ymm1
        vmovapd %ymm7, 40(%rsp)
        vmovapd %ymm10, 72(%rsp)
        vmovapd %ymm9, 104(%rsp)
        vmovapd %ymm15, 136(%rsp)
        vmovapd %ymm2, 168(%rsp)
        vmovapd %ymm1, 200(%rsp)
        vmovapd %ymm0, 232(%rsp)
        cmpq %rcx, %r9
        jnb .L4

For "foo<__m256d,2>" I get:

.L12:
        vmovapd %ymm2, %ymm0
        vmovapd %ymm3, %ymm1
.L11:
        vmovapd %ymm15, %ymm3
        addq $2, %rcx
        addq $32, %rax
        addq $16, %r8
        vbroadcastsd -32(%rax), %ymm2
        vbroadcastsd -8(%r8), %ymm4
        vfmadd231pd %ymm2, %ymm1, %ymm12
        vfmadd231pd %ymm2, %ymm0, %ymm8
        vbroadcastsd -24(%rax), %ymm2
        vfmadd231pd %ymm2, %ymm1, %ymm10
        vfmadd231pd %ymm2, %ymm0, %ymm6
        vbroadcastsd -16(%rax), %ymm2
        vfmadd231pd %ymm2, %ymm1, %ymm11
        vfmadd231pd %ymm2, %ymm0, %ymm7
        vbroadcastsd -8(%rax), %ymm2
        vfmadd231pd %ymm2, %ymm1, %ymm9
        vfmadd231pd %ymm2, %ymm0, %ymm5
        vbroadcastsd -16(%r8), %ymm2
        vfmadd132pd %ymm2, %ymm4, %ymm3
        vfmadd132pd -120(%rsp), %ymm4, %ymm2
        vfmadd132pd %ymm1, %ymm14, %ymm3
        vfmadd132pd %ymm0, %ymm13, %ymm2
        vmovapd %ymm1, %ymm14
        vmovapd %ymm0, %ymm13
        cmpq %rcx, %r9
        jnb .L12

And the assembler generated by clang++ is (in both cases, except for
minimal differences):

.LBB1_2:                        # =>This Inner Loop Header: Depth=1
        vbroadcastsd -16(%rdx,%rsi,2), %ymm14
        vfmadd231pd %ymm10, %ymm14, %ymm4
        vfmadd231pd %ymm14, %ymm11, %ymm0
        vbroadcastsd -8(%rdx,%rsi,2), %ymm14
        vfmadd231pd %ymm10, %ymm14, %ymm6
        vfmadd231pd %ymm14, %ymm11, %ymm2
        vbroadcastsd (%rdx,%rsi,2), %ymm14
        vfmadd231pd %ymm10, %ymm14, %ymm5
        vfmadd231pd %ymm14, %ymm11, %ymm1
        vbroadcastsd 8(%rdx,%rsi,2), %ymm14
        vfmadd231pd %ymm10, %ymm14, %ymm7
        vfmadd231pd %ymm14, %ymm11, %ymm3
        vbroadcastsd -8(%r8,%rsi), %ymm14
        vbroadcastsd (%r8,%rsi), %ymm15
        vmovapd %ymm9, %ymm8
        vfmadd213pd %ymm15, %ymm14, %ymm8
        vfmadd213pd %ymm13, %ymm10, %ymm8
        vfmadd132pd -56(%rsp), %ymm15, %ymm14
        vfmadd213pd %ymm12, %ymm11, %ymm14
        addq $2, %rcx
        addq $16, %rsi
        vmovapd %ymm11, %ymm12
        vmovapd %ymm10, %ymm13
        vmovapd %ymm14, %ymm11
        vmovapd %ymm8, %ymm10
        cmpq %r9, %rcx
        jbe .LBB1_2

On 3/22/21 3:34 PM, Martin Reinecke wrote:
> Hi,
>
> the attached test case is the (slightly simplified) hot loop from a
> library for spherical harmonic transforms.
> This code uses explicit vectorization, and I try to use simple wrapper
> classes around the primitive vector types (like __m256d) to simplify
> operations like initialization with a scalar etc.
>
> However it seems that using the wrapper type inside the critical loop
> causes g++ to produce sub-optimal code. This can be seen by running
>
> g++ -mfma -O3 -std=c++17 -ffast-math -S testcase.cc
>
> and inspecting the generated assembler code (I'm using gcc 10.2.1).
> The version where I use the wrapper type even in the hot loop (i.e.
> "foo<Tvsimple, 2>") has a few unnecessary "vmovapd" instructions before
> the end of the loop body, which are missing in the version where I cast
> to __m256d before doing the heavy computation (i.e. "foo<__m256d,2>").
>
> My suspicion is that the "Tvsimple" type is somehow not completely POD
> and that this prohibits g++ from optimizing more aggressively. On the
> other hand, clang++ produces identical code for both versions, which is
> comparable in speed with the faster version generated by g++.
>
> Is g++ missing an opportunity to optimize here? If so, is there a way to
> alter the "Tvsimple" class so that it doesn't stop g++ from optimizing?
>
> Thanks,
> Martin
>
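
One possible way to test the "not completely POD" suspicion from the quoted message is to let the compiler check the wrapper's type properties directly. The following is only a minimal sketch under stated assumptions: "VecWrap" and "v4d" are hypothetical names, not the "Tvsimple" class from testcase.cc (which is not shown here), and a GCC/Clang vector-extension type stands in for __m256d so that the example compiles without -mavx.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Assumption: stand-in for __m256d (four doubles) using the GCC/Clang
// vector extension, so no <immintrin.h> or -mavx is required.
typedef double v4d __attribute__((vector_size(32)));

// Hypothetical wrapper for illustration; NOT the Tvsimple from testcase.cc.
struct VecWrap {
  v4d v;

  VecWrap() = default;  // defaulted, so the default constructor stays trivial
  VecWrap(double s) {   // broadcast one scalar to all four lanes
    v4d t = {s, s, s, s};
    v = t;
  }
  VecWrap(const v4d &x) : v(x) {}

  VecWrap &operator+=(const VecWrap &o) { v += o.v; return *this; }
  VecWrap operator*(const VecWrap &o) const { return VecWrap(v * o.v); }
  double operator[](std::size_t i) const { return v[i]; }
};

// Compile-time checks of the properties usually meant by "POD".
static_assert(std::is_trivial<VecWrap>::value, "wrapper is not trivial");
static_assert(std::is_standard_layout<VecWrap>::value,
              "wrapper is not standard-layout");
static_assert(sizeof(VecWrap) == sizeof(v4d), "wrapper adds padding");

// Small runtime check: acc += a * b on every lane.
inline void self_test() {
  VecWrap acc(1.0);
  acc += VecWrap(2.0) * VecWrap(3.0);
  for (std::size_t i = 0; i < 4; ++i) assert(acc[i] == 7.0);
}
```

If the same static_asserts also hold for the real "Tvsimple", that would make the POD explanation look less likely, though this has not been checked against testcase.cc. On x86-64, g++ may print a -Wpsabi note about 32-byte vectors when building without -mavx; this is a warning only.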