[AArch64] Missed vectorization opportunity in cactusADM

"Ekanathan, Saravanan" <Saravanan.Ekanathan@xxxxxxx> · Thu, 26 Mar 2015 09:35:47 +0000

(I am not sure whether I can send this to gcc-patches/gcc-bugs, as I have neither a patch nor a small reproducible testcase, so I am sending it to gcc-help.)
Hi,
This looks like a missed vectorization opportunity in one of the Fortran hot loops in cactusADM (CPU2006 benchmark) when compiled with "-mcpu=cortex-a57 -Ofast".
Interestingly, the 'generic' model (compiled with plain "-Ofast" or "-O3" and no -mcpu option) does vectorize this hot loop, and it gives a good runtime improvement on a native AArch64 platform.

I don't have a small reproducible testcase, so I am referring to the cactusADM benchmark itself here.
The hot loop is in Bench_StaggeredLeapfrog2() in the file StaggeredLeapfrog2.F.
For cortex-a57, the vectorization report clearly states that scalar_cost < vector_cost / vectorization_factor, so the loop is not vectorized.
For the generic case, the vector cost model is untuned (scalar_cost = vector_cost), so scalar_cost > vector_cost / vectorization_factor and the loop gets vectorized.
   << Output of  generic vectorized case>>   StaggeredLeapfrog2.fppized.f.130t.vect:StaggeredLeapfrog2.fppized.f:362:0: note: LOOP VECTORIZED
I have also experimented with the cortexa57_vector_cost table (especially scalar_stmt_cost, vector_stmt_cost, vec_unaligned_cost, etc.), which influences the vectorization decision in this case.
The cortexa57_vector_cost table maps directly to the costs given in the "Cortex-A57 Software Optimisation Guide".
However, there seems to be further scope for tuning the cortexa57 vector costs so that such cases are vectorized.
Any comments on this missed opportunity?
Regards,
Saravanan
PS. I am not pasting the hot loop here, as there could be a licensing issue with quoting SPEC CPU2006 sources.