Is accessing a floating-point register slower than accessing cache memory?

Hello.

I'm working with a Cortex-A9-based platform which has 4 ARMv8 cores.
I ran Geekbench and found that the SHA1 and SHA2 tests have very poor performance.

I disassembled the SHA1 and SHA2 test code (sha1.o and sha2.o) and
found that they use floating-point registers frequently, even though SHA only does integer operations,
like the following:
(The compiler is gcc-linaro-aarch64-linux-gnu-4.9-2014.08_linux.tar.bz2 from http://releases.linaro.org/14.08/components/toolchain/binaries.)

   fmov    s21, w5         // save GPR w5 into FP reg s21
   fmov    s20, w9         // save GPR w9 into FP reg s20
   add     w9, w6, w4
   fmov    w4, s0          // restore a value from FP reg s0 into GPR w4
   ror     w5, w15, 25
   ror     w19, w15, 11
   fmov    s0, w5          // save GPR w5 into FP reg s0
   eor     w19, w19, w4
   fmov    w5, s20         // restore from FP reg s20
   fmov    w4, s21         // restore from FP reg s21
   add     w14, w9, w14
   fmov    w9, s0          // restore from FP reg s0
I think the FP registers are being used to back up (spill) general-purpose registers.
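
For illustration, here is a rough C sketch (not the actual Geekbench source; the
function and variable names are made up) of the kind of integer-only rotate/xor/add
work a SHA round does. A function this small will probably not trigger the fmov
spilling by itself; in the real benchmark many more 32-bit values are live at once,
and the compiler apparently parks some of them in FP registers via fmov rather than
spilling them to the stack:

   #include <stdint.h>

   /* rotate right by n (0 < n < 32) */
   static inline uint32_t rotr32(uint32_t x, unsigned n)
   {
       return (x >> n) | (x << (32 - n));
   }

   /* Hypothetical SHA-256-like round step, for illustration only:
      the same flavour of ror/eor/add operations as in the dump above. */
   uint32_t round_like(uint32_t a, uint32_t b, uint32_t c,
                       uint32_t d, uint32_t e, uint32_t w)
   {
       uint32_t s1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
       uint32_t ch = (e & b) ^ (~e & c);
       return a + s1 + ch + d + w;
   }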

I'd read an article, http://www.informit.com/articles/article.aspx?p=1620207&seqNum=4, so I guessed that the poor performance
might be caused by accessing the floating-point registers.
I added the -mgeneral-regs-only option and got much better performance, almost 200% better.
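
To compare the generated code with and without the option, something like the
following can be used (the source file name and -O2 level here are just
placeholders, not the exact Geekbench build flags):

   aarch64-linux-gnu-gcc -O2 -S sha1.c -o sha1_fpregs.s
   aarch64-linux-gnu-gcc -O2 -mgeneral-regs-only -S sha1.c -o sha1_gpronly.s
   grep -c fmov sha1_fpregs.s sha1_gpronly.s

With -mgeneral-regs-only the compiler is not allowed to use the FP/SIMD registers
at all, so the fmov moves disappear (possibly at the cost of more stack spills).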

I'm wondering: can accessing a floating-point register really be slower than accessing cache or DDR memory?
If so, why does GCC generate code that uses floating-point registers for non-floating-point calculations?



