I'm having a heck of a time getting GCC to perform a lane-to-lane register transfer among D registers. I have the following C code:

    #define set_high_from_high(d, m) \
        d=vsetq_lane_u64(vgetq_lane_u64(m,LANE_H64),d,LANE_H64);

    uint64x2_t x, m;
    ...
    set_high_from_high(x, m);

GCC is generating something like:

    mov v1.2d[0], x0
    mov x0, v2.2d[0]

Instead of:

    mov v1.2d[0], v2.2d[0]

I've abandoned inline functions in favor of defines. I've also tried with and without the `d=` in the define.

How do I instruct GCC to perform the NEON-to-NEON lane transfer?

*****

I know it can be done because Clang is doing it. GCC is lagging behind Clang by about 4 cycles per byte. Here are some relative line counts from the disassembly of the same function:

GCC at -O3:

    $ gdb -batch -ex 'disassemble BLAKE2_NEON_Compress64' ./blake2.o | wc -l
    2021

Clang at -O3:

    $ gdb -batch -ex 'disassemble BLAKE2_NEON_Compress64' ./blake2.o | wc -l
    445
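For anyone who wants to reproduce this, here is a minimal sketch of a standalone test case. It assumes `LANE_H64` is 1 (the post does not show its definition), and the function names and the `make_u64x2` helper are hypothetical. The second function uses the ACLE `vcopyq_laneq_u64` intrinsic, which is specified for AArch64 as a lane-to-lane copy. Whether a particular GCC release provides it and emits a single instruction for it is something to verify, not something this sketch claims. The `#else` branch contains scalar stand-ins so the file also builds on other targets; they model the lane semantics only.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
/* Build a vector from two 64-bit halves (lane 0 = lo, lane 1 = hi). */
static inline uint64x2_t make_u64x2(uint64_t lo, uint64_t hi)
{
    return vcombine_u64(vcreate_u64(lo), vcreate_u64(hi));
}
#else
/* Hypothetical scalar stand-ins so the sketch compiles off AArch64.
   They model lane semantics only and say nothing about code generation. */
typedef struct { uint64_t lane[2]; } uint64x2_t;
static inline uint64x2_t make_u64x2(uint64_t lo, uint64_t hi)
{
    uint64x2_t r = { { lo, hi } };
    return r;
}
static inline uint64_t vgetq_lane_u64(uint64x2_t v, int i)
{
    return v.lane[i];
}
static inline uint64x2_t vsetq_lane_u64(uint64_t x, uint64x2_t v, int i)
{
    v.lane[i] = x;
    return v;
}
static inline uint64x2_t vcopyq_laneq_u64(uint64x2_t a, int i,
                                          uint64x2_t b, int j)
{
    a.lane[i] = b.lane[j];
    return a;
}
#endif

#define LANE_H64 1 /* assumption: index of the high 64-bit lane */

/* Same get/set pairing as the macro in the question. */
uint64x2_t set_high_get_set(uint64x2_t d, uint64x2_t m)
{
    return vsetq_lane_u64(vgetq_lane_u64(m, LANE_H64), d, LANE_H64);
}

/* Variant using the ACLE lane-copy intrinsic. */
uint64x2_t set_high_copy(uint64x2_t d, uint64x2_t m)
{
    return vcopyq_laneq_u64(d, LANE_H64, m, LANE_H64);
}
```

Compiling this with `gcc -O3 -S` for an AArch64 target and comparing the bodies of the two functions should show whether either form produces a direct move between vector lanes or a round trip through a general-purpose register.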