> As has been mentioned on this thread already,
> http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
> is a list of the intrinsics and how they map down to NEON instructions,
> though it's more of a reference than a user guide.
>
> If you can isolate a standalone example where GCC NEON intrinsics perform
> poorly, can you please file a bug report with the testcase?

I hope to get something together shortly. Here's one of the pain points:

    int64x2_t c = vcombine_s64(vget_high_s64(a), vget_low_s64(b));

I'm testing alternatives at the moment... It looks like lane extraction
and insertion produce better code under GCC. It seems to limit GCC's
desire to spill out into R registers.

> As an aside, I notice your command-line options are sub-optimal.
> If you're targeting a Cortex-A7 you want to use -mfpu=neon-vfpv4 rather
> than just -mfpu=neon.
> This will give you access to the vfma instructions.
> Whereas if you're targeting ARMv8-A on a Cortex-A53 you'll want to use
> -mfpu=neon-fp-armv8
> to enable the ARMv8 floating-point and NEON instructions.

Thanks, this is the sort of thing I was looking for: higher-level
prescriptions.

I'm also looking for something on creating new vectors on the fly from
scattered data. vcombine_s64 is a pain point under this data set, and the
suggestions here don't apply:
https://community.arm.com/groups/processors/blog/2012/03/13/coding-for-neon--part-5-rearranging-vectors.

Jeff
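
For reference, here is a minimal sketch of the two formulations mentioned above, side by side. Both build c = { a[1], b[0] }. The function names (high_low_combine, high_low_lanes, high_low) are hypothetical and only for illustration, and the scalar fallback is an assumption added so the file also compiles on non-NEON targets. Which variant generates better code likely depends on the GCC version and options, so treat this as a starting point for a testcase rather than a recommendation.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Formulation from the message: take the high half of a and the low half
 * of b, then combine them into one 128-bit vector: { a[1], b[0] }. */
static inline int64x2_t high_low_combine(int64x2_t a, int64x2_t b)
{
    return vcombine_s64(vget_high_s64(a), vget_low_s64(b));
}

/* Lane-based alternative: extract and insert individual 64-bit lanes. */
static inline int64x2_t high_low_lanes(int64x2_t a, int64x2_t b)
{
    int64x2_t c = vdupq_n_s64(vgetq_lane_s64(a, 1));    /* { a[1], a[1] } */
    return vsetq_lane_s64(vgetq_lane_s64(b, 0), c, 1);  /* { a[1], b[0] } */
}
#endif

/* Hypothetical wrapper: writes { a[1], b[0] } to out.
 * use_lanes selects the lane-based variant on NEON targets. */
void high_low(const int64_t a[2], const int64_t b[2], int64_t out[2],
              int use_lanes)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int64x2_t va = vld1q_s64(a);
    int64x2_t vb = vld1q_s64(b);
    vst1q_s64(out, use_lanes ? high_low_lanes(va, vb)
                             : high_low_combine(va, vb));
#else
    /* Scalar fallback (assumption): same result without NEON. */
    (void)use_lanes;
    out[0] = a[1];
    out[1] = b[0];
#endif
}
```

Compiling each variant with -O2 -S and the -mfpu setting for the target, then comparing the generated assembly, should show whether values are being moved through core registers.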