Thanks for the example code snippet.

As far as the intrinsics go, I looked a little deeper. Apparently, gcc
has some bugs with producing optimal NEON assembly per the link below.
Some of these bugs have been resolved, others not. I guess for now
I'll stick with assembly until I know these have been resolved.
Although I'm not sure how the developers of OpenCV would feel about
inline assembly in order to merge a pull request. If it won't work for
them, I'll just switch to intrinsics.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562

On Fri, Feb 21, 2014 at 5:14 AM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
> On 20/02/14 20:39, Cody Rigney wrote:
>> Thanks for the advice. I didn't realize before that volatile was
>> actually hiding the problem.
>>
>> Do you mind providing an example of what you mean by using a "static
>> inline" function? That sounds like a better way of managing the
>> assembly. I know what you mean, but I would like to see an example of
>> the details (like passing parameters, etc).
>
> Suppose you have an assembly instruction "foo <dest> <src>". You would
> write something like this:
>
> static inline uint32_t foo(uint32_t x) {
>     uint32_t y;
>     asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
>     return y;
> }
>
> Then you would use it in code as "y = foo(x);" and the compiler would
> put in the single assembly line (plus any code needed to put x into a
> register - let the compiler handle that sort of thing).
>
> This lets you keep the messy inline assembly stuff separate from the
> algorithm code that actually uses it.
>
>
>>
>> Initially, I began writing the NEON acceleration in intrinsics.
>> Then, I read more and more about NEON intrinsics being much slower
>> when compiled with gcc, due to some stack pops and pushes that fill it
>> up. Apparently, the Microsoft ARM compiler and Apple's ARM compiler
>> do well with NEON intrinsics, but GCC does not. So I switched to
>> inline assembly.
>> I haven't actually tested this myself, but since
>> OpenCV is cross-platform, I wanted to make the acceleration work
>> cross-platform in the fastest way.
>
> Don't believe random stuff you read on the internet about compiler
> speeds - test it yourself. One key reason is that the internet never
> forgets, and information is seldom dated - perhaps the intrinsics /were/
> slow when first introduced in gcc 4.3 (or whatever), but they could be
> much faster with 4.8. The other issue is that you have to have
> optimisation enabled (sometimes even -O3 or extra specific optimisation
> flags, and often -ffast-math) to get the kind of scheduling, loop
> unrolling, and other optimisations needed to get the best out of NEON.
> The internet is full of people compiling without optimisation and then
> complaining about the slow code. It is even conceivable that the latest
> gcc advances in auto-vectorisation can generate good enough neon code
> without using intrinsics or inline assembly (I don't know if the
> auto-vectorisation stuff supports ARM/NEON yet).
>
> So make /small/ test cases that let you see exactly what is happening.
> Use the intrinsics, pull things out into small and clear functions (if
> they are "static" then the compiler will be able to inline them) so that
> you can separate your logic from the low-level mechanics, and examine
> the generated assembly for the critical parts. Only go for inline
> assembly if it will really make a difference.
>
> mvh.,
>
> David
>
>
>>
>> Thanks,
>>
>> Cody
>>
>> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I haven't read through the code at all, but I will give you a little
>>> general advice.
>>>
>>> Try to cut the code to the absolute minimum that shows the problem. It
>>> makes it easier for you to work with and check, and it makes it easier
>>> for other people to examine.
>>> Also make sure that the code has no other
>>> dependencies such as extra headers - ideally people should be able to
>>> compile the code themselves and test it (I realise this is difficult for
>>> those who don't have an ARM handy).
>>>
>>> Code that works without optimisation but fails with optimisation, or
>>> that works when you make a variable volatile, is always a bug.
>>> Occasionally, it is a bug in the compiler - but most often it is a bug
>>> in the code. Either way, it is important to figure out the root cause,
>>> and not try to hide it by making things volatile (though that might be a
>>> good temporary fix for a compiler bug).
>>>
>>> I am not familiar with Neon (and not as good as I should be at ARM
>>> assembly in general), but it looks to me that you have used specific
>>> registers in your inline assembly, and assumed specific registers for
>>> compiler use (such as variables). Don't do that. When you have turned
>>> off all optimisation, the compiler is consistent about which registers
>>> it uses for different purposes - when optimising, it changes register
>>> usage in a very unpredictable way. You must be explicit - all data
>>> going into your assembly must be declared, as must all data coming out
>>> of the assembly. And if you use specific registers, you need to tell
>>> the compiler about them (as "clobbers") - and be aware that the compiler
>>> might be using those registers for the input or output values.
>>>
>>> Getting inline assembly right is not easy, and it is often best to work
>>> with several small assembly statements rather than large ones - I
>>> usually make a "static inline" function around a line or two of inline
>>> assembly and then use that function in the code as needed. It can make
>>> the result a lot clearer, and makes it easier to mix the C and assembly
>>> - the end result is often better than I would make in pure assembly.
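
A minimal sketch of the two points above: declare every value going in
and coming out as an operand, and list any hard-coded register as a
clobber. The helper names add_u32 and add_via_r3 are made up for
illustration. The asm path assumes 32-bit ARM (ARM or Thumb-2 mode);
on any other target the functions fall back to plain C so the file
still builds.

```c
#include <stdint.h>

/* Hypothetical helper: one instruction, no fixed registers.
 * The compiler picks registers for sum, a and b itself. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__arm__)
    uint32_t sum;
    __asm__ ("add %[sum], %[a], %[b]"
             : [sum] "=r" (sum)          /* data coming out */
             : [a] "r" (a), [b] "r" (b)  /* data going in   */);
    return sum;
#else
    return a + b;  /* fallback so non-ARM builds still work */
#endif
}

/* Hypothetical helper: same result, but the asm uses r3 as a scratch
 * register. Naming r3 in the clobber list tells the compiler not to
 * keep a, b, sum, or anything else it needs in r3 across the asm. */
static inline uint32_t add_via_r3(uint32_t a, uint32_t b)
{
#if defined(__arm__)
    uint32_t sum;
    __asm__ ("mov r3, %[a]\n\t"
             "add %[sum], r3, %[b]"
             : [sum] "=r" (sum)
             : [a] "r" (a), [b] "r" (b)
             : "r3"                      /* clobbered scratch register */);
    return sum;
#else
    return a + b;
#endif
}
```

Calling code then just writes "y = add_u32(x, 1);" and never touches a
register name. Without the "r3" clobber in the second helper, the asm
could overwrite a value the optimiser had parked in r3, which is the
kind of failure that only shows up with optimisation enabled.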
>>>
>>> Finally, is there a good reason why you need inline assembly rather than
>>> the neon intrinsics provided by gcc?
>>>
>>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>>
>>>
>>> mvh.,
>>>
>>> David
>>>
>>>
>>> On 19/02/14 20:04, Cody Rigney wrote:
>>>> Hi,
>>>>
>>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow. See
>>>> link below.
>>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>>
>>>> The gcc version could vary since this is an open source project, but
>>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>>> w/ NEON. The processor I'm testing on is an ARM
>>>> Cortex-A15(big.LITTLE).
>>>>
>>>> The problem is, in release mode (where optimizations are set) it does
>>>> not work properly. However, in debug mode, it works fine. I tracked
>>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>>> made it volatile and that part worked fine after that. However, I'm
>>>> still having incorrect behavior from some other optimization. I'm new
>>>> to inline assembly, so I thought maybe I'm doing something wrong
>>>> that's not telling the compiler that I'm using a certain variable.
>>>>
>>>> Below is the code at its current state. Ignore all the comments and
>>>> volatiles(for testing this problem) everywhere. It's WIP. I removed
>>>> unnecessary functions and code so it would be easier to see. I think
>>>> the problem is in the bottom-most asm block because if I do if(false)
>>>> to skip it, I don't run into the problem. Thanks.
>>>>
>>>
>>> <snip>
>>>
>>>
>
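
For the "small test case with intrinsics" suggested earlier in the
thread, something along these lines might be a starting point. It is
only a sketch: add_f32 is a made-up kernel, not anything from
lkpyramid.cpp, and the NEON path is compiled only when the compiler
defines __ARM_NEON (or the older __ARM_NEON__), so the same file also
builds on a desktop for comparison.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical test kernel: dst[i] = a[i] + b[i] for n floats. */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration with NEON intrinsics. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif

    /* Scalar loop: handles the tail on NEON builds, and the whole
     * array when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Building this with optimisation enabled and stopping at the assembly
(for example "gcc -O2 -mfpu=neon -S" on an ARMv7 toolchain, adjusting
flags to the target) shows what the compiler generates for the loop,
which can then be compared against a hand-written version before
deciding whether inline assembly is worth it.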