On 20/02/14 20:39, Cody Rigney wrote:
> Thanks for the advice. I didn't realize before that volatile was
> actually hiding the problem.
>
> Do you mind providing an example of what you mean by using a "static
> inline" function? That sounds like a better way of managing the
> assembly. I know what you mean, but I would like to see an example of
> the details (like passing parameters, etc).

Suppose you have an assembly instruction "foo <dest> <src>". You would
write something like this:

    #include <stdint.h>

    static inline uint32_t foo(uint32_t x)
    {
        uint32_t y;
        /* x is passed in and y returned in registers the compiler picks */
        asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
        return y;
    }

Then you would use it in code as "y = foo(x);" and the compiler would
put in the single assembly line (plus any code needed to put x into a
register - let the compiler handle that sort of thing). This lets you
keep the messy inline assembly stuff separate from the algorithm code
that actually uses it.

>
> Initially, I began writing the NEON acceleration in intrinsics.
> Then, I read more and more about NEON intrinsics being much slower
> when compiled with gcc, due to some stack pops and pushes that fill it
> up. Apparently, the Microsoft ARM compiler and Apple's ARM compiler
> do well with NEON intrinsics, but GCC does not. So I switched to
> inline assembly. I haven't actually tested this myself, but since
> OpenCV is cross-platform, I wanted to make the acceleration work
> cross-platform in the fastest way.

Don't believe random stuff you read on the internet about compiler
speeds - test it yourself. One key reason is that the internet never
forgets, and information is seldom dated - perhaps the intrinsics /were/
slow when first introduced in gcc 4.3 (or whatever), but they could be
much faster with 4.8. The other issue is that you have to have
optimisation enabled (sometimes even -O3 or extra specific optimisation
flags, and often -ffast-math) to get the kind of scheduling, loop
unrolling, and other optimisations needed to get the best out of NEON.
The internet is full of people compiling without optimisation and then
complaining about the slow code. It is even conceivable that the latest
gcc advances in auto-vectorisation can generate good enough neon code
without using intrinsics or inline assembly (I don't know if the
auto-vectorisation stuff supports ARM/NEON yet).

So make /small/ test cases that let you see exactly what is happening.
Use the intrinsics, pull things out into small and clear functions (if
they are "static" then the compiler will be able to inline them) so that
you can separate your logic from the low-level mechanics, and examine
the generated assembly for the critical parts. Only go for inline
assembly if it will really make a difference.

mvh.,

David

>
> Thanks,
> Cody
>
> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> I haven't read through the code at all, but I will give you a little
>> general advice.
>>
>> Try to cut the code to the absolute minimum that shows the problem. It
>> makes it easier for you to work with and check, and it makes it easier
>> for other people to examine. Also make sure that the code has no other
>> dependencies such as extra headers - ideally people should be able to
>> compile the code themselves and test it (I realise this is difficult for
>> those who don't have an ARM handy).
>>
>> Code that works without optimisation but fails with optimisation, or
>> that works when you make a variable volatile, is always a bug.
>> Occasionally, it is a bug in the compiler - but most often it is a bug
>> in the code. Either way, it is important to figure out the root cause,
>> and not try to hide it by making things volatile (though that might be a
>> good temporary fix for a compiler bug).
>>
>> I am not familiar with Neon (and not as good as I should be at ARM
>> assembly in general), but it looks to me that you have used specific
>> registers in your inline assembly, and assumed specific registers for
>> compiler use (such as variables). Don't do that. When you have turned
>> off all optimisation, the compiler is consistent about which registers
>> it uses for different purposes - when optimising, it changes register
>> usage in a very unpredictable way. You must be explicit - all data
>> going into your assembly must be declared, as must all data coming out
>> of the assembly. And if you use specific registers, you need to tell
>> the compiler about them (as "clobbers") - and be aware that the compiler
>> might be using those registers for the input or output values.
>>
>> Getting inline assembly right is not easy, and it is often best to work
>> with several small assembly statements rather than large ones - I
>> usually make a "static inline" function around a line or two of inline
>> assembly and then use that function in the code as needed. It can make
>> the result a lot clearer, and makes it easier to mix the C and assembly
>> - the end result is often better than I would make in pure assembly.
>>
>> Finally, is there a good reason why you need inline assembly rather than
>> the neon intrinsics provided by gcc?
>>
>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>
>> mvh.,
>>
>> David
>>
>>
>> On 19/02/14 20:04, Cody Rigney wrote:
>>> Hi,
>>>
>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow. See
>>> link below.
>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>
>>> The gcc version could vary since this is an open source project, but
>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>> w/ NEON. The processor I'm testing on is an ARM
>>> Cortex-A15(big.LITTLE).
>>>
>>> The problem is, in release mode (where optimizations are set) it does
>>> not work properly. However, in debug mode, it works fine. I tracked
>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>> made it volatile and that part worked fine after that. However, I'm
>>> still having incorrect behavior from some other optimization. I'm new
>>> to inline assembly, so I thought maybe I'm doing something wrong
>>> that's not telling the compiler that I'm using a certain variable.
>>>
>>> Below is the code at its current state. Ignore all the comments and
>>> volatiles(for testing this problem) everywhere. It's WIP. I removed
>>> unnecessary functions and code so it would be easier to see. I think
>>> the problem is in the bottom-most asm block because if I do if(false)
>>> to skip it, I don't run into the problem. Thanks.
>>>
>>
>> <snip>
>>
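
As a rough illustration of the "small static functions around the
intrinsics" advice in the reply at the top, here is a minimal sketch.
The names mla4 and dot are made up for this example and are not from
OpenCV or from the code in this thread; the scalar fallback is an
assumption added so the same file also builds on a machine without
NEON. The NEON path uses only vld1q_f32, vmlaq_f32 and vst1q_f32 from
<arm_neon.h>.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for i = 0..3.
 * All the low-level mechanics live here; the compiler chooses the
 * registers, so nothing has to be listed as a clobber. */
static inline void mla4(float *acc, const float *a, const float *b)
{
#if HAVE_NEON
    float32x4_t va   = vld1q_f32(a);
    float32x4_t vb   = vld1q_f32(b);
    float32x4_t vacc = vld1q_f32(acc);
    vacc = vmlaq_f32(vacc, va, vb);   /* vacc + va * vb, lane by lane */
    vst1q_f32(acc, vacc);
#else
    /* Scalar fallback (assumption: only here for non-NEON builds). */
    for (int i = 0; i < 4; i++)
        acc[i] += a[i] * b[i];
#endif
}

/* Algorithm code: a dot product written in terms of mla4(), with no
 * intrinsics or assembly visible at this level. */
static float dot(const float *a, const float *b, size_t n)
{
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        mla4(acc, a + i, b + i);

    float sum = acc[0] + acc[1] + acc[2] + acc[3];

    for (; i < n; i++)                /* leftover elements */
        sum += a[i] * b[i];

    return sum;
}
```

Calling "s = dot(a, b, n);" then reads like ordinary C. To follow the
"test it yourself" advice, one option is to build a small file like
this with optimisation on (for ARMv7, something like -O2 -mfpu=neon)
and read the generated assembly for mla4 before deciding whether inline
assembly is worth the trouble.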