Re: Compiler optimizing variables in inline assembly

Thanks for the example code snippet.

As far as the intrinsics go, I looked a little deeper.  According to the
bug report linked below, gcc has had trouble producing optimal NEON
assembly from intrinsics.  Some of these bugs have been resolved, others
have not.  For now I'll stick with inline assembly until I know they
have been fixed, although I'm not sure whether the OpenCV developers
would accept inline assembly in a pull request.  If they won't, I'll
just switch to intrinsics.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562


On Fri, Feb 21, 2014 at 5:14 AM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
> On 20/02/14 20:39, Cody Rigney wrote:
>> Thanks for the advice.  I didn't realize before that volatile was
>> actually hiding the problem.
>>
>> Do you mind providing an example of what you mean by using a "static
>> inline" function?  That sounds like a better way of managing the
>> assembly.  I know what you mean, but I would like to see an example of
>> the details (like passing parameters, etc).
>
> Suppose you have an assembly instruction "foo <dest> <src>".  You would
> write something like this:
>
> static inline uint32_t foo(uint32_t x) {
>         uint32_t y;
>         asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
>         return y;
> }
>
> Then you would use it in code as "y = foo(x);" and the compiler would
> put in the single assembly line (plus any code needed to put x into a
> register - let the compiler handle that sort of thing).
>
> This lets you keep the messy inline assembly stuff separate from the
> algorithm code that actually uses it.
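
A minimal sketch of that wrapper pattern with a real instruction, for
reference.  It assumes 32-bit ARMv7 and uses RBIT (reverse the bits of a
word).  The name reverse_bits and the plain C fallback are assumptions
added for illustration so the snippet also builds on non-ARM hosts; they
are not taken from the code discussed in this thread.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word.
 * On 32-bit ARM (ARMv7 or later) this wraps the single RBIT instruction.
 * Elsewhere it falls back to a plain C loop (illustrative assumption). */
static inline uint32_t reverse_bits(uint32_t x)
{
#if defined(__arm__) && (__ARM_ARCH >= 7)
    uint32_t y;
    /* Named operands: the compiler picks the registers for x and y. */
    __asm__ ("rbit %[dest], %[src]" : [dest] "=r" (y) : [src] "r" (x));
    return y;
#else
    uint32_t y = 0;
    for (int i = 0; i < 32; i++) {
        y = (y << 1) | (x & 1u);   /* move the lowest bit of x into y */
        x >>= 1;
    }
    return y;
#endif
}
```

Calling code just writes "y = reverse_bits(x);" and never sees the asm
statement, so the algorithm stays readable and the register allocation is
left entirely to the compiler.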
>
>
>>
>> Initially, I began writing the NEON acceleration in intrinsics.
>> Then, I read more and more about NEON intrinsics being much slower
>> when compiled with gcc, due to some stack pops and pushes that fill it
>> up.  Apparently, the Microsoft ARM compiler and Apple's ARM compiler
>> do well with NEON intrinsics, but GCC does not. So I switched to
>> inline assembly.  I haven't actually tested this myself, but since
>> OpenCV is cross-platform, I wanted to make the acceleration work
>> cross-platform in the fastest way.
>
> Don't believe random stuff you read on the internet about compiler
> speeds - test it yourself.  One key reason is that the internet never
> forgets, and information is seldom dated - perhaps the intrinsics /were/
> slow when first introduced in gcc 4.3 (or whatever), but they could be
> much faster with 4.8.  The other issue is that you have to have
> optimisation enabled (sometimes even -O3 or extra specific optimisation
> flags, and often -ffast-math) to get the kind of scheduling, loop
> unrolling, and other optimisations needed to get the best out of NEON.
> The internet is full of people compiling without optimisation and then
> complaining about the slow code.  It is even conceivable that the latest
> gcc advances in auto-vectorisation can generate good enough neon code
> without using intrinsics or inline assembly (I don't know if the
> auto-vectorisation stuff supports ARM/NEON yet).
>
> So make /small/ test cases that let you see exactly what is happening.
> Use the intrinsics, pull things out into small and clear functions (if
> they are "static" then the compiler will be able to inline them) so that
> you can separate your logic from the low-level mechanics, and examine
> the generated assembly for the critical parts.  Only go for inline
> assembly if it will really make a difference.
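
A small test case of the kind described above might look like the sketch
below.  It is only an illustration: scale_add is a hypothetical function,
not part of OpenCV, and the scalar path is there so the file also builds
without NEON.  The intrinsics used (vld1q_f32, vmlaq_n_f32, vst1q_f32)
come from arm_neon.h.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: dst[i] = a[i] * scale + b[i] for i in [0, n). */
static void scale_add(float *dst, const float *a, const float *b,
                      float scale, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration with NEON intrinsics. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* vmlaq_n_f32(acc, v, s) computes acc + v * s per lane. */
        vst1q_f32(dst + i, vmlaq_n_f32(vb, va, scale));
    }
#endif
    /* Scalar tail (and the whole loop when NEON is not available). */
    for (; i < n; i++)
        dst[i] = a[i] * scale + b[i];
}
```

Building just this file for 32-bit ARM with something like
"gcc -O3 -mfpu=neon -S" (plus whatever float ABI flags the toolchain
needs) and reading the resulting .s file shows directly whether the
compiler adds extra stack traffic around the intrinsics.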
>
> mvh.,
>
> David
>
>
>>
>> Thanks,
>>
>> Cody
>>
>> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I haven't read through the code at all, but I will give you a little
>>> general advice.
>>>
>>> Try to cut the code to the absolute minimum that shows the problem.  It
>>> makes it easier for you to work with and check, and it makes it easier
>>> for other people to examine.  Also make sure that the code has no other
>>> dependencies such as extra headers - ideally people should be able to
>>> compile the code themselves and test it (I realise this is difficult for
>>> those who don't have an ARM handy).
>>>
>>> Code that works without optimisation but fails with optimisation, or
>>> that works when you make a variable volatile, is always a bug.
>>> Occasionally, it is a bug in the compiler - but most often it is a bug
>>> in the code.  Either way, it is important to figure out the root cause,
>>> and not try to hide it by making things volatile (though that might be a
>>> good temporary fix for a compiler bug).
>>>
>>> I am not familiar with Neon (and not as good as I should be at ARM
>>> assembly in general), but it looks to me that you have used specific
>>> registers in your inline assembly, and assumed specific registers for
>>> compiler use (such as variables).  Don't do that.  When you have turned
>>> off all optimisation, the compiler is consistent about which registers
>>> it uses for different purposes - when optimising, it changes register
>>> usage in a very unpredictable way.  You must be explicit - all data
>>> going into your assembly must be declared, as must all data coming out
>>> of the assembly.  And if you use specific registers, you need to tell
>>> the compiler about them (as "clobbers") - and be aware that the compiler
>>> might be using those registers for the input or output values.
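
A sketch of what declaring everything looks like in practice, assuming
32-bit ARM.  The function add64 is a made-up example, not code from this
thread, and the C fallback is there only so it compiles on other
targets.  Every value going in or out is a named operand, the condition
flags are listed as a clobber, and the first output is marked
early-clobber because it is written before all inputs have been read.

```c
#include <stdint.h>

/* Add two 64-bit values passed as 32-bit halves (hypothetical helper). */
static inline uint64_t add64(uint32_t alo, uint32_t ahi,
                             uint32_t blo, uint32_t bhi)
{
#if defined(__arm__)
    uint32_t lo, hi;
    __asm__ ("adds %[lo], %[alo], %[blo]\n\t"   /* low words, sets carry  */
             "adc  %[hi], %[ahi], %[bhi]"       /* high words plus carry  */
             : [lo] "=&r" (lo),   /* "&": written before ahi/bhi are read */
               [hi] "=r"  (hi)
             : [alo] "r" (alo), [blo] "r" (blo),
               [ahi] "r" (ahi), [bhi] "r" (bhi)
             : "cc");             /* ADDS changes the condition flags     */
    return ((uint64_t)hi << 32) | lo;
#else
    /* Portable fallback (illustrative assumption). */
    return (((uint64_t)ahi << 32) | alo) + (((uint64_t)bhi << 32) | blo);
#endif
}
```

Without the "&" the compiler would be free to place lo in the same
register as ahi or bhi, which is the kind of failure that shows up only
once optimisation is turned on.  A statement that uses a fixed register
internally would name it in the same clobber list next to "cc".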
>>>
>>> Getting inline assembly right is not easy, and it is often best to work
>>> with several small assembly statements rather than large ones - I
>>> usually make a "static inline" function around a line or two of inline
>>> assembly and then use that function in the code as needed.  It can make
>>> the result a lot clearer, and makes it easier to mix the C and assembly
>>> - the end result is often better than I would make in pure assembly.
>>>
>>> Finally, is there a good reason why you need inline assembly rather than
>>> the neon intrinsics provided by gcc?
>>>
>>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>>
>>>
>>> mvh.,
>>>
>>> David
>>>
>>>
>>>
>>>
>>> On 19/02/14 20:04, Cody Rigney wrote:
>>>> Hi,
>>>>
>>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>>> link below.
>>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>>
>>>> The gcc version could vary since this is an open source project, but
>>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>>> w/ NEON. The processor I'm testing on is an ARM
>>>> Cortex-A15 (big.LITTLE).
>>>>
>>>> The problem is, in release mode (where optimizations are set) it does
>>>> not work properly. However, in debug mode, it works fine. I tracked
>>>> down a specific variable (FLT_SCALE) that was being optimized out and
>>>> made it volatile and that part worked fine after that. However, I'm
>>>> still having incorrect behavior from some other optimization.  I'm new
>>>> to inline assembly, so I thought maybe I'm doing something wrong
>>>> that's not telling the compiler that I'm using a certain variable.
>>>>
>>>> Below is the code at its current state. Ignore all the comments and
>>>> volatiles (for testing this problem) everywhere. It's WIP. I removed
>>>> unnecessary functions and code so it would be easier to see. I think
>>>> the problem is in the bottom-most asm block because if I do if(false)
>>>> to skip it, I don't run into the problem. Thanks.
>>>>
>>>
>>> <snip>
>>>
>>>
>



