Re: Compiler optimizing variables in inline assembly

That makes sense.  In this case, the input parameters are actually
memory addresses.  So how would I do an output or clobber that would
tell the compiler that the memory at those addresses will change?

Thanks for your time,

Cody

On Thu, Feb 20, 2014 at 4:14 AM, Andrew Haley <aph@xxxxxxxxxx> wrote:
> Hi,
>
> On 02/19/2014 07:04 PM, Cody Rigney wrote:
>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>> link below.
>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>
>> The gcc version could vary since this is an open source project, but
>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>> w/ NEON. The processor I'm testing on is an ARM
>> Cortex-A15(big.LITTLE).
>>
>> The problem is, in release mode (where optimizations are set) it does
>> not work properly. However, in debug mode, it works fine. I tracked
>> down a specific variable (FLT_SCALE) that was being optimized out and
>> made it volatile and that part worked fine after that. However, I'm
>> still having incorrect behavior from some other optimization.
>
> Forget about using volatile here.  That's just wrong.
>
> You have to mark your inputs, outputs, and clobbers correctly.
>
> Look at this asm:
>
>                __asm__ volatile (
>                                   "vld1.16 {q0}, [%0]\n\t" //trow0[x + cn]
>                                   "vld1.16 {q1}, [%1]\n\t" //trow0[x - cn]
>                                   "vsub.i16 q5, q0, q1\n\t" //this is t0
>                                   "vld1.16 {q2}, [%2]\n\t" //trow1[x + cn]
>                                   "vld1.16 {q3}, [%3]\n\t" //trow1[x - cn]
>                                   "vadd.i16 q6, q2, q3\n\t" //this needs mult by 3
>                                   "vld1.16 {q4}, [%4]\n\t" //trow1[x]
>                                   "vmul.i16 q7, q6, q8\n\t" //this needs to add to trow1[x]*10
>                                   "vmul.i16 q10, q4, q9\n\t" //this is trow1[x]*10
>                                   "vadd.i16 q11, q7, q10\n\t" //this is t1
>                                   "vswp d22, d11\n\t"
>                                   "vst2.16 {q5}, [%5]\n\t" //interleave
>                                   "vst2.16 {q11}, [%6]\n\t" //interleave
>                                   :
>                                   : "r" (trow0 + x + cn),  //0
>                                     "r" (trow0 + x - cn),  //1
>                                     "r" (trow1 + x + cn),  //2
>                                     "r" (trow1 + x - cn),  //3
>                                     "r" (trow1 + x),       //4
>                                     "r" (drow + (x*2)),     //5
>                                     "r" (drow + (x*2)+8)   //6
>                                   :
>                                   );
>
> It has no outputs.  How is this possible?  It does a lot of work.  It must
> have some outputs.  I think there should be some outputs for this asm.  I
> think they are memory outputs.
>
> Go through all the asm blocks, and mark the inputs, outputs, and clobbers.
> Then it should work.
>
> Remember one basic thing: you must tell GCC about everything that an
> asm does.  If it affects memory, you must tell GCC.  If it reads memory,
> you must tell GCC.  Do not lie to the compiler: it will bite you.
>
> Andrew.
>
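
For reference, here is a minimal sketch of the kind of annotation Andrew
describes, reduced to a single 8-lane subtract.  The helper name sub8_s16
and its signature are hypothetical and are not taken from lkpyramid.cpp.
The pointers are still passed in registers, but the memory behind them is
also listed as "m" inputs and an "=m" output, and the NEON registers the
asm overwrites are listed as clobbers.  The NEON path is only compiled for
32-bit ARM with NEON enabled; other targets fall back to plain C, so the
sketch builds anywhere.

```c
#include <stdint.h>

/* Hypothetical helper (not from OpenCV): dst[i] = a[i] - b[i] for 8 lanes. */
void sub8_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
#if defined(__arm__) && defined(__ARM_NEON__)
    __asm__ volatile (
        "vld1.16 {q0}, [%[a]]\n\t"      /* load 8 x int16 from a   */
        "vld1.16 {q1}, [%[b]]\n\t"      /* load 8 x int16 from b   */
        "vsub.i16 q0, q0, q1\n\t"       /* q0 = a - b, lane-wise   */
        "vst1.16 {q0}, [%[dst]]\n\t"    /* store 8 x int16 to dst  */
        : /* outputs: the 16 bytes at dst are written by the asm */
          "=m" (*(int16_t (*)[8]) dst)
        : /* inputs: the addresses, plus the memory that is read */
          [dst] "r" (dst), [a] "r" (a), [b] "r" (b),
          "m" (*(const int16_t (*)[8]) a),
          "m" (*(const int16_t (*)[8]) b)
        : /* clobbers: every register the asm overwrites */
          "q0", "q1"
        /* Coarser alternative: drop the "m"/"=m" operands and add
           "memory" here, which makes GCC assume any memory may have
           been read or written. */
    );
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)(a[i] - b[i]);
#endif
}
```

With the memory operands in place, GCC knows the stores to dst happen
inside the asm and that a and b must be up to date in memory before it
runs, so there is no need to make surrounding variables volatile.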



