Re: asm volatile("":::"memory) uncertainty.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi David and Andrew,

On 5/10/2016 8:44 AM, David Brown wrote:
On 10/05/16 11:24, Andrew Haley wrote:
On 09/05/16 22:40, Tom Udale wrote:

My confusion is whether or not the memory cobberer prevents movement of
statements in addition to flushing any values that may be in registers
before the compiler_barrier() line.  i.e. it is unclear if there is any
control over what statements the memory being flushed is attached to (if
this makes any sense).

Memory barriers only prevent code which accesses memory from being
moved.

It's not going to be easy to do what you're asking for in the
presence of optimization.  GCC orders instruction generation by
dependency.  I'd simply put all the code you want to time into a
separate function in a separate compilation unit and do this:

   start = starttime();  foo();  end = endtime();

Alteratively, it might be possible to define asm inputs and outputs
so that your asm barriers depend on the statements you're trying
to measure.

Andrew.


An easy way is to make sure your "do some stuff" has at least one thing
that it depends on, and at least one result.  You can even make the
dependencies completely artificial.  But the key is to make this bits
volatile:

volatile unsigned ct;
volatile unsigned time;
volatile unsigned v1;
volatile unsigned v2;

{
	ct = CNT;
	v2 = doSomeStuff(v1);
	time = CNT - ct;
}


I had a feeling that volatile would solve a lot of this. I was hoping there was something a little less of a sledgehammer. volatile unfortunately conflates code movement (which in this instance I care about) with memory caching (which in this instance is fine).

But from what Andrew said it looks like anything that does not touch memory is subject to movement anyway.

Interestingly the code I am struggling with is indeed exclusively memory accesses (to my understanding at least):

while(1)
{
        ct=CNT;
        //...
        // Do a bunch of other stuff.
        //...
        // Record the actual voltage/dacval we programmed above.
        // We want that so we will have a consistent data set
        // for p_calcLoad next time around the loop.
        calcLoad_vDes=vDes;
        calcLoad_vDesShortProt = vDesShortProt;
        calcLoad_rawDacVal32 = rawDacVal32;

        // Increment counters.
        loopIndex++;
        status->stimLoopIndex=loopIndex;
        nextCycleTime+=DBUS_LOOP_TIME;
        status->checkSum=CalcStimStatusCheckSum(status);

        // Record total loop time.
        COMPILER_BARRIER();
        status->stimLoopTime=CNT-ct;
}

The resulting asm is:

 341:cpx_stimdriver.cogc ****   COMPILER_BARRIER();
 627              		.loc 2 341 0
 342:cpx_stimdriver.cogc ****   status->stimLoopTime=CNT-ct;
 628              		.loc 2 342 0
 629 0514 0000BCA0 		mov	r7, CNT
 630 0518 0000BCA0 		mov	r5, r14
 631 051c 0000BC84 		sub	r7, r8
 632 0520 1800FC80 		add	r5, #24
 633 0524 1800FCA0 		mov	r4, #24
 634 0528 0000BC80 		add	r4, sp
 337:cpx_stimdriver.cogc ****   nextCycleTime+=DBUS_LOOP_TIME;
 635              		.loc 2 337 0
 636 052c 0000BCA0 		mov	r8, r13
 637              	.LVL65
 332:cpx_stimdriver.cogc ****   calcLoad_rawDacVal32 = rawDacVal32;
 638              		.loc 2 332 0
 639 0530 0000BCA0 		mov	r10, r12
 342:cpx_stimdriver.cogc ****   status->stimLoopTime=CNT-ct;
 640              		.loc 2 342 0
 641 0534 00003C08 		wrlong	r7, r5
 331:cpx_stimdriver.cogc ****   calcLoad_vDesShortProt = vDesShortProt;
 642              		.loc 2 331 0
 643 0538 0800FCA0 		mov	r5, #8
 644 053c 0000BC80 		add	r5, sp
 645 0540 00003C08 		wrlong	r11, r5
 211:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
 646              		.loc 2 211 0
 647 0544 0400FCA0 		mov	r7, #4
 648 0548 1C00FCA0 		mov	r5, #28
 649 054c 0000BC08 		rdlong	r6, sp
 650 0550 0000BC80 		add	r7, sp
 651 0554 0000BC80 		add	r5, sp
 652 0558 00003C08 		wrlong	r6, r7
 653 055c 0000BC08 		rdlong	r1, r4
 654 0560 0000BC08 		rdlong	lr, r5
 655 0564 00007C5C 		jmp	#.L12



To my mind, the only reason that these would not be considered "memory references" is because the compiler has already decided to put the memory into registers (true these are mostly automatic variables so they can easily be put into registers). But from an abstract machine standpoint a=b is a memory read and a memory write is it not (or is that a totally incorrect statement)? It seems odd therefore that a register allocation choice can fundamentally alter program behavior.

It appears that the "memory" clobberer is operating at a level much lower than the C statement which means you _really_ need to be careful with it since what is memory and what is not could easily change with the addition or removal of reference (i.e. something that once was held in memory is not any longer because a removed reference made another variable be better suited to the stack allowing the one-time memory variable the better candidate for a register).






You can also use inline assembly to force dependency calculations:

// Ensure that "val" has been calculated before next volatile access
// by requiring it as an assembly input.  Note that only volatiles are
ordered!
#define forceDependency(val) \
		asm volatile("" :: "" (val) : )


So this pretty much creates a NOP reference to val that will ensure it is cannot be moved after forceDependency?

I know that only volatile variables are ordered and can be interleaved with non-volatiles. But your macro works even for non-volatile vals right? My logic being that "asm volatile" prevents the forceDependency from moving. And the reference to val forces val to be fully stored before forceDependency, even if it is a register. No?


As a digression, I guess what is actually needed to solve this problem _completely_ is an explicit keyword that enforces the barrier so no code or data movement occurs.

Would such a thing be difficult to implement? Or is the architecture of GCC aligned against it?

I ask because I think this issue might be a real problem for this target. The timing example I am complaining about above is not the real deal breaker for this target. This is:

// do stuff
doSomthingThatWeMustWaitFor();
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
// do other stuff

or if you are trying to minimize stalled CPU time:

// do stuff
doSomthingThatWeMustWaitFor();
ct=CNT;
// do a little stuff
waitcnt(ct+SOMETIMEWEMUSTWAIT);
// do other stuff


waitcnt waits for clock cycles, so SOMETIMEWEMUSTWAIT can be very small (i.e. you can use it to wait for SPI/I2C clock periods and such). If you waitcnt for a time in the past however, the thread locks until a 0xFFFFFFFF cycles have elapsed. Even at 80MHz, this is a while. Thus if the compiler decides to put a bunch of stuff between the read of CNT and the call to waitcnt, it can lock the thread for about a minute. That would be annoying to discover in a less taken branch. Both CNT and waitcnt are compiler intrinsics that decay to a single opcode so wrapping it all in a noinline function would be a hit.

This is a tremendously common idiom that you see all over code for this target so one can imagine that subtle bugs could creep in easily.

Even making lots of things volatile does not help because even a single non-volatile could sneak in there. You have to make _everything_ non-volatile which would be a tremendous space and time hit.

So having a keyword that would just kill code movement across it would be excellent:

doSomthingThatWeMustWaitFor();
__codeBarrier;
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
__codeBarrier;


If it is not a complete impossibility I might suggest it to the propeller guys.


Anyway, thank you both for your comments and advice.



Best regards,

Tom











[Index of Archives]     [Linux C Programming]     [Linux Kernel]     [eCos]     [Fedora Development]     [Fedora Announce]     [Autoconf]     [The DWARVES Debugging Tools]     [Yosemite Campsites]     [Yosemite News]     [Linux GCC]

  Powered by Linux