Re: asm volatile("":::"memory") uncertainty.

Tom Udale <tom@xxxxxxxxxxxx> · Tue, 10 May 2016 11:58:05 -0400

Hi David and Andrew,

On 5/10/2016 8:44 AM, David Brown wrote:
On 10/05/16 11:24, Andrew Haley wrote:
On 09/05/16 22:40, Tom Udale wrote:

My confusion is whether or not the memory clobber prevents movement of
statements in addition to flushing any values that may be in registers
before the compiler_barrier() line.  i.e. it is unclear if there is any
control over what statements the memory being flushed is attached to (if
this makes any sense).

Memory barriers only prevent code which accesses memory from being
moved.

It's not going to be easy to do what you're asking for in the
presence of optimization.  GCC orders instruction generation by
dependency.  I'd simply put all the code you want to time into a
separate function in a separate compilation unit and do this:

   start = starttime();  foo();  end = endtime();

Alternatively, it might be possible to define asm inputs and outputs
so that your asm barriers depend on the statements you're trying
to measure.

Andrew.

An easy way is to make sure your "do some stuff" has at least one thing
that it depends on, and at least one result.  You can even make the
dependencies completely artificial.  But the key is to make these bits
volatile:

volatile unsigned ct;
volatile unsigned time;
volatile unsigned v1;
volatile unsigned v2;

{
	ct = CNT;
	v2 = doSomeStuff(v1);
	time = CNT - ct;
}

I had a feeling that volatile would solve a lot of this.  I was hoping 
there was something a little less of a sledgehammer.  volatile 
unfortunately conflates code movement (which in this instance I care 
about) with memory caching (which in this instance is fine).

But from what Andrew said it looks like anything that does not touch 
memory is subject to movement anyway.

Interestingly the code I am struggling with is indeed exclusively memory 
accesses (to my understanding at least):

while(1)
{
        ct=CNT;
        //...
        // Do a bunch of other stuff.
        //...
        // Record the actual voltage/dacval we programmed above.
        // We want that so we will have a consistent data set
        // for p_calcLoad next time around the loop.
        calcLoad_vDes=vDes;
        calcLoad_vDesShortProt = vDesShortProt;
        calcLoad_rawDacVal32 = rawDacVal32;

        // Increment counters.
        loopIndex++;
        status->stimLoopIndex=loopIndex;
        nextCycleTime+=DBUS_LOOP_TIME;
        status->checkSum=CalcStimStatusCheckSum(status);

        // Record total loop time.
        COMPILER_BARRIER();
        status->stimLoopTime=CNT-ct;
}

The resulting asm is:

 341:cpx_stimdriver.cogc ****   COMPILER_BARRIER();
 627              		.loc 2 341 0
 342:cpx_stimdriver.cogc ****   status->stimLoopTime=CNT-ct;
 628              		.loc 2 342 0
 629 0514 0000BCA0 		mov	r7, CNT
 630 0518 0000BCA0 		mov	r5, r14
 631 051c 0000BC84 		sub	r7, r8
 632 0520 1800FC80 		add	r5, #24
 633 0524 1800FCA0 		mov	r4, #24
 634 0528 0000BC80 		add	r4, sp
 337:cpx_stimdriver.cogc ****   nextCycleTime+=DBUS_LOOP_TIME;
 635              		.loc 2 337 0
 636 052c 0000BCA0 		mov	r8, r13
 637              	.LVL65
 332:cpx_stimdriver.cogc ****   calcLoad_rawDacVal32 = rawDacVal32;
 638              		.loc 2 332 0
 639 0530 0000BCA0 		mov	r10, r12
 342:cpx_stimdriver.cogc ****   status->stimLoopTime=CNT-ct;
 640              		.loc 2 342 0
 641 0534 00003C08 		wrlong	r7, r5
 331:cpx_stimdriver.cogc ****   calcLoad_vDesShortProt = vDesShortProt;
 642              		.loc 2 331 0
 643 0538 0800FCA0 		mov	r5, #8
 644 053c 0000BC80 		add	r5, sp
 645 0540 00003C08 		wrlong	r11, r5
 211:cpx_stimdriver.cogc **** /*<*/   vDes=src->desiredV;
 646              		.loc 2 211 0
 647 0544 0400FCA0 		mov	r7, #4
 648 0548 1C00FCA0 		mov	r5, #28
 649 054c 0000BC08 		rdlong	r6, sp
 650 0550 0000BC80 		add	r7, sp
 651 0554 0000BC80 		add	r5, sp
 652 0558 00003C08 		wrlong	r6, r7
 653 055c 0000BC08 		rdlong	r1, r4
 654 0560 0000BC08 		rdlong	lr, r5
 655 0564 00007C5C 		jmp	#.L12

To my mind, the only reason that these would not be considered "memory 
references" is because the compiler has already decided to put the 
memory into registers (true these are mostly automatic variables so they 
can easily be put into registers).  But from an abstract machine 
standpoint a=b is a memory read and a memory write, is it not (or is that
a totally incorrect statement)?  It seems odd therefore that a register 
allocation choice can fundamentally alter program behavior.

It appears that the "memory" clobber is operating at a level much
lower than the C statement, which means you _really_ need to be careful
with it, since what is memory and what is not could easily change with
the addition or removal of a reference (i.e. something that was once held
in memory may no longer be, because removing a reference made another
variable better suited to the stack, leaving the former memory variable
the better candidate for a register).
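
To make the register-versus-memory point concrete, here is a minimal
sketch.  It assumes COMPILER_BARRIER() expands to the asm statement from
the subject line (the actual definition is not shown in this thread), and
fake_cnt is a hypothetical stand-in for CNT so that the code builds on
any GCC target.

```c
#include <stdint.h>

/* Assumed definition of the barrier, taken from the subject line. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Hypothetical stand-in for the Propeller CNT register. */
volatile uint32_t fake_cnt;

/* File-scope object: a store to it is a memory access. */
uint32_t shared_result;

uint32_t time_sum_of_squares(uint32_t n)
{
    uint32_t start = fake_cnt;      /* volatile read */
    COMPILER_BARRIER();

    /* acc and i are locals that can live entirely in registers, so the
       arithmetic below is not a memory access as far as the clobber is
       concerned. */
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i * i;

    /* This store is a memory access; it may not cross either barrier. */
    shared_result = acc;

    COMPILER_BARRIER();
    return fake_cnt - start;        /* volatile read */
}
```

Only the store to shared_result is pinned between the two barriers.  The
loop that produces acc touches no memory, so nothing in the clobber stops
the compiler from evaluating it earlier.  If the function is inlined with
a constant n, for example, the whole sum can be folded at compile time
and the timed region contains little more than the store.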

You can also use inline assembly to force dependency calculations:

// Ensure that "val" has been calculated before next volatile access
// by requiring it as an assembly input.  Note that only volatiles are
// ordered!
#define forceDependency(val) \
		asm volatile("" :: "" (val) : )

So this pretty much creates a NOP reference to val that will ensure it 
cannot be moved after forceDependency?

I know that only volatile variables are ordered and can be interleaved 
with non-volatiles.  But your macro works even for non-volatile vals 
right?  My logic being that "asm volatile" prevents the forceDependency 
from moving.  And the reference to val forces val to be fully stored 
before forceDependency, even if it is a register.  No?
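
As a sketch of how this could look in a timing sequence (my reading, not
something confirmed in this thread): the macros below are hypothetical
variants that use explicit "r" constraints rather than the empty
constraint above, and fake_cnt and do_some_stuff are stand-ins so the
code builds on any GCC target.

```c
#include <stdint.h>

/* Hypothetical variant of forceDependency with an explicit register
   input constraint: val must be computed before this asm executes. */
#define FORCE_DEPENDENCY_R(val) __asm__ volatile("" :: "r"(val))

/* Hypothetical helper: the "+r" constraint tells the compiler the asm may
   have changed val, so work that uses val afterwards cannot start before
   this point. */
#define OPAQUE(val) __asm__ volatile("" : "+r"(val))

/* Hypothetical stand-in for the Propeller CNT register. */
volatile uint32_t fake_cnt;

/* Placeholder for the code being timed. */
static uint32_t do_some_stuff(uint32_t x)
{
    return x * 3u + 1u;
}

uint32_t timed_stuff(uint32_t in, uint32_t *out)
{
    uint32_t start = fake_cnt;      /* volatile read */
    OPAQUE(in);                     /* artificial dependency on the input */
    uint32_t r = do_some_stuff(in);
    FORCE_DEPENDENCY_R(r);          /* artificial use of the result */
    uint32_t elapsed = fake_cnt - start;   /* volatile read */
    *out = r;
    return elapsed;
}
```

Here in and r are ordinary non-volatile values.  The input operand forces
r to exist before the second asm, and the in/out operand keeps the
computation from starting before the first one.  This still relies on the
compiler keeping each asm volatile in order relative to the volatile
reads of the counter, which is the "only volatiles are ordered" part of
the comment above.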

As a digression, I guess what is actually needed to solve this problem 
_completely_ is an explicit keyword that enforces the barrier so no code 
or data movement occurs.

Would such a thing be difficult to implement?  Or is the architecture of 
GCC aligned against it?

I ask because I think this issue might be a real problem for this 
target.  The timing example I am complaining about above is not the real 
deal breaker for this target.  This is:

// do stuff
doSomthingThatWeMustWaitFor();
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
// do other stuff

or if you are trying to minimize stalled CPU time:

// do stuff
doSomthingThatWeMustWaitFor();
ct=CNT;
// do a little stuff
waitcnt(ct+SOMETIMEWEMUSTWAIT);
// do other stuff

waitcnt waits for clock cycles, so SOMETIMEWEMUSTWAIT can be very small 
(i.e. you can use it to wait for SPI/I2C clock periods and such).  If 
you waitcnt for a time in the past however, the thread locks until 
0xFFFFFFFF cycles have elapsed.  Even at 80MHz, this is a while.  Thus 
if the compiler decides to put a bunch of stuff between the read of CNT 
and the call to waitcnt, it can lock the thread for about a minute. 
That would be annoying to discover in a less taken branch.  Both CNT and 
waitcnt are compiler intrinsics that decay to a single opcode so 
wrapping it all in a noinline function would be a hit.
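
The "about a minute" figure follows from 32-bit wraparound.  A small
sketch of the arithmetic, assuming waitcnt stalls until a free-running
32-bit counter equals the target (cycles_until and stall_seconds are
names made up for this illustration):

```c
#include <stdint.h>

/* Cycles until a free-running 32-bit counter reaches target.
   Unsigned subtraction wraps modulo 2^32, which models a wait that only
   ends when the counter comes around to the target value again. */
uint32_t cycles_until(uint32_t now, uint32_t target)
{
    return target - now;
}

/* The same stall expressed in seconds for a given clock rate. */
double stall_seconds(uint32_t now, uint32_t target, double clock_hz)
{
    return (double)cycles_until(now, target) / clock_hz;
}
```

With now = 1000, a target of 1100 is 100 cycles away (1.25 us at 80MHz).
A target of 999, just one cycle in the past, is 0xFFFFFFFF cycles away,
which is roughly 53.7 seconds at 80MHz.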

This is a tremendously common idiom that you see all over code for this 
target so one can imagine that subtle bugs could creep in easily.

Even making lots of things volatile does not help because even a single 
non-volatile could sneak in there.  You have to make _everything_ 
volatile which would be a tremendous space and time hit.

So having a keyword that would just kill code movement across it would 
be excellent:

doSomthingThatWeMustWaitFor();
__codeBarrier;
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
__codeBarrier;

If it is not a complete impossibility I might suggest it to the 
propeller guys.

Anyway, thank you both for your comments and advice.

Best regards,

Tom