Hi David and Andrew,
On 5/10/2016 8:44 AM, David Brown wrote:
On 10/05/16 11:24, Andrew Haley wrote:
On 09/05/16 22:40, Tom Udale wrote:
My confusion is whether or not the memory clobberer prevents movement of
statements in addition to flushing any values that may be in registers
before the COMPILER_BARRIER() line. I.e. it is unclear if there is any
control over which statements the memory being flushed is attached to (if
that makes any sense).
Memory barriers only prevent code which accesses memory from being
moved.
It's not going to be easy to do what you're asking for in the
presence of optimization. GCC orders instruction generation by
dependency. I'd simply put all the code you want to time into a
separate function in a separate compilation unit and do this:
start = starttime(); foo(); end = endtime();
Alternatively, it might be possible to define asm inputs and outputs
so that your asm barriers depend on the statements you're trying
to measure.
Andrew.
An easy way is to make sure your "do some stuff" has at least one thing
that it depends on, and at least one result. You can even make the
dependencies completely artificial. But the key is to make these bits
volatile:
volatile unsigned ct;
volatile unsigned time;
volatile unsigned v1;
volatile unsigned v2;
{
ct = CNT;
v2 = doSomeStuff(v1);
time = CNT - ct;
}
I had a feeling that volatile would solve a lot of this, but I was
hoping for something a little less of a sledgehammer. volatile
unfortunately conflates code movement (which in this instance I care
about) with memory caching (which in this instance is fine).
But from what Andrew said it looks like anything that does not touch
memory is subject to movement anyway.
Interestingly the code I am struggling with is indeed exclusively memory
accesses (to my understanding at least):
while(1)
{
ct=CNT;
//...
// Do a bunch of other stuff.
//...
// Record the actual voltage/dacval we programmed above.
// We want that so we will have a consistent data set
// for p_calcLoad next time around the loop.
calcLoad_vDes=vDes;
calcLoad_vDesShortProt = vDesShortProt;
calcLoad_rawDacVal32 = rawDacVal32;
// Increment counters.
loopIndex++;
status->stimLoopIndex=loopIndex;
nextCycleTime+=DBUS_LOOP_TIME;
status->checkSum=CalcStimStatusCheckSum(status);
// Record total loop time.
COMPILER_BARRIER();
status->stimLoopTime=CNT-ct;
}
The resulting asm is:
341:cpx_stimdriver.cogc **** COMPILER_BARRIER();
627 .loc 2 341 0
342:cpx_stimdriver.cogc **** status->stimLoopTime=CNT-ct;
628 .loc 2 342 0
629 0514 0000BCA0 mov r7, CNT
630 0518 0000BCA0 mov r5, r14
631 051c 0000BC84 sub r7, r8
632 0520 1800FC80 add r5, #24
633 0524 1800FCA0 mov r4, #24
634 0528 0000BC80 add r4, sp
337:cpx_stimdriver.cogc **** nextCycleTime+=DBUS_LOOP_TIME;
635 .loc 2 337 0
636 052c 0000BCA0 mov r8, r13
637 .LVL65
332:cpx_stimdriver.cogc **** calcLoad_rawDacVal32 = rawDacVal32;
638 .loc 2 332 0
639 0530 0000BCA0 mov r10, r12
342:cpx_stimdriver.cogc **** status->stimLoopTime=CNT-ct;
640 .loc 2 342 0
641 0534 00003C08 wrlong r7, r5
331:cpx_stimdriver.cogc **** calcLoad_vDesShortProt = vDesShortProt;
642 .loc 2 331 0
643 0538 0800FCA0 mov r5, #8
644 053c 0000BC80 add r5, sp
645 0540 00003C08 wrlong r11, r5
211:cpx_stimdriver.cogc **** /*<*/ vDes=src->desiredV;
646 .loc 2 211 0
647 0544 0400FCA0 mov r7, #4
648 0548 1C00FCA0 mov r5, #28
649 054c 0000BC08 rdlong r6, sp
650 0550 0000BC80 add r7, sp
651 0554 0000BC80 add r5, sp
652 0558 00003C08 wrlong r6, r7
653 055c 0000BC08 rdlong r1, r4
654 0560 0000BC08 rdlong lr, r5
655 0564 00007C5C jmp #.L12
To my mind, the only reason that these would not be considered "memory
references" is because the compiler has already decided to put the
memory into registers (true these are mostly automatic variables so they
can easily be put into registers). But from an abstract machine
standpoint, a=b is a memory read and a memory write, is it not (or is
that a totally incorrect statement)? It seems odd therefore that a
register allocation choice can fundamentally alter program behavior.
It appears that the "memory" clobberer is operating at a level much
lower than the C statement, which means you _really_ need to be careful
with it, since what is memory and what is not could easily change with
the addition or removal of a reference (i.e. something that was once
held in memory no longer is, because a removed reference made another
variable better suited to the stack, leaving the one-time memory
variable the better candidate for a register).
You can also use inline assembly to force dependency calculations:
// Ensure that "val" has been calculated before the next volatile
// access by requiring it as an assembly input. Note that only
// volatiles are ordered!
#define forceDependency(val) \
    asm volatile("" :: "g" (val))
So this pretty much creates a NOP reference to val that will ensure it
cannot be moved after forceDependency?
I know that only volatile accesses are ordered with respect to one
another, and that non-volatiles can be interleaved with them. But your
macro works even for non-volatile vals, right? My logic being that "asm
volatile" prevents the forceDependency itself from moving, and the
reference to val forces val to be fully evaluated before
forceDependency, even if it is held in a register. No?
As a digression, I guess what is actually needed to solve this problem
_completely_ is an explicit keyword that enforces the barrier so no code
or data movement occurs.
Would such a thing be difficult to implement? Or is the architecture of
GCC aligned against it?
I ask because I think this issue might be a real problem for this
target. The timing example I am complaining about above is not the real
deal breaker for this target. This is:
// do stuff
doSomethingThatWeMustWaitFor();
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
// do other stuff
or if you are trying to minimize stalled CPU time:
// do stuff
doSomethingThatWeMustWaitFor();
ct=CNT;
// do a little stuff
waitcnt(ct+SOMETIMEWEMUSTWAIT);
// do other stuff
waitcnt waits for clock cycles, so SOMETIMEWEMUSTWAIT can be very small
(i.e. you can use it to wait for SPI/I2C clock periods and such). If
you waitcnt for a time in the past, however, the thread locks until
0xFFFFFFFF cycles have elapsed. Even at 80MHz, that is a while. Thus,
if the compiler decides to put a bunch of stuff between the read of CNT
and the call to waitcnt, it can lock the thread for about a minute.
That would be annoying to discover in a rarely taken branch. Both CNT and
waitcnt are compiler intrinsics that decay to a single opcode so
wrapping it all in a noinline function would be a hit.
This is a tremendously common idiom that you see all over code for this
target so one can imagine that subtle bugs could creep in easily.
Even making lots of things volatile does not help, because even a
single non-volatile could sneak in there. You would have to make
_everything_ volatile, which would be a tremendous space and time hit.
So having a keyword that would just kill code movement across it would
be excellent:
doSomethingThatWeMustWaitFor();
__codeBarrier;
waitcnt(CNT+SOMETIMEWEMUSTWAIT);
__codeBarrier;
If it is not a complete impossibility I might suggest it to the
propeller guys.
Anyway, thank you both for your comments and advice.
Best regards,
Tom