Re: arm-none-eabi, nested function trampolines and caching

Matthias Pfaller <leo@xxxxxxxx> · Wed, 29 Nov 2023 13:33:34 +0100

On 2023-11-29 12:52, David Brown wrote:
On 29/11/2023 08:50, Matthias Pfaller wrote:
On 2023-11-28 19:00, David Brown wrote:
 > Can I ask (either or both of you) why you are using nested functions like
 > this?  This is possibly the first time I have heard of anyone using them, certainly
 > the first time in embedded development. Even when I programmed in Pascal, where
 > nested functions are part of the language, I did not use them more than a couple of
 > times.
 >
 > What benefit do you see in nested functions in C, compared to having separate
 > functions?  Have you considered moving to C++ and using lambdas, which are more
 > flexible, standard, and can be very much more efficient?
 >
 > This is, of course, straying from the topicality of this mailing list. But I am
 > curious, and I doubt if I am the only one who is.

- I'm maintaining our token-threaded Forth interpreter. In the inner loop there is
an absurdly big switch for the primitives. I'm loading rp, sp and tos into local
variables. Pushing, popping and memory access are done by nested functions
(checking for stack overflows and underflows, managing tos, access violations, ...). Of
course that could be done by macros. But when I'm calling C functions from within
the switch I'll sometimes pass pointers to the local functions (e.g. for
catch/throw exception handling).

- When calling list iterators, I'm sometimes passing references to nested functions.

- When locking is necessary and the function has multiple return points I'm doing 
something like:

void somefunction(void)
{
   void f(void)
   {
      ...
   }
   lock();
   f();
   unlock();
}

I know, in a lot of cases I could just define some outer static function or use
gotos. But to my eye it just looks nicer that way. In most cases there will be no
trampoline necessary anyway. It's not used that often and we could probably get rid
of it in most cases by using macros and ({ ... }).
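
A minimal sketch of the "outer static function" alternative mentioned above, in standard C, so no nested function and no trampoline is involved. The names do_work and lock_depth, the stub lock()/unlock() bodies, and the int parameter and return value are assumptions made for illustration only; they are not taken from the original code.

```c
/* Hypothetical stand-ins for a real lock; they only count nesting depth. */
static int lock_depth;
static void lock(void)   { lock_depth++; }
static void unlock(void) { lock_depth--; }

/* The body with several return points lives in an ordinary static function. */
static int do_work(int request)
{
    if (request < 0)
        return -1;          /* early return: the caller still unlocks */
    if (request == 0)
        return 0;           /* second return point */
    return request * 2;
}

/* One lock/unlock pair, whichever return path do_work() takes. */
int somefunction(int request)
{
    lock();
    int result = do_work(request);
    unlock();
    return result;
}
```

The trade-off is the one described above: the helper moves out of the function it belongs to, and any local state it needs has to be passed in explicitly.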

Thanks for that.

I can appreciate that local functions can look nicer than macros or goto spaghetti.  
In simple cases (which is probably the majority for your usage), the local functions 
will be inlined and will give pretty much exactly the same code as you'd get for 
macros, outer static functions, or other methods.  But I'd be very unhappy to see 
trampolines here, as you will need for more complicated cases.  The overheads are not 
something you'd want to see in the inner loop of an interpreter.

AFAIUI, the reason the compiler has to generate trampolines here is to make a 
function that has access to some of the local variables, while being shoe-horned into 
the appearance of a function with parameters that don't include any extra values or 
references.  If you were, as an alternative, to switch to C++ and use lambdas instead
of nested functions, that all disappears, precisely because lambdas do not have to be
forced to match the function signature - the generated lambda can take extra hidden
parameters (and even extra hidden state) as needed.
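
The "extra hidden parameter" idea can also be written out by hand in plain C, as a rough sketch: the callback signature carries an explicit context pointer, so the callee reaches the caller's state without a trampoline. Everything here (for_each, visit_fn, sum_state, add_to_sum, sum_items) is a hypothetical name chosen for illustration; this is not what a C++ compiler emits for a lambda, only the same principle made explicit.

```c
#include <stddef.h>

/* Hypothetical iterator: the callback receives an explicit context pointer. */
typedef void (*visit_fn)(int value, void *ctx);

static void for_each(const int *items, size_t n, visit_fn visit, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        visit(items[i], ctx);
}

/* State a nested function would otherwise reach in the enclosing frame. */
struct sum_state {
    long total;
    size_t count;
};

/* Ordinary file-scope function; its "captured" state arrives through ctx. */
static void add_to_sum(int value, void *ctx)
{
    struct sum_state *s = ctx;
    s->total += value;
    s->count++;
}

long sum_items(const int *items, size_t n)
{
    struct sum_state s = { 0, 0 };
    for_each(items, n, add_to_sum, &s);   /* state passed explicitly */
    return s.total;
}
```

This only works when the API taking the callback can be changed to accept the extra pointer, which is exactly the constraint a trampoline works around.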

Of course it's never easy to change these kinds of things in existing code.  And it 
is particularly difficult to get solutions that work efficiently on a wide range of 
compilers or versions.

David

We are using (at the moment) two microcontrollers with cache. The at91sam4e is a
cortex-m4 device with two kilobytes of unified i/d-cache. Because the cache is unified,
it only has to be considered when using DMA.

The atsame7x/atsamv7x series is a cortex-m7 device with 16k i-cache and 16k d-cache.
Here you have to worry about i-cache invalidates. Evicting a single i-cache line (and the
trampoline code is small) doesn't hurt too much, especially if it happens very seldom
(it's not like every function passes pointers to nested functions...).

Besides that, at 300 MHz (or 120 MHz for the cortex-m4) the core is more than fast enough
for our applications. The 384k of RAM on the at91sam[ev]7x and the 128k of RAM on the
at91sam4e are a lot more of a hindrance...

I'm aware of the reason for the trampoline code and because of this I know that (as 
you wrote) in the majority of the cases trampoline code is not needed (because no 
outer arguments or variables are referenced and there is no passing of function 
pointers to other functions).

In the cases where the trampoline code is needed I'm willing to take the performance 
hit in exchange for the gain I get.

E.g. in the example with the interpreter inner loop, I would need to pass along all
kinds of state every time I call an external function needing access to local
interpreter state. If I pass just a pointer to a callback function (that will then
access local state), there is much less opportunity for errors... In most of the cases
passing the callback is not necessary anyway.
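
A toy sketch of that callback pattern, assuming GCC (nested functions are a GNU C extension, and taking their address normally needs an executable stack for the trampoline). The names run, push and push_twice are hypothetical and far simpler than the real interpreter; the point is only that the external helper sees a plain function pointer while the state stays local.

```c
/* GNU C only: nested functions are a GCC extension, not ISO C. */

/* Hypothetical external helper: it only sees a plain function pointer. */
static void push_twice(void (*push)(int), int value)
{
    push(value);
    push(value);
}

/* Toy inner loop: the stack index and error flag are locals of run(). */
int run(int *stack, int capacity)
{
    int sp = 0;         /* local interpreter state */
    int overflow = 0;

    /* Nested function: reaches sp, overflow, stack and capacity of run(). */
    void push(int value)
    {
        if (sp >= capacity) {
            overflow = 1;   /* the overflow check lives in one place */
            return;
        }
        stack[sp++] = value;
    }

    push(1);                /* direct call: no trampoline involved */
    push_twice(push, 7);    /* address taken: GCC generally needs a trampoline */

    return overflow ? -1 : sp;   /* number of items pushed, or -1 on overflow */
}
```

The pointer to push is only valid while run() is active, so this pattern fits callbacks that are invoked before the enclosing function returns.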

Matthias