Re: Atomic accesses on ARM microcontrollers

David Brown <david.brown@xxxxxxxxxxxx> · Sat, 10 Oct 2020 21:43:08 +0200

On 10/10/2020 14:39, Jonathan Wakely wrote:
> On Fri, 9 Oct 2020 at 19:29, David Brown <david.brown@xxxxxxxxxxxx> wrote:
>>
>> I don't know if this can be answered here, or would be best on the
>> development mailing list.  But I'll start on the help list.
>>
>> I work primarily with microcontrollers, with 32-bit ARM Cortex-M devices
>> being the most common these days.  I've been trying out atomics in gcc,
>> and I find it badly lacking.  (I've tried C11 <stdatomic.h>, C++11
>> <atomic>, and the gcc builtins - they all generate the same results,
>> which is to be expected.)  I'm concentrating on plain loads and stores
>> at the moment, not other atomic operations.
>>
>> These microcontrollers are all single core, so memory ordering does not
>> matter.
>>
>> For 8-bit, 16-bit and 32-bit types, atomic accesses are just simple
>> loads and stores.  These are generated fine.
>>
>> But for 64-bit and above, there are library calls to a compiler-provided
>> library.  For the Cortex M4 and M7 cores (and several other Cortex M
>> cores), the "load double register" and "store double register"
>> instructions are atomic (but not suitable for use with volatile data,
>> since they are restarted if they are interrupted).  The compiler
>> generates these for normal 64-bit types, but not for atomics.
>>
>> For larger types, the situation is far, far worse.  Not only is the
>> library code inefficient on these devices (disabling and re-enabling
>> global interrupts is the optimal solution in most cases, with load/store
>> with reservation being a second option), but it is /wrong/.  The library
>> uses spin locks (AFAICS) - on a single core system, that generally means
>> deadlocking the processor.  That is worse than useless.
>>
>> Is there any way I can replace this library with my own code here, while
>> still using the language atomics?
> 
> Yes. My understanding is that libatomic is designed to be replaceable
> by users who want to provide their own custom implementations of the
> API.
> 
> You're using bare metal ARM, right? For Arm on Linux I think there are
> kernel helpers that make the atomics efficient even when the hardware
> doesn't support them.
> 

Yes, I am using bare metal (well, sometimes an RTOS - but that's still a
lot closer to bare metal than to a host OS like Linux).  And I have a
single core - that makes atomics easier because I don't even need "dmb"
or other memory barrier instructions, and I can freely use the "disable
interrupts around the access" strategy.  On the other hand, it means
that the spin locks in libatomic are completely wrong.
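
To make the "disable interrupts around the access" idea concrete, here
is a minimal sketch of what I have in mind for a single-core Cortex-M.
The names irq_save, irq_restore, load64_irq_off and store64_irq_off are
made up for illustration, and I am assuming the code runs privileged so
that PRIMASK can be written.  The #else branch exists only so the sketch
compiles off-target; it gives no atomicity at all.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Cortex-M: remember PRIMASK, then mask interrupts. */
static inline uint32_t irq_save(void) {
    uint32_t primask;
    __asm__ volatile ("mrs %0, primask\n\tcpsid i"
                      : "=r" (primask) : : "memory");
    return primask;
}

/* Put PRIMASK back to what it was, so nested use stays correct. */
static inline void irq_restore(uint32_t primask) {
    __asm__ volatile ("msr primask, %0" : : "r" (primask) : "memory");
}
#else
/* Host fallback (assumption for off-target builds): no protection. */
static inline uint32_t irq_save(void) { return 0; }
static inline void irq_restore(uint32_t primask) { (void) primask; }
#endif

/* 64-bit load that cannot be torn by an interrupt on a single core. */
uint64_t load64_irq_off(const volatile uint64_t *p) {
    uint32_t saved = irq_save();
    uint64_t value = *p;
    irq_restore(saved);
    return value;
}

/* 64-bit store with the same protection. */
void store64_irq_off(volatile uint64_t *p, uint64_t value) {
    uint32_t saved = irq_save();
    *p = value;
    irq_restore(saved);
}
```

Saving and restoring PRIMASK, rather than unconditionally re-enabling
interrupts, keeps this safe to call from code that already has
interrupts disabled.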

If I understand you correctly, you mean that I can simply implement my
own version of __atomic_load_8 and other functions in libatomic?

I ran a quick test (using the godbolt.org online compiler).

By adding this to my file:

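#include <stdint.h>

/* Replaces libatomic's 8-byte load; "order" is ignored on a single core. */
extern inline
uint64_t __atomic_load_8(const volatile void * p, int order) {
    (void) order;
    const volatile uint64_t * q = (const volatile uint64_t *) p;
    return *q;    /* plain 64-bit read; became one ldrd in my test */
}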

then a straight load of a 64-bit atomic becomes a single "ldrd" load
double register instruction, which is optimal for this processor.  (In a
finished solution, I'd want to check that this is correct for different
flags - possibly adding function attributes for optimisation or inline
assembly to ensure that it is always correct.  But that's a detail for
me to check.)
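
For reference, the kind of user code I mean by "a straight load of a
64-bit atomic" looks roughly like this.  It is only a sketch:
tick_count, read_ticks and write_ticks are made-up names, and relaxed
ordering is my assumption for a single-core target.  On the 32-bit ARM
targets discussed here, these are the accesses that turn into 8-byte
libatomic calls when no replacement is visible.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit counter, e.g. also updated by an interrupt handler. */
static _Atomic uint64_t tick_count;

/* Plain atomic load; memory ordering is irrelevant on a single core. */
uint64_t read_ticks(void) {
    return atomic_load_explicit(&tick_count, memory_order_relaxed);
}

/* Plain atomic store, the counterpart of the load above. */
void write_ticks(uint64_t value) {
    atomic_store_explicit(&tick_count, value, memory_order_relaxed);
}
```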

The same worked for __atomic_store_8.

(The general load/store functions are a bit more involved, as are the
read-modify-write atomic functions.)
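
As a rough idea of what "more involved" means, here is a sketch of a
generic-size load and a 64-bit fetch-and-add, both done with interrupts
disabled.  All names here (irq_save, irq_restore, generic_load_irq_off,
fetch_add64_irq_off) are hypothetical; I have deliberately not used the
libatomic symbol names, and the parameter order of the generic load is
my own choice.  The #else branch is only there so the code builds
off-target and gives no protection.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Cortex-M: remember PRIMASK, then mask interrupts. */
static inline uint32_t irq_save(void) {
    uint32_t primask;
    __asm__ volatile ("mrs %0, primask\n\tcpsid i"
                      : "=r" (primask) : : "memory");
    return primask;
}
static inline void irq_restore(uint32_t primask) {
    __asm__ volatile ("msr primask, %0" : : "r" (primask) : "memory");
}
#else
/* Host fallback (assumption for off-target builds): no protection. */
static inline uint32_t irq_save(void) { return 0; }
static inline void irq_restore(uint32_t primask) { (void) primask; }
#endif

/* Generic-size load: copy "size" bytes out of src with interrupts off. */
void generic_load_irq_off(size_t size, const volatile void *src, void *dest) {
    const volatile unsigned char *s = (const volatile unsigned char *) src;
    unsigned char *d = (unsigned char *) dest;
    uint32_t saved = irq_save();
    for (size_t i = 0; i < size; i++) {
        d[i] = s[i];
    }
    irq_restore(saved);
}

/* Read-modify-write: add to *p and return the previous value. */
uint64_t fetch_add64_irq_off(volatile uint64_t *p, uint64_t increment) {
    uint32_t saved = irq_save();
    uint64_t old = *p;
    *p = old + increment;
    irq_restore(saved);
    return old;
}
```

The byte loop keeps every access to the source volatile-qualified; the
price is that interrupts stay off for a time proportional to the object
size, which matters for large types.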

Is this strategy guaranteed to work in gcc, or is it a case of "it works
in a simple test, but might fail in a complicated program or with
different flags"?