Re: duplicate a variable!!!!

On 01/04/2011 15:39, VAUGHAN Jay wrote:
I am not sure if that is the "proper" way to do it - because I am
far from convinced that there /is/ a good way to harden software
against memory errors using only software.

You harden your software so that, if the hardware fails, the software
can detect the problem and deal with it accordingly .. It very rarely
works out the other way around. :)


Watchdogs can help with some kinds of software errors - but such use is usually "abuse", and a misunderstanding of what watchdogs can do.

Software can detect and deal with hardware problems, but only if the software is running on more reliable hardware than the hardware with problems.

The purpose for duplicating variables is to detect when a bitflip has
occurred due to cosmic radiation or some such highly rare, but still
feasible, event, and then be able to deal with it in software,
without producing some wrong effect.


That can make sense if you have some types of memory that are more susceptible to errors than others. It would be useful, for example, if you have a radiation-hardened microcontroller running from internal memory or hardened ROM, but with weaker external data memory. But if your code is in equally vulnerable memory, you gain nothing by trying to duplicate variables.

"Real" solutions to hardening systems against unexpected errors in
memory are done in hardware. The most obvious case is to use ECC
memory.

The safety-critical world has been taught to *never* rely on this.
Bitflips do occur in the field, ECC be damned .. they are rare, but
they do happen.  ECC protection ends at the RAM bus .. what happens
if a bitflip occurs in a CPU register?  It's rare, but it does occur.


Safety and reliability are not absolutes - they are a sliding scale, and a compromise. ECC will give you a lot of extra reliability against likely bitflips - the chances of a bitflip happening in a DRAM cell in a typical system vastly outweigh the chances of a bitflip in a CPU register. So ECC gives you a big step up, for very little extra cost.

Depending on the processor, you may also have ECC on internal memories and caches. But in the end, you are correct - few processors have ECC or any other redundancy by the time you get to the CPU itself. That's why you duplicate the CPU if you are still concerned.


Realistically, radiation causes enough bitflips in DRAM cells that most servers have ECC memory. It causes so few bitflips in processors that most systems ignore that possibility - other risks greatly outweigh processor bitflips. When you are talking about something that is so safety-critical (or cost-critical, or high-risk - such as in space) that it is a real concern, then you duplicate or triplicate the processor, and/or you use radiation-hardened devices, and/or you use external shielding, and/or you use specialised processor designs with ECC right through to the register level.


Certainly nothing in software can be of any help when you are facing unreliability in the processor itself.


For more advanced reliability, you use two processor cores in
lock-step (this is done in some car engine controllers, for
example).

Or 2-out-of-3 configurations, and so on ..

The next step up is to do things in triplicate and use majority
voting (common on satellites and other space systems).

Common on rail/transportation systems, too - check my sig, this is
what I do.. ;)


Maybe I should be taking a slightly more humble tone here :-) I have done some safety-critical embedded software development, but not at your level - in the systems I have done, there has always been a human that can override the system in the event of failure, and it is always safe to switch off.

"Hardening" software by hacking the compiler to generate duplicate
variables sounds like an academic exercise at best.

It could be a very interesting feature if incorporated into gcc
mainline, some day, though ... I like the idea of having built-in
variable duplication as a compile option, but I have no idea how it
would be done in a way that satisfies GCC as a whole.


Well, if you are correct that such variable duplication is a help (I still don't see how, but you are more qualified than me to talk about it), then I agree that it would be a cool feature to have in gcc.

However, generally when I have seen discussions about the use of gcc in safety-critical development, the main concerns seem to be about testing, validation and certification of the compiler using things like Plum Hall. Do you use gcc for such safety-critical systems, and if so do you do any sort of certification for it?

The other feature that gcc could gain that would improve its use in safety-critical systems is a set of warnings for MISRA compliance. Much as I hate MISRA, it is a standard that is often used in such systems. And while bitflips due to radiation do occur, I think application code bugs are a much more common source of problems. Compile-time warnings and error checking are steadily improving with each new version of gcc, but I'd say that more work here would have greater benefits for code safety than automatic variable duplication.
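In the meantime, gcc's existing warnings already cover some of the same ground as MISRA checkers. As an illustration only - the exact flag set and the file name module.c are just examples, not a MISRA compliance recipe - a strict embedded build might look like:

```shell
# Example of a strict gcc invocation for embedded C. This is not MISRA
# checking, but it catches many of the same classes of bug at compile time.
gcc -std=c99 -Wall -Wextra -Wpedantic \
    -Wconversion -Wshadow -Wstrict-prototypes \
    -Werror \
    -c module.c -o module.o
```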



