Re: duplicate a variable!!!!

On 01/04/2011 15:39, VAUGHAN Jay wrote:
I am not sure if that is the "proper" way to do it - because I am
far from convinced that there /is/ a good way to harden software
against memory errors using only software.

You harden your software so that, if the hardware fails, the software
can detect the problem and deal with it accordingly .. It very rarely
works out the other way around. :)


Watchdogs can help with some kinds of software errors - but such use is usually "abuse", and a misunderstanding of what watchdogs can do.

Software can detect and deal with hardware problems, but only if the software is running on more reliable hardware than the hardware with problems.

The purpose for duplicating variables is to detect when a bitflip has
occurred due to cosmic radiation or some such highly rare, but still
feasible, event, and then be able to deal with it in software,
without producing some wrong effect.


That can make sense if you have some types of memory that are more susceptible to errors than others. It would be useful, for example, if you have a radiation-hardened microcontroller running from internal memory or hardened ROM, but with weaker external data memory. But if your code is in equally vulnerable memory, you gain nothing by trying to duplicate variables.

"Real" solutions to hardening systems against unexpected errors in
memory are done in hardware. The most obvious case is to use ECC
memory.

The safety-critical world has been taught to *never* rely on this.
Bitflips do occur in the field, ECC be damned .. they are rare, but
they do happen.  ECC protection ends at the RAM bus .. what happens
if a bitflip occurs in a CPU register?  It's rare, but it does occur.


Safety and reliability are not absolutes - they are a sliding scale, and a compromise. ECC will give you a lot of extra reliability against likely bitflips - the chances of a bitflip happening in a DRAM cell in a typical system vastly outweigh the chances of a bitflip in a CPU register. So ECC gives you a big step up, for very little extra cost.

Depending on the processor, you may also have ECC on internal memories and caches. But in the end, you are correct - few processors have ECC or any other redundancy by the time you get to the CPU itself. That's why you duplicate the CPU if you are still concerned.


Realistically, radiation causes enough bitflips in DRAM cells that most servers have ECC memory. It causes so few bitflips in processors that most systems ignore that possibility - other risks greatly outweigh processor bitflips. When you are talking about something that is so safety-critical (or cost-critical, or high-risk - such as in space) that it is a real concern, then you duplicate or triplicate the processor, and/or you use radiation-hardened devices, and/or you use external shielding, and/or you use specialised processor designs with ECC right through to the register level.


Certainly nothing in software can be of any help when you are facing unreliability in the processor itself.


For more advanced reliability, you use two processor cores in
lock-step (this is done in some car engine controllers, for
example).

Or 2-out-of-3 configurations, and so on ..

The next step up is to do things in triplicate and use majority
voting (common on satellites and other space systems).

Common on rail/transportation systems, too - check my sig, this is
what I do.. ;)


Maybe I should be taking a slightly more humble tone here :-) I have done some safety-critical embedded software development, but not at your level - in the systems I have done, there has always been a human that can override the system in the event of failure, and it is always safe to switch off.

"Hardening" software by hacking the compiler to generate duplicate
variables sounds like an academic exercise at best.

It could be a very interesting feature if incorporated into gcc
mainline, some day, though ... I like the idea of having built-in
variable duplication as a compile option, but I have no idea how it
would be done in a way that satisfies GCC as a whole.


Well, if you are correct that such variable duplication is a help (I still don't see how, but you are more qualified than me to talk about it), then I agree that it would be a cool feature to have in gcc.

However, generally when I have seen discussions about the use of gcc in safety-critical development, the main concerns seem to be about testing, validation and certification of the compiler using things like Plum Hall. Do you use gcc for such safety-critical systems, and if so do you do any sort of certification for it?

The other feature that gcc could gain that would improve its use in safety-critical systems is a set of warnings for MISRA compliance. Much as I hate MISRA, it is a standard that is often used in such systems. And while bitflips due to radiation do occur, I think application code bugs are a much more common source of problems. Compile-time warnings and error checking are steadily improving with each new version of gcc, but I'd say that more work here would have greater benefits for code safety than automatic variable duplication.
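In the meantime, gcc's existing warnings already cover some of the same ground as MISRA checkers. As an illustration only - the exact flag set and the file name module.c are just examples, not a MISRA compliance recipe - a strict embedded build might look like:

```shell
# Example of a strict gcc invocation for embedded C. This is not MISRA
# checking, but it catches many of the same classes of bug at compile time.
gcc -std=c99 -Wall -Wextra -Wpedantic \
    -Wconversion -Wshadow -Wstrict-prototypes \
    -Werror \
    -c module.c -o module.o
```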



