On 01/04/2011 15:39, VAUGHAN Jay wrote:
I am not sure if that is the "proper" way to do it - because I am
far from convinced that there /is/ a good way to harden software
against memory errors using only software.
You harden your software so that, if the hardware fails, the software
can detect the problem and deal with it accordingly .. It very rarely
works out the other way around. :)
Watchdogs can help hardware with some kinds of software errors - but such
use is usually "abuse" and a misunderstanding of what watchdogs can do.
Software can detect and deal with hardware problems, but only if the
software is running on more reliable hardware than the hardware with
problems.
The purpose for duplicating variables is to detect when a bitflip has
occurred due to cosmic radiation or some such highly rare, but still
feasible, event, and then be able to deal with it in software,
without producing some wrong effect.
That can make sense if you have some types of memory that are more
susceptible to errors than others. It would be useful, for example, if
you have a radiation-hardened microcontroller running from internal
memory or hardened ROM, but with weaker external data memory.
But if your code is in equally vulnerable memory, you gain nothing by
trying to duplicate variables.
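
As a rough sketch of what hand-written variable duplication might look
like in C: the value is stored together with its bitwise complement, and
every read checks that the two still match. The names guarded_u32,
guarded_write and guarded_read are made up for illustration and are not
part of gcc or any library. This only detects a mismatch - it cannot say
which copy is wrong, and it does nothing if the checking code itself
runs from equally vulnerable memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guarded variable: a value plus its bitwise complement. */
typedef struct {
    volatile uint32_t value;
    volatile uint32_t inverse;   /* holds ~value while both copies are intact */
} guarded_u32;

/* Store the value and its complement. */
static void guarded_write(guarded_u32 *g, uint32_t v)
{
    g->value = v;
    g->inverse = ~v;
}

/* Returns true and stores the value in *out if the two copies agree.
   Returns false (leaving *out untouched) if they do not, for example
   after a bitflip in either copy. */
static bool guarded_read(const guarded_u32 *g, uint32_t *out)
{
    uint32_t v = g->value;
    uint32_t inv = g->inverse;

    if ((v ^ inv) != UINT32_MAX) {
        return false;            /* copies disagree: caller must handle it */
    }
    *out = v;
    return true;
}
```

What the caller does on a false return (reset, fall back to a safe
state, reload from a trusted source) is system-specific and is left out
here.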
"Real" solutions to hardening systems against unexpected errors in
memory are done in hardware. The most obvious case is to use ECC
memory.
The safety-critical world has been taught to *never* rely on this.
Bitflips do occur in the field, ECC be damned .. they are rare, but
they do happen. ECC protection ends at the RAM bus .. what happens
if a bitflip occurs in a CPU register? It's rare, but it does occur.
Safety and reliability are not absolutes - it's a sliding scale, and a
compromise. ECC will give you a lot of extra reliability against likely
bitflips - the chances of a bitflip happening in a DRAM cell in a
typical system vastly outweigh the chances of a bitflip in a CPU
register. So ECC gives you a big step up, for very little extra cost.
Depending on the processor, you may also have ECC on internal memories
and caches. But in the end, you are correct - few processors have ECC
or any other redundancy by the time you get to the CPU itself. That's
why you duplicate the CPU if you are still concerned.
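
To illustrate the principle behind ECC, here is a toy Hamming(7,4) code
in C that corrects any single flipped bit in a 7-bit codeword. This is
only a software illustration: real ECC memory does the equivalent in
hardware, typically with wider codes that also detect double-bit errors.
The function names hamming74_encode and hamming74_decode are my own.

```c
#include <stdint.h>

/* Encode 4 data bits (bits 0..3 of data) into a 7-bit Hamming codeword.
   Bit i of the result (i = 0..6) holds codeword position i + 1.
   Positions 1, 2 and 4 are parity bits; 3, 5, 6 and 7 carry the data. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d0 = (data >> 0) & 1u;
    uint8_t d1 = (data >> 1) & 1u;
    uint8_t d2 = (data >> 2) & 1u;
    uint8_t d3 = (data >> 3) & 1u;

    uint8_t p1 = d0 ^ d1 ^ d3;   /* even parity over positions 1, 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* even parity over positions 2, 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* even parity over positions 4, 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a 7-bit codeword, correcting at most one flipped bit.
   If corrected_pos is non-null, it receives the 1-based position of the
   corrected bit, or 0 if the codeword was already consistent. */
static uint8_t hamming74_decode(uint8_t code, int *corrected_pos)
{
    uint8_t syndrome = 0;

    /* XOR together the positions of all set bits. For a valid codeword
       the result is 0; after a single flip it is the flipped position. */
    for (unsigned pos = 1; pos <= 7; pos++) {
        if ((code >> (pos - 1)) & 1u) {
            syndrome ^= (uint8_t)pos;
        }
    }

    if (syndrome != 0) {
        code ^= (uint8_t)(1u << (syndrome - 1));   /* flip the bad bit back */
    }
    if (corrected_pos) {
        *corrected_pos = syndrome;
    }

    /* Extract the data bits from positions 3, 5, 6 and 7. */
    return (uint8_t)(((code >> 2) & 1u) |
                     (((code >> 4) & 1u) << 1) |
                     (((code >> 5) & 1u) << 2) |
                     (((code >> 6) & 1u) << 3));
}
```

Two flipped bits in the same codeword would be "corrected" to the wrong
value by this code, which is one reason ECC alone is not treated as a
guarantee.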
Realistically, radiation causes enough bitflips in DRAM cells that most
servers have ECC memory. It causes so few bitflips in processors that
most systems ignore that possibility - other risks greatly outweigh
processor bitflips. When you are talking about something that is so
safety-critical (or cost-critical, or high-risk - such as in space) that
it is a real concern, then you duplicate or triplicate the processor,
and/or you use radiation-hardened devices, and/or you use external
shielding, and/or you use specialised processor designs with ECC right
through to the register level.
Certainly nothing in software can be of any help when you are facing
unreliabilities in the processor itself.
For more advanced reliability, you use two processor cores in
lock-step (this is done in some car engine controllers, for
example).
Or 2-out-of-3 configurations, and so on ..
The next step up is to do things in triplicate and use majority
voting (common on satellites and other space systems).
Common on rail/transportation systems, too - check my sig, this is
what I do.. ;)
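
The voting logic itself is simple. A minimal C sketch of a bitwise
2-out-of-3 majority vote over three copies of a value follows; the names
vote3, tmr_u32, tmr_write and tmr_read are invented for this example. In
a real triplicated system the three copies come from independent
hardware channels and the voter is usually hardware too, so keeping
three copies in the memory of one processor only shows the principle.

```c
#include <stdint.h>

/* Bitwise majority: each result bit takes the value held by at least
   two of the three inputs. */
static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Hypothetical triplicated variable. */
typedef struct {
    volatile uint32_t copy[3];
} tmr_u32;

/* Write the same value to all three copies. */
static void tmr_write(tmr_u32 *t, uint32_t v)
{
    t->copy[0] = v;
    t->copy[1] = v;
    t->copy[2] = v;
}

/* Read with voting. If any copy differs from the voted value, all three
   are rewritten with it ("scrubbing"). If disagreements is non-null, it
   receives the number of copies that differed from the voted value. */
static uint32_t tmr_read(tmr_u32 *t, unsigned *disagreements)
{
    uint32_t a = t->copy[0];
    uint32_t b = t->copy[1];
    uint32_t c = t->copy[2];
    uint32_t voted = vote3(a, b, c);
    unsigned bad = (unsigned)(a != voted) + (unsigned)(b != voted) +
                   (unsigned)(c != voted);

    if (bad != 0) {
        tmr_write(t, voted);     /* repair the outvoted copies */
    }
    if (disagreements) {
        *disagreements = bad;
    }
    return voted;
}
```

The vote is only right when, for each bit, at most one copy is wrong; if
the same bit flips in two copies, the corrupted value wins.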
Maybe I should be taking a slightly more humble tone here :-) I have
done some safety-critical embedded software development, but not at your
level - in the systems I have done, there has always been a human that
can override the system in the event of failure, and it is always safe
to switch off.
"Hardening" software by hacking the compiler to generate duplicate
variables sounds like an academic exercise at best.
It could be a very interesting feature if incorporated into gcc
mainline, some day, though ... I like the idea of having built-in
variable duplication as a compile option, but I have no idea how it
would be done in a way that satisfies GCC as a whole.
Well, if you are correct that such variable duplication is a help (I
still don't see how, but you are more qualified than me to talk about
it), then I agree that it would be a cool feature to have in gcc.
However, generally when I have seen discussions about the use of gcc in
safety-critical development, the main concerns seem to be about testing,
validation and certification of the compiler using things like Plum
Hall. Do you use gcc for such safety-critical systems, and if so do you
do any sort of certification for it?
The other feature that gcc could gain that would improve its use in
safety-critical systems is a set of warnings for MISRA compliance. Much
as I hate MISRA, it is a standard that is often used in such systems.
And while bitflips due to radiation do occur, I think application code
bugs are a much more common source of problems. Compile-time warnings
and error checking are steadily improving with each new version of gcc,
but I'd say that more work here would have greater benefits for code
safety than automatic variable duplication.