On 04/04/2011 09:43, VAUGHAN Jay wrote:
Realistically, radiation causes enough bitflips in DRAM cells that
most servers have ECC memory. It causes so few bitflips in
processors that most systems ignore that possibility - other risks
greatly outweigh processor bitflips. When you are talking about
something that is so safety-critical (or cost-critical, or
high-risk - such as in space) that it is a real concern, then you
duplicate or triplicate the processor, and/or you use
radiation-hardened devices, and/or you use external shielding,
and/or you use specialised processor designs with ECC right through
to the register level.
... and then, on top of that, you duplicate your variables into
different parts of RAM and do a bit comparison before you use them
for some purpose, just in case you have the chance to catch something
- and perform a safety reaction - in time to avert disaster.
Is that really a win, overall? There is no point in doing any error
checking unless you know how to handle an error - so this means more
code, probably more data, and more scope for errors (both design and
programming errors, and run-time errors due to bit fails in the extra
code and data memory).
/All/ error detection or correction systems add more to the overall
system, so you have to weigh the potential benefits against the
potential risks. With ECC on the memory, the implementation is clear
and its scope is limited - it is possible to analyse the costs and
benefits and see that it is (normally) overall a benefit. But
duplicating some variables and checking them at certain points sounds
very ad hoc to me, and it is difficult to be sure you are doing more
good than harm. If I felt the need to duplicate data into different
parts of memory (or different memory chips), then I'd prefer to do it
simply and cleanly in hardware.
Certainly nothing in software can be of any help when you are
facing unreliabilities in the processor itself.
Aggressive online testing is really the only thing you can do to
detect such errors in the first place and - hopefully - have time to
perform a suitable safety reaction.
Maybe I should be taking a slightly more humble tone here :-) I
have done some safety-critical embedded software development, but
not at your level - in the systems I have done, there has always
been a human that can override the system in the event of failure,
and it is always safe to switch off.
Well, when you've got hundreds of tonnes of train hurtling down the
track, and the light says "RED", and your computer says "GREEN"
because it's got a problem, you don't have time for humans to get
involved. ;)
See point 9 in
<http://www.jokesunlimited.com/jokes/if_microsoft_made_cars.html> :-)
I'm no expert on this issue, I merely work in the group of experts
providing an industrial-standard solution, and I'm learning a lot
too.
Well, if you are correct that such variable duplication is a help
(I still don't see how, but you are more qualified than me to talk
about it), then I agree that it would be a cool feature to have in
gcc.
It is not supposed to be a full solution - early detection of RAM
corruption can only help take the system being depended on offline,
so that backup systems can take over.
Yes, these things are always built up with layers of safety features and
backups.
However, generally when I have seen discussions about the use of
gcc in safety-critical development, the main concerns seem to be
about testing, validation and certification of the compiler using
things like Plum Hall. Do you use gcc for such safety-critical
systems, and if so do you do any sort of certification for it?
Of course, full validation and certification of the compiler is a
requirement as part of its use for this purpose.
Do you do that validation yourself, or do you buy pre-validated gcc
binaries somewhere?
The other feature that gcc could gain that would improve its use
in safety-critical systems is a set of warnings for MISRA
compliance. Much as I hate MISRA, it is a standard that is often
used in such systems.
It's one thing to hate standards; it's another thing to implement them,
ship them as part of a product, and reliably see the fruits of such
labour in safety statistics. ;)
There are some (though not many) parts of MISRA that I think are
directly detrimental to code safety. I am a great believer that clear
and readable code has less chance of errors than unclear code. MISRA
has some rules that are formulated from the days of more limited
compilers and poorer compile-time checking, and that I don't think match
good programming style. It's a long time since I read through MISRA (I
don't have a copy at the moment, though I'll probably get one before
long), but one rule I remember as standing out recommends the style "if
(1 == x) ...". It asks you to think backwards, read backwards and
write backwards, to avoid a design flaw in the C language that is easily
checked by the compiler.
Still, I have no statistics to back up my opinions, and I fully
appreciate the value of standards like MISRA even if I don't like them!
And while bitflips due to radiation do occur, I think application
code bugs are a much more common source of problems.
Absolutely, there are no absolutes! :)
Compile-time warnings and error checking are steadily improving
with each new version of gcc, but I'd say that more work here would
have greater benefits for code safety than automatic variable
duplication.
I concur - and this is why I suggest that anyone looking at a
safety-critical application, wanting to do variable duplication for
protective reasons, implement it themselves.