> Realistically, radiation causes enough bitflips in DRAM cells that most
> servers have ECC memory. It causes so few bitflips in processors that
> most systems ignore that possibility - other risks greatly outweigh
> processor bitflips. When you are talking about something that is so
> safety-critical (or cost-critical, or high-risk - such as in space) that
> it is a real concern, then you duplicate or triplicate the processor,
> and/or you use radiation-hardened devices, and/or you use external
> shielding, and/or you use specialised processor designs with ECC right
> through to the register level.

... and then, on top of that, you duplicate your variables into different
parts of RAM and do a bit comparison before you use them for some
purpose, just in case you have the chance to catch something - and
perform a safety reaction - in time to avert disaster.

> Certainly nothing in software can be of any help when you are facing
> unreliabilities in the processor itself.

Aggressive online testing is really the only thing you can do to detect
such errors in the first place and - hopefully - have time to perform a
suitable safety reaction.

> Maybe I should be taking a slightly more humble tone here :-) I have
> done some safety-critical embedded software development, but not at your
> level - in the systems I have done, there has always been a human that
> can override the system in the event of failure, and it is always safe
> to switch off.

Well, when you've got hundreds of tonnes of train hurtling down the
track, and the light says "RED", and your computer says "GREEN" because
it's got a problem, you don't have time for humans to get involved. ;)

I'm no expert on this issue, I merely work in the group of experts
providing an industrial-standard solution, and I'm learning a lot too.
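
For anyone wanting to try the duplicate-and-compare idea described above,
here is a minimal sketch in C. It is only an illustration under stated
assumptions, not how any particular product does it: the names
guarded_u32, guarded_write and guarded_read are made up for this example,
the second copy is stored bit-inverted so that a stuck-at fault hitting
both copies the same way is also caught, and the placement of the two
copies in different RAM regions (normally done with a linker script or
section attributes) is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handle: the two copies live in separate objects so that a
 * linker script (not shown) could place them in different RAM regions. */
typedef struct {
    volatile uint32_t *primary;  /* working copy */
    volatile uint32_t *shadow;   /* bitwise complement of the working copy */
} guarded_u32;

/* Store the value and its complement. */
static void guarded_write(const guarded_u32 *g, uint32_t v)
{
    *g->primary = v;
    *g->shadow  = ~v;
}

/* Returns true and stores the value in *out only if both copies agree.
 * On false, the caller is expected to trigger its safety reaction. */
static bool guarded_read(const guarded_u32 *g, uint32_t *out)
{
    uint32_t p = *g->primary;
    uint32_t s = *g->shadow;

    if ((p ^ s) != UINT32_MAX) {  /* any bit that fails to differ means corruption */
        return false;
    }
    *out = p;
    return true;
}
```

The caller checks the return value of guarded_read() before every use of
the variable and, on failure, goes to whatever the safe state is for that
system. As said above, this only detects corruption of the stored data; it
does nothing about faults in the processor itself.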
> Well, if you are correct that such variable duplication is a help (I
> still don't see how, but you are more qualified than me to talk about
> it), then I agree that it would be a cool feature to have in gcc.

It is not supposed to be a full solution - early detection of RAM
corruption can only help bring the system that is being depended on
offline, so that other backup systems can be switched in in its place.

> However, generally when I have seen discussions about the use of gcc in
> safety-critical development, the main concerns seem to be about testing,
> validation and certification of the compiler using things like Plum
> Hall. Do you use gcc for such safety-critical systems, and if so do you
> do any sort of certification for it?

Of course, validation and certification of the compiler are required as
part of its use for this purpose.

> The other feature that gcc could gain that would improve its use in
> safety-critical systems is a set of warnings for MISRA compliance. Much
> as I hate MISRA, it is a standard that is often used in such systems.

It's one thing to hate standards, it's another thing to implement them,
ship them as part of a product, and reliably see the fruits of such
labour in safety statistics. ;)

> And while bitflips due to radiation do occur, I think application code
> bugs are a much more common source of problems.

Absolutely, there are no absolutes! :)

> Compile-time warnings
> and error checking are steadily improving with each new version of gcc,
> but I'd say that more work here would have greater benefits for code
> safety than automatic variable duplication.

I concur - and this is why I suggest that anyone looking at a
safety-critical application, wanting to do variable duplication for
protective reasons, implement it themselves.

--
Jay Vaughan, Software Developer
Thales Austria GmbH
Scheydgasse 41
1210 Vienna
AUSTRIA