On Mon, Apr 20, 2020 at 1:57 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> > (a) is a trap, not an exception - so the instruction has been done,
> > and you don't need to try to emulate it or anything to continue.
>
> Maybe for errors on the data side of the pipeline. On the instruction
> side we can usually recover from user space instruction fetches by
> just throwing away the page with the corrupted instructions and reading
> from disk into a new page. Then just point the page table to the new
> page, and hey presto, its all transparently fixed (modulo time lost fixing
> things).

That's true for things like ECC on real RAM, with traditional
executables. It's not so true of something like nvram that you execute
out of directly - there is not necessarily a disk to re-read things
from.

But it's also not true of things like JITs. And they are kind of a big
thing. Asking the JIT to do "hey, I faulted at a random point, you need
to re-JIT" is no different from all the other cases of "that's a
_really_ painful recovery point, please delay it".

Sure, the JIT environment will probably just have to kill that thread
anyway, but I do think this falls under the same rule: you're better
off giving the _option_ to just continue and hope for the best than
forcing a non-recoverable state.

For regular ECC, I would literally like the machine to just always
continue. I'd like to be informed that something bad is going on
(because it might be RAM going bad, but it might also be a rowhammer
attack), but the decision to kill things or not should ultimately be
the *user's* - not the JIT's, and not the kernel's.

So the basic rule should be that you should always have the _option_
to just continue. The corrupted state might not be critical - or it
might be the ECC bits themselves that got corrupted, not the data.
There are situations where stopping everything is worse than "let's
continue as best we can, and inform the user with a big red blinking
light".
And note that ECC should not make things less reliable, even if it's
another 10+% of bits that can go wrong. It should also be noted that
even a good ECC pattern _can_ miss corruption if you're unlucky with
the corruption.

So the whole black-and-white model of "ECC means you need to stop
everything" is questionable to begin with, because the signal isn't
that absolute in the first place. When somebody brings up "what if I
use corrupted data and make things worse", they are making an
intellectually dishonest argument. What if you used corrupted data and
simply never caught it, because it was an unlucky multi-bit failure?

There is no "absolute" thing about ECC. The only thing that is _never_
wrong is to report the error and try to continue, and let some
higher-level entity decide what to do.

And that final decision might literally be "I ran this simulation for
two days, and I see that there's an error report. I will buy a new
machine. For now I'll use the data it generated, but I'll re-run it
later to validate".

              Linus