On Mon, Apr 20, 2020 at 11:20 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> * I'm at a loss of why you seem to be suggesting that hardware should
> / could avoid all exceptions. What else could hardware do besides
> throw an exception on consumption of a naturally occurring multi-bit
> ECC error? Data is gone, and only software might know how to recover.

This is classic bogus thinking.

If Intel ever makes ECC DRAM available to everybody, there would be a _shred_ of logic to that thinking, but right now it's some hw designer in their mom's basement that has told you that hardware has to throw a synchronous exception because hardware doesn't know any better.

That hardware designer really doesn't have a _clue_ about the big issues.

The fact is, a synchronous machine check exception is about the _worst_ thing you can ever do when you encounter something like a memory error. It literally means that the software cannot possibly do anything sane to recover, because the software is in some random place.

The hardware designer didn't think about the fact that the low-level access is hidden from the software by a compiler and/or a lot of other infrastructure - maybe microcode, maybe the OS, maybe a scripting language, yadda yadda.

Absolutely NOBODY can recover at the level of one instruction. The microcode people already proved that. At the level of "memcpy()", you do not have a recovery option.

A hardware designer that tells you that you have to magically recover at an instruction boundary fundamentally DOES NOT UNDERSTAND THE PROBLEM.

IOW, you have completely uncritically just taken that incorrect statement of "what else could hardware do" without questioning that hardware designer AT ALL.

And remember, this is likely the same hardware designer that told you that it's a good idea to then make machine checks go to every single CPU in the system.

And this is the same hardware designer that then didn't even give you enough information to recover.

And this is the same hardware designer that made recovery impossible anyway, because if the error happened in microcode or in some other situation, the microcode COULDN'T HANDLE IT EITHER!

In other words, you are talking to people WHO ARE KNOWN TO BE INCOMPETENT.

Seriously. Question them. When they tell you that "it's the only thing we can possibly do", they do so from being incompetent, and we have the history to PROVE it.

I don't understand why nobody else seems to be pushing back against the completely insane and known garbage that is the Intel machine checks. They are wrong.

The fact is, synchronous errors are the absolute worst possible interface, exactly because they cause problems in various nasty corner cases. We _know_ a lot of those corner cases, for chrissake:

 - random standard library routine like "memcpy". How the hell do you think a memcpy can recover? It can't.

 - Unaligned handling where "one" access isn't actually a single access.

 - microcode. Intel saw this problem themselves, but instead of making people realize "oh, synchronous exceptions don't work that well" and think about the problem, they wasted our time for decades, and then probably spent a lot of effort in trying to make them work.

 - random generic code that isn't able to handle the fault, because IT SHOULDN'T NEED TO CARE. Low-level filesystems, user mappings, the list just goes on.

The only thing that can recover tends to be at a *MUCH* higher level than one instruction access.
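
[Editorial illustration, not part of the original mail: a minimal C sketch of what "recover at a much higher level than one instruction" can look like in software. The names try_copy_block() and read_block() are hypothetical, not a real kernel API; the real kernel helpers in this area (memcpy_mcsafe(), later copy_mc_to_kernel()) follow a similar "report a short copy to the caller" pattern rather than dying inside the copy loop.]

	#include <stddef.h>
	#include <string.h>

	/*
	 * Hypothetical copy primitive: copies up to 'len' bytes from 'src' and
	 * returns how many bytes it actually copied before a reported memory
	 * error stopped it.  Trivial stand-in body so the sketch compiles and
	 * runs; a real implementation would stop early at a poisoned line.
	 */
	static size_t try_copy_block(void *dst, const void *src, size_t len)
	{
		memcpy(dst, src, len);
		return len;
	}

	/*
	 * The caller is the first place with enough context to do something
	 * sane: zero-fill the unreadable tail, flag the buffer as suspect,
	 * and let an even higher layer (filesystem, application, user)
	 * decide what to do with the partial data.
	 */
	static int read_block(void *dst, const void *src, size_t len)
	{
		size_t copied = try_copy_block(dst, src, len);

		if (copied == len)
			return 0;		/* clean copy */

		memset((char *)dst + copied, 0, len - copied);
		return -1;			/* data incomplete; recover higher up */
	}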
So the next time somebody tells you "there's nothing else we can do", call them out on being incompetent, and call them out on the fact that history has _shown_ that they are incompetent and wrong. Over and over again.

I can _trivially_ point to a number of "what else could we do" that are much better options.

 (a) let the corrupted value through, notify things ASYNCHRONOUSLY that there were problems, and let people handle the issue later when they are ready to do so.

 Yeah, the data was corrupt - but not allowing the user to read it may mean that the whole file is now inaccessible, even if it was just a single ECC block that was wrong. I don't know the block-size people use for ECC, and the fact is, software shouldn't really even need to care. I may be able to recover from the situation at a higher level. The data may be recoverable other ways - including just a "I want even corrupted data, because I might have enough context that I can make sense of it anyway".

 (b) if you have other issues so that you don't have data at all (maybe it was metadata that was corrupted, don't ask me how that would happen), just return zeroes, and notify about the problem ASYNCHRONOUSLY.

And when notifying, give as much useful information as possible: the virtual and physical address of the data, please, in addition to maybe lower level bank information. Add a bit for "multiple errors", so that whoever then later tries to maybe recover, can tell if it has complete data or not.

The OS can send a SIGBUS with that information to user land that can then maybe recover. Or it can say "hey, I'm in some recovery mode, I'll try to limp along with incomplete data". Sometimes "recover" means "try to limp along, notify the user that their hw is going bad, but try to use the data anyway".

Again, what Intel engineers actually did with the synchronous non-recoverable thing was not "the only thing I could possibly have done". It was literally the ABSOLUTE WORST POSSIBLE THING they could do, with potentially a dead machine as a result.

And now - after years of pain - you have the gall to repeat that idiocy that you got from people who have shown themselves to be completely wrong in the past?

Why? Why do you take their known wrong approach as unthinking gospel? Just because it's been handed down over generations on stone slabs?

I really really detest the whole mcsafe garbage. And I absolutely *ABHOR* how nobody inside of Intel has apparently ever questioned the brokenness at a really fundamental level.

That "I throw my hands in the air and just give up" thing is a disease.

It's absolutely not "what else could we do".

Linus
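
[Editorial illustration, not part of the original mail: a minimal userspace sketch of the asynchronous SIGBUS notification path described above. On Linux the memory-failure code can deliver SIGBUS with si_code BUS_MCEERR_AO ("action optional", asynchronous) or BUS_MCEERR_AR ("action required") and the affected address in si_addr; the policy choices in the handler below are illustrative, not prescribed by the mail or the kernel.]

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>

	/* Invoked when the kernel reports a memory error via SIGBUS. */
	static void mce_sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		(void)sig; (void)ctx;

		/* fprintf is not async-signal-safe; fine for a sketch only. */
		if (si->si_code == BUS_MCEERR_AO) {
			/*
			 * Asynchronous "action optional" report: data at si_addr
			 * is bad, but we were not consuming it right now.  Mark
			 * it and limp along.
			 */
			fprintf(stderr, "memory error reported at %p, will recover later\n",
				si->si_addr);
			return;
		}

		if (si->si_code == BUS_MCEERR_AR)
			/* "Action required": we actually touched poisoned data. */
			fprintf(stderr, "poisoned data consumed at %p\n", si->si_addr);

		_exit(EXIT_FAILURE);
	}

	int main(void)
	{
		struct sigaction sa = { 0 };

		sa.sa_sigaction = mce_sigbus_handler;
		sa.sa_flags = SA_SIGINFO;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGBUS, &sa, NULL);

		/*
		 * ... application work; a real consumer might also use
		 * prctl(PR_MCE_KILL, ...) to ask for early SIGBUS delivery ...
		 */
		return 0;
	}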