On Mon, Apr 20, 2020 at 1:25 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> ...but also some kind of barrier semantic, right? Because there are
> systems that want some guarantees when they can commit or otherwise
> shoot the machine if they can not.

The optimal model would likely be a new instruction that could be done
in user space and test for it, possibly without any exception at all
(because the thing that checks for errors is also presumably the only
thing that can decide how to recover - so raising an exception doesn't
necessarily help).

Together with a way for the kernel to save/restore the exception state
on task switch (presumably in the xsave area) so that the error state
of one process doesn't affect another one. Bonus points if it's all
per-security level, so that a pure user-level error report doesn't
poison the kernel state and vice versa.

That is _very_ similar to how FPU exceptions work right now. User
space can literally do an operation that creates an error on one CPU,
get re-scheduled to another one, and take the actual signal and read
the exception state on that other CPU.

(Of course, the "not even take an exception" part is different).

An alternate very simple model that doesn't require any new
instructions and no new architecturally visible state (except of
course the actual error data) would be to just raise a *maskable*
trap (with the Intel definition of trap vs exception: a trap happens
_after_ the instruction).

The trap could be on the next instruction if people really want to be
that precise, but I don't think it even matters. If it's delayed until
the next serializing instruction, that would probably be just fine
too.

But the important thing is that it

 (a) is a trap, not an exception - so the instruction has been done,
and you don't need to try to emulate it or anything to continue.

 (b) is maskable, so that the trap handler can decide to just mask it
and return (and set a separate flag to then handle it later).

With domain transfers either being barriers, or masking it (so NMI and
external interrupts would presumably mask it for latency reasons)?

I dunno. Wild handwaving. But much better than that crazy
unrecoverable machine check model.

              Linus
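
To make the first model above a bit more concrete, here is a toy
user-space simulation of a sticky error state that is polled rather than
trapped on, and that is saved and restored on task switch so one task's
error never shows up in another. Nothing in it is a real CPU or kernel
interface: struct cpu_err_state, struct task, hw_report_error(),
switch_to() and test_and_clear_errors() are all hypothetical names made
up for illustration, and the single global "cpu" stands in for whatever
per-CPU register such hardware might expose.

```c
#include <stdint.h>

/* Hypothetical sticky error register: bits stay set until software clears them. */
struct cpu_err_state {
    uint32_t status;
};

/* Hypothetical task: carries its own copy of the error state, loosely
 * analogous to a slot in the xsave area. */
struct task {
    struct cpu_err_state saved;
};

/* Live error state of the one simulated CPU. */
static struct cpu_err_state cpu;

/* Stand-in for the hardware noticing an error: it only sets sticky bits,
 * no exception or trap is raised. */
static void hw_report_error(uint32_t bits)
{
    cpu.status |= bits;
}

/* Stand-in for a context switch: save the outgoing task's error state and
 * load the incoming task's, so errors do not leak between tasks. */
static void switch_to(struct task *prev, struct task *next)
{
    prev->saved = cpu;
    cpu = next->saved;
}

/* Stand-in for the proposed "test for it" instruction: return the pending
 * error bits of the current task and clear them. */
static uint32_t test_and_clear_errors(void)
{
    uint32_t pending = cpu.status;

    cpu.status = 0;
    return pending;
}
```

In this sketch, an error reported while task A is running stays invisible
to task B after switch_to(&A, &B), and is still pending for A when it is
switched back in, which is the same shape as the FPU behaviour described
above.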