On 11/13/19 12:23 AM, Paolo Bonzini wrote:
> On 13/11/19 07:38, Jan Kiszka wrote:
>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>> couldn't simply be handled by the host. But I suppose the symptom of
>> that erratum is not "just" regular recoverable MCE, rather
>> sometimes/always an unrecoverable CPU state, despite the error code, right?
>
> The erratum documentation talks explicitly about hanging the system, but
> it's not clear if it's just a result of the OS mishandling the MCE, or
> something worse. So I don't know. :( Pawan, do you?

It's "something worse".

I built a kernel module reproducer for this a long time ago. The symptom
I observed was the whole system hanging hard, requiring me to go hit the
power button. The MCE software machinery was not involved at all from
what I could tell.

About creating a unit test, I'd be personally happy to share my
reproducer, but I built it before this issue was root-caused. There are
actually quite a few underlying variants and a good unit test would make
sure to exercise all of them. My reproducer probably only exercised a
single case.