Hi Dave,

On 11/14/19 9:09 AM, Jan Kiszka wrote:
> On 13.11.19 22:24, Dave Hansen wrote:
>> On 11/13/19 12:23 AM, Paolo Bonzini wrote:
>>> On 13/11/19 07:38, Jan Kiszka wrote:
>>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>>> couldn't simply be handled by the host. But I suppose the symptom of
>>>> that erratum is not "just" regular recoverable MCE, rather
>>>> sometimes/always an unrecoverable CPU state, despite the error code,
>>>> right?
>>> The erratum documentation talks explicitly about hanging the system, but
>>> it's not clear if it's just a result of the OS mishandling the MCE, or
>>> something worse. So I don't know. :( Pawan, do you?
>>
>> It's "something worse".
>>
>> I built a kernel module reproducer for this a long time ago. The
>> symptom I observed was the whole system hanging hard, requiring me to go
>> hit the power button. The MCE software machinery was not involved at
>> all from what I could tell.
>
> Thanks for clarifying this - too bad.
>
>>
>> About creating a unit test, I'd be personally happy to share my
>> reproducer, but I built it before this issue was root-caused. There are

I'd appreciate if you could share your code.

>> actually quite a few underlying variants and a good unit test would make
>> sure to exercise all of them. My reproducer probably only exercised a
>> single case.

Still, it triggers the issue, that's enough to compare it to my reproducer.

>>
>
> Would be interesting to see this. Ralf and tried something quickly, but
> there seems to be a detail missing or wrong.

Yep, we still can't reproduce the issue on an affected CPU, and don't
know what we miss.

Thanks,
  Ralf

>
> Jan
>