On Thu, Nov 14, 2019 at 09:13:22AM +0100, Jan Kiszka wrote:
> On 14.11.19 00:25, Pawan Gupta wrote:
> > On Wed, Nov 13, 2019 at 09:23:30AM +0100, Paolo Bonzini wrote:
> > > On 13/11/19 07:38, Jan Kiszka wrote:
> > > > When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
> > > > couldn't simply be handled by the host. But I suppose the symptom of
> > > > that erratum is not "just" regular recoverable MCE, rather
> > > > sometimes/always an unrecoverable CPU state, despite the error code, right?
> > >
> > > The erratum documentation talks explicitly about hanging the system, but
> > > it's not clear if it's just a result of the OS mishandling the MCE, or
> > > something worse. So I don't know. :( Pawan, do you?
> >
> > As Dave mentioned in the other email its "something worse".
> >
> > Although this erratum results in a machine check with the same MCACOD
> > signature as an SRAR error (0x150) the MCi_STATUS.PCC bit will be set to
> > one. The Intel Software Developers manual says that PCC=1 errors are
> > fatal and cannot be recovered.
> >
> > 15.10.4.1 Machine-Check Exception Handler for Error Recovery [1]
> >
> > [...]
> > The PCC flag in each IA32_MCi_STATUS register indicates whether recovery
> > from the error is possible for uncorrected errors (UC=1). If the PCC
> > flag is set for enabled uncorrected errors (UC=1 and EN=1), recovery is
> > not possible.
> >
>
> And, as Dave observed, even that event is not delivered to software (maybe
> just logged by firmware for post-reset analysis) but can or does cause a
> machine lock-up, right?

It can either cause a machine lock-up or a reset and the event delivery
to the software is not guaranteed.

Thanks,
Pawan
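
For illustration only, here is a minimal C sketch of the PCC rule quoted from the SDM above. It is not the kernel's actual MCE handler. The macro and function names are hypothetical; the bit positions (VAL=63, UC=61, EN=60, PCC=57, MCACOD in bits 15:0) follow the IA32_MCi_STATUS layout in the SDM. The VAL check is an addition beyond the quoted text, on the assumption that only a valid status register should be interpreted.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS fields (names here are hypothetical, for this sketch). */
#define MCI_STATUS_VAL         (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC          (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN          (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_PCC         (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_MCACOD_MASK 0xffffULL     /* MCA error code, bits 15:0 */

/* MCACOD signature mentioned in the thread (same as an SRAR error). */
#define MCACOD_0150            0x150ULL

/*
 * True when recovery is not possible according to the quoted rule:
 * PCC set for an enabled uncorrected error (UC=1 and EN=1).
 * Requiring VAL=1 is an assumption added by this sketch.
 */
static bool mci_status_is_fatal(uint64_t status)
{
    const uint64_t need = MCI_STATUS_VAL | MCI_STATUS_UC |
                          MCI_STATUS_EN | MCI_STATUS_PCC;

    return (status & need) == need;
}

/* True when a valid status carries the 0x150 MCACOD signature. */
static bool mci_status_has_mcacod_0150(uint64_t status)
{
    return (status & MCI_STATUS_VAL) &&
           (status & MCI_STATUS_MCACOD_MASK) == MCACOD_0150;
}
```

The point of keeping the two checks separate is that the MCACOD alone does not distinguish the erratum from a recoverable SRAR error; the PCC bit does. As noted above, the event may never reach software at all, so a check like this only helps in the cases where the machine check is actually delivered.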