Re: MCG_CAP ABI breakage (was Re: [Qemu-devel] [PATCH] target-i386: Do not set MCG_SER_P by default)

Borislav Petkov <bp@xxxxxxxxx> · Tue, 24 Nov 2015 19:44:19 +0100

On Tue, Nov 24, 2015 at 02:36:20PM -0200, Eduardo Habkost wrote:
> KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
> calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
> IA32_MCi_{STATUS,ADDR,MISC} register contents at
> vcpu->arch.mce_banks.

Ah, correct. I've mistakenly followed KVM_X86_SETUP_MCE and not
KVM_X86_SET_MCE, sorry.

Ok, so this makes more sense now - there's kvm_inject_mce_oldstyle() in
qemu, and kvm_arch_on_sigbus_vcpu(), which is on the SIGBUS handler path,
actually does:

    if ((env->mcg_cap & MCG_SER_P) && addr
        && (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO)) {
	    ...

I betcha that MCG_SER_P is set on every guest, even !Intel ones. I need
to go stare more at that code.

> I didn't check the QEMU MCE code to confirm that, but I assume it
> is implemented there. In that case, MCG_SER_P in
> KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
> userspace, as long as it makes the appropriate KVM_X86_SET_MCE
> (or maybe KVM_SET_MSRS?) calls.

I think it is that kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus()
which handles SIGBUS with BUS_MCEERR_AR/BUS_MCEERR_AO si_code. See
mm/memory-failure.c:kill_proc() in the kernel where we do send those
signals to processes.

However, I still think the MCG_SER_P bit being set on
!Intel is wrong even though the recovery action done by
kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() is correct.

Why, you're asking. :-)

Well, what happens above is that the qemu process gets a signal telling
it that an uncorrectable error was detected in its memory, and either it
is required to do something (BUS_MCEERR_AR == Action Required) or its
action is optional (BUS_MCEERR_AO == Action Optional).

The SER_P text in the SDM describes those two:

"SRAO errors indicate that some data in the system is corrupt, but the
data has not been consumed and the processor state is valid. SRAO errors
provide the additional error information for system software to perform
a recovery action. An SRAO error is indicated with UC=1, PCC=0, S=1,
EN=1 and AR=0 in the IA32_MCi_STATUS register."

and

"Software recoverable action required (SRAR) - a UCR error that requires
system software to take a recovery action on this processor before
scheduling another stream of execution on this processor. SRAR errors
indicate that the error was detected and raised at the point of the
consumption in the execution flow. An SRAR error is indicated with UC=1,
PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register."

And for that we don't need to look at SER_P in qemu - we only need to
know the severity of the error and then we go and handle it
accordingly.

Because those two si_codes are purely software-defined. And the
application which gets that SIGBUS type doesn't need to care about
SER_P.

For example, AMD has similar error severities and they can be injected
into qemu too. And qemu can do the exact same recovery actions based on
the severity without even looking at the SER_P bit.

So here's the problem:

* SER_P is set on all guests and it puzzles kernels running on !Intel
guests.

* Hardware error recovery actions can be done regardless of that bit.

The only case where that bit makes sense is if the emulated hardware
itself is generating accurate MCEs and then, as a result, wants to
generate accurate error signatures:

SRAO:	UC=1, PCC=0, S=1, EN=1 and AR=0
SRAR:	UC=1, PCC=0, S=1, EN=1 and AR=1

Those bits should have these settings only when the emulated hw actually
implements SER_P. Otherwise, you'd get those old crude MCEs which are
either uncorrectable and generate an #MC or are correctable errors.
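
As a sketch of what "only when the emulated hw implements SER_P" means
for whoever decodes the status register: the S and AR bits are
interpreted only if MCG_SER_P is set in MCG_CAP, otherwise the error is
just corrected or uncorrected. The bit positions below are the
architectural ones (MCG_CAP bit 24, MCi_STATUS bits 63, 61, 60, 57, 56,
55); the enum and classify_mce() are hypothetical names for this
example, not existing qemu or kernel code.

```c
#include <stdint.h>

#define MCG_SER_P       (1ULL << 24)  /* IA32_MCG_CAP: software error recovery */
#define MCI_STATUS_VAL  (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN   (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_PCC  (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S    (1ULL << 56)  /* signaling, valid only with SER_P */
#define MCI_STATUS_AR   (1ULL << 55)  /* action required, valid only with SER_P */

/* Hypothetical severity names for illustration. */
enum mce_sev {
    MCE_SEV_NONE,         /* no valid error logged */
    MCE_SEV_CORRECTED,    /* UC=0 */
    MCE_SEV_UNCORRECTED,  /* old crude uncorrectable error */
    MCE_SEV_SRAO,         /* UC=1, PCC=0, S=1, EN=1, AR=0 */
    MCE_SEV_SRAR          /* UC=1, PCC=0, S=1, EN=1, AR=1 */
};

static enum mce_sev classify_mce(uint64_t mcg_cap, uint64_t status)
{
    const uint64_t mask = MCI_STATUS_UC | MCI_STATUS_PCC |
                          MCI_STATUS_S | MCI_STATUS_EN;
    const uint64_t sig  = MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_EN;

    if (!(status & MCI_STATUS_VAL))
        return MCE_SEV_NONE;
    if (!(status & MCI_STATUS_UC))
        return MCE_SEV_CORRECTED;

    /* S and AR are looked at only when SER_P is advertised. */
    if ((mcg_cap & MCG_SER_P) && (status & mask) == sig)
        return (status & MCI_STATUS_AR) ? MCE_SEV_SRAR : MCE_SEV_SRAO;

    return MCE_SEV_UNCORRECTED;
}
```

With mcg_cap lacking MCG_SER_P, the same status value that would read as
SRAR falls back to a plain uncorrected error.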

But ok, let me go do some staring at the examples you sent me
previously first. I might get a better idea after I sleep on it.

:-)

Thanks!

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.