Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

Dean Nelson <dnelson@xxxxxxxxxx> · Thu, 07 Oct 2010 10:23:43 -0500

On 10/06/2010 10:41 PM, Hidetoshi Seto wrote:
(2010/10/07 3:10), Dean Nelson wrote:
On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
I have some more questions:

(2010/10/05 3:54), Marcelo Tosatti wrote:
Index: qemu/target-i386/cpu.h
===================================================================

--- qemu.orig/target-i386/cpu.h
+++ qemu/target-i386/cpu.h
@@ -250,16 +250,32 @@
   #define PG_ERROR_RSVD_MASK 0x08
   #define PG_ERROR_I_D_MASK  0x10

-#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
+#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
+#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */

-#define MCE_CAP_DEF    MCG_CTL_P
+#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
   #define MCE_BANKS_DEF    10


It seems that current kvm doesn't support SER_P, so injecting SRAO
into the guest will mean that the guest receives a VAL|UC|!PCC and RIPV
event from a virtual processor that doesn't have SER_P.
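
A minimal sketch of the capability check under discussion, assuming the
guest decides whether SER_P semantics apply by testing bit 24 of MCG_CAP.
The macro values are copied from the hunk above. The helper names
mcg_cap_has_ser() and mcg_cap_banks() are hypothetical, and the bank count
sitting in bits 7:0 is my understanding of the MCG_CAP layout, not
something stated in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch hunk above. */
#define MCG_CTL_P     (1ULL << 8)
#define MCG_SER_P     (1ULL << 24)
#define MCE_BANKS_DEF 10

/* Hypothetical helper: does this MCG_CAP value advertise SER_P? */
static bool mcg_cap_has_ser(uint64_t mcg_cap)
{
    return (mcg_cap & MCG_SER_P) != 0;
}

/* Hypothetical helper: bank count, assumed to live in bits 7:0. */
static unsigned mcg_cap_banks(uint64_t mcg_cap)
{
    return (unsigned)(mcg_cap & 0xff);
}
```

Assuming the defaults are combined as MCE_CAP_DEF | MCE_BANKS_DEF, the old
definition would make mcg_cap_has_ser() return false and the patched one
would make it return true, with 10 banks in both cases.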

Dean also noted this. I don't think it was a deliberate choice not to
expose SER_P. Huang?

In my testing, I found that MCG_SER_P was not being set (and I was
running on a Nehalem-EX system). Injecting an MCE resulted in the
guest entering into panic() from mce_panic(). If crash_kexec()
finds a kexec_crash_image the system ends up rebooting, otherwise,
what happens next requires operator intervention.

Good to know.
What I'm concerned about is that if a memory-scrubbing SRAO event is
injected when !SER_P, a Linux guest with a certain MCE tolerant level
might grade it as "UC" severity and continue running with no
panicking, killing, or poisoning, because of !PCC and RIPV.

Could you provide the panic message of the guest in your test?
I think it can tell me why the MCE handler decided to panic.

Sure, I'll add the info below at the end of this email.


When I applied a patch to the guest's kernel which forces mce_ser to be
set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
that when the memory page was 'owned' by a guest process, the process
would be killed (if the page was dirty), and the guest would stay
running. The HWPoisoned page would be sidelined and not cause any more
issues.

Excellent.
So as long as the guest kernel knows which page is poisoned, guest
processes are kept from touching that page.

... Therefore rebooting the VM and loading a new kernel will lose the
information about which page is poisoned.

Correct.


I think most OSes don't expect to receive an MCE with !PCC
on a traditional x86 processor without SER_P.

Q1: Is it safe to expect that guests can handle such !PCC event?

This might be best answered by Huang, but as I mentioned above, without
MCG_SER_P being set, the result was an orderly system panic on the
guest.

Though I'll wait for Huang (I think he is on holiday), I believe that
a system panic is just one possible option for an AO (Action Optional)
event, regardless of SER_P.

I think you may be correct, but Huang will know for sure.


Q2: What is the expected behavior on the guest?

I think I answered this above.

Yeah, thanks.


Q3: What happens if the guest reboots itself in response to the MCE?

That depends...

And the following issue also holds for a guest that is rebooted at
some point having successfully sidelined the bad page.

After the guest has panic'd, a system_reset of the guest or a restart
initiated by crash_kexec() (called by panic() on the guest), usually
results in the guest hanging because the bad page still belongs
to qemu-kvm and is now being referenced by the new guest in some way.

Yes. In other words, my concern about reboot is that a new guest kernel,
including a kdump kernel, might try to read the bad page.  If there is
no AR-SIGBUS etc., we need some tricks to inhibit such accesses.

Agreed.


(It actually may not hang, but successfully reboot and be runnable,
with the bad page lurking in the background. It all seems to depend on
where the bad page ends up, and whether it's ever referenced.)

I know some tough guys using their PC with buggy DIMMs :-)


I believe there was an attempt to deal with this in kvm on the host.
See kvm_handle_bad_page(). This function was supposed to result in the
sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
which in theory would result in the right thing happening. But commit
96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
sent. So this mechanism needs to be re-worked, and the issue remains.

Definitely.
I guess Huang has some plan or hint for reworking this point.

Yeah, as far as I know Huang is looking into this.


I would think that if the bad page can't be sidelined, such that
the newly booting guest can't use it, then the new guest shouldn't be
allowed to boot. But perhaps there is some merit in letting it try to
boot and see if one gets 'lucky'.

When booting a real machine in the real world, hardware and firmware
usually (or often) run a self-test before passing control to the OS.
Some platforms can boot the OS with a degraded configuration (for
example, less memory) if a component has trouble.  Some BIOSes may
stop booting and show messages like "please reseat [component]" on the
screen.  So we could implement such a mechanism in qemu, or request one.

I can understand the merit you mentioned here, to some degree. But I
think it is hard to say "unlucky" to a customer in business...

I totally agree.


I understand that Huang is looking into what should be done. He can
give you better information than I in answer to your questions.

Agreed. Thank you very much!

You're welcome.

Dean

Thanks,
H.Seto


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The test I'm running is the mce-test suite's kvm test. A portion of
the messages it outputted (to stdout) follows:

Guest physical address is 0x71220000
Host virtual address is 7f9dc5020
Host physical address is 0x1051620000
Guest physical klog address is 0x71220

And it called mce-inject with the following data file:

[root@intel-s3e36-02 test]# cat SRAO
CPU 0 BANK 2
STATUS UNCORRECTED SRAO 0x17a
MCGSTATUS MCIP RIPV
MISC 0x8c
ADDR 0x1051620000
[root@intel-s3e36-02 test]#

The following is from the host's /var/log/messages:

Oct  7 09:42:48 intel-s3e36-02 kernel: Triggering MCE exception on CPU 0
Oct  7 09:42:48 intel-s3e36-02 kernel: Machine check events logged
Oct  7 09:42:48 intel-s3e36-02 kernel: MCE exception done on CPU 0
Oct  7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: Killing qemu-system-x86:6867 early due to hardware memory corruption
Oct  7 09:42:48 intel-s3e36-02 kernel: MCE 0x1051620: dirty LRU page recovery: Recovered
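
The number in the "MCE 0x1051620:" lines appears to be a page frame number
rather than a byte address. A small sketch of the arithmetic, assuming
4 KiB base pages (PAGE_SHIFT_4K and addr_to_pfn() are hypothetical names
used only for illustration):

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, so the low 12 bits are the page offset. */
#define PAGE_SHIFT_4K 12

/* Hypothetical helper: physical byte address -> page frame number. */
static uint64_t addr_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> PAGE_SHIFT_4K;
}
```

Under that assumption, the host physical address 0x1051620000 maps to
frame 0x1051620, and the guest physical address 0x71220000 maps to
0x71220, which matches the "klog address" printed by the test.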

Lastly, the following is a screen grab from the guest's serial console:

HARDWARE ERROR
CPU 0: Machine Check Exception:                5 Bank 9: bd000000000000c0
RIP !INEXACT! 33:<0000000000400428>
TSC 17a67acd14 ADDR 71220000 MISC 8c
PROCESSOR 0:6d3 TIME 1286458966 SOCKET 0 APIC 0
No human readable MCE decoding support on this CPU type.
Run the message through 'mcelog --ascii' to decode.
This is not a software problem!
Machine check: Uncorrected
Kernel panic - not syncing: Fatal machine check on current CPU
Pid:1493, comm: simple_process Tainted: B   M        ----------------  2.6.32.dnelson_test #48

Call Trace:
 <#MC>  [<ffffffff814c7c8d>] panic+0x78/0x137
 [<ffffffff81027382>] mce_panic+0x1e2/0x210
 [<ffffffff81028873>] do_machine_check+0x843/0xa70
 [<ffffffff814cb0cc>] machine_check+0x1c/0x30
 <<EOE>>
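
For readers decoding the console output by hand, here is a hedged sketch
that splits the bank status word and the MCG status value into the flags
discussed earlier in the thread. The bit positions are my understanding of
the architectural MCi_STATUS and MCG_STATUS layouts (they are not given in
this thread), and struct mci_flags and decode_mci_status() are hypothetical
names.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS flag bits (assumed architectural layout). */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59)  /* MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S     (1ULL << 56)  /* signaled via machine check */
#define MCI_STATUS_AR    (1ULL << 55)  /* action required */

/* MCG_STATUS bits (assumed architectural layout). */
#define MCG_STATUS_RIPV  (1ULL << 0)   /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)   /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)   /* machine check in progress */

/* Hypothetical container for the decoded flags. */
struct mci_flags {
    bool val, over, uc, en, miscv, addrv, pcc, s, ar;
    uint16_t mca_code;                 /* low 16 bits: MCA error code */
};

/* Hypothetical helper: split a raw MCi_STATUS value into flags. */
static struct mci_flags decode_mci_status(uint64_t status)
{
    struct mci_flags f;

    f.val      = (status & MCI_STATUS_VAL) != 0;
    f.over     = (status & MCI_STATUS_OVER) != 0;
    f.uc       = (status & MCI_STATUS_UC) != 0;
    f.en       = (status & MCI_STATUS_EN) != 0;
    f.miscv    = (status & MCI_STATUS_MISCV) != 0;
    f.addrv    = (status & MCI_STATUS_ADDRV) != 0;
    f.pcc      = (status & MCI_STATUS_PCC) != 0;
    f.s        = (status & MCI_STATUS_S) != 0;
    f.ar       = (status & MCI_STATUS_AR) != 0;
    f.mca_code = (uint16_t)(status & 0xffff);
    return f;
}
```

Applied to the status word bd000000000000c0 from the console, this gives
VAL, UC, EN, MISCV, ADDRV and S set, with OVER, PCC and AR clear, and an
MCA error code of 0xc0. The "5" printed before the bank number would then
be MCG_STATUS_RIPV | MCG_STATUS_MCIP, consistent with the "MCGSTATUS MCIP
RIPV" line in the injection file.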

