On Thu, Sep 17, 2009 at 09:13:29AM +0800, Huang Ying wrote:
> On Thu, 2009-09-17 at 01:59 +0800, Marcelo Tosatti wrote:
> > On Wed, Sep 09, 2009 at 10:28:02AM +0800, Huang Ying wrote:
> > > UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
> > > where some hardware error such as some memory error can be reported
> > > without PCC (processor context corrupted). To recover from such MCE,
> > > the corresponding memory will be unmapped, and all processes accessing
> > > the memory will be killed via SIGBUS.
> > >
> > > For KVM, if QEMU/KVM is killed, all guest processes will be killed
> > > too. So we relay SIGBUS from host OS to guest system via a UCR MCE
> > > injection. Then guest OS can isolate corresponding memory and kill
> > > necessary guest processes only. SIGBUS sent to main thread (not VCPU
> > > threads) will be broadcast to all VCPU threads as UCR MCE.
> > >
> > > v2:
> > >
> > > - Use qemu_ram_addr_from_host instead of self made one to convert from
> > >   host address to guest RAM address. Thanks Anthony Liguori.
> > >
> > > Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
> > >
> > > ---
> > >  cpu-common.h      |    1
> > >  exec.c            |   20 +++++--
> > >  qemu-kvm.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  target-i386/cpu.h |   20 ++++++-
> > >  4 files changed, 178 insertions(+), 17 deletions(-)
> > >
> > > --- a/qemu-kvm.c
> > > +++ b/qemu-kvm.c
> > > @@ -27,10 +27,23 @@
> > >  #include <sys/mman.h>
> > >  #include <sys/ioctl.h>
> > >  #include <signal.h>
> > > +#include <sys/signalfd.h>
> > > +#include <sys/prctl.h>
> > >
> > >  #define false 0
> > >  #define true 1
> > >
> > > +#ifndef PR_MCE_KILL
> > > +#define PR_MCE_KILL 33
> > > +#endif
> > > +
> > > +#ifndef BUS_MCEERR_AR
> > > +#define BUS_MCEERR_AR 4
> > > +#endif
> > > +#ifndef BUS_MCEERR_AO
> > > +#define BUS_MCEERR_AO 5
> > > +#endif
> > > +
> > >  #define EXPECTED_KVM_API_VERSION 12
> > >
> > >  #if EXPECTED_KVM_API_VERSION != KVM_API_VERSION
> > > @@ -1507,6 +1520,37 @@ static void sig_ipi_handler(int n)
> > >  {
> > >  }
> > >
> > > +static void sigbus_handler(int n, struct signalfd_siginfo *siginfo, void *ctx)
> > > +{
> > > +    if (siginfo->ssi_code == BUS_MCEERR_AO) {
> > > +        uint64_t status;
> > > +        unsigned long paddr;
> > > +        CPUState *cenv;
> > > +
> > > +        /* Hope we are lucky for AO MCE */
> > > +        if (do_qemu_ram_addr_from_host((void *)siginfo->ssi_addr, &paddr)) {
> > > +            fprintf(stderr, "Hardware memory error for memory used by "
> > > +                    "QEMU itself instead of guest system!: %llx\n",
> > > +                    (unsigned long long)siginfo->ssi_addr);
> > > +            return;
> >
> > qemu-kvm should die here?
>
> There are two kinds of UCR MCE. One is triggered by user space/guest
> read/write, the other is triggered by asynchronously detected error
> (e.g. patrol scrubbing). The latter one is reported as AO (Action
> Optional) MCE, and it has nothing to do with current path. So if we are
> lucky enough, we can survive.
> And when we finally touch the error memory
> reported by AO MCE, another AR (Action Required) MCE will be triggered.
> We have another chance to deal with it.

OK.

> > > +        }
> > > +        status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
> > > +            | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
> > > +            | 0xc0;
> > > +        kvm_inject_x86_mce(first_cpu, 9, status,
> > > +                           MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
> > > +                           (MCM_ADDR_PHYS << 6) | 0xc);
> > > +        for (cenv = first_cpu->next_cpu; cenv != NULL; cenv = cenv->next_cpu)
> > > +            kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
> > > +                               MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0);
> > > +        return;
> >
> > Should abort if kvm_inject_x86_mce fails?
>
> kvm_inject_x86_mce will abort by itself.

OK.

> > > +    } else if (siginfo->ssi_code == BUS_MCEERR_AR)
> > > +        fprintf(stderr, "Hardware memory error!\n");
> > > +    else
> > > +        fprintf(stderr, "Internal error in QEMU!\n");
> >
> > Can you re-raise SIGBUS so we get a coredump on non-MCE SIGBUS as
> > usual?
>
> We discussed this before. Copied below, please comment on the comments
> below, :)
>
> Avi:
> (also, if we can't handle guest-mode SIGBUS I think it would be nice
> to raise it again so the process terminates due to the SIGBUS).
>
> Huang Ying:
> For SIGBUS we can not relay to guest as MCE, we can either abort or
> reset SIGBUS to SIG_DFL and re-raise it. Both are OK for me. You prefer
> the latter one?
>
> Andi:
> I think a suitable error message and exit would be better than a plain
> signal kill. It shouldn't look like qemu crashed due to a software
> bug. Ideally an error message in a way that it can be parsed by libvirt
> etc. and reported in a suitable way.
>
> However qemu getting killed itself is very unlikely, it doesn't
> have much memory foot print compared to the guest and other data.
> So this should be a very rare condition.
>
> Avi:
> libvirt etc. can/should wait() for qemu to terminate abnormally and
> report the reason why.
> However it doesn't seem there is a way to get
> extended signal information from wait(), so it looks like internal
> handling by qemu is better.

I'm not talking about SIGBUS generated by MCE. What I mean is, for
SIGBUS signals that are not due to MCE errors, the current behaviour is
to generate a core dump (which is useful information for debugging).

With your patch, qemu-kvm handles the signal and prints a message before
exiting. This is annoying. It seems the discussion above is about SIGBUS
initiated by MCE errors.