Hi, Huang-san,

(2010/10/08 12:15), Huang Ying wrote:
> Hi, Seto,
>
> On Thu, 2010-10-07 at 11:41 +0800, Hidetoshi Seto wrote:
>> (2010/10/07 3:10), Dean Nelson wrote:
>>> On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
>>>> On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
>>>>> I got some more question:
>>>>>
>>>>> (2010/10/05 3:54), Marcelo Tosatti wrote:
>>>>>> Index: qemu/target-i386/cpu.h
>>>>>> ===================================================================
>>>>>> --- qemu.orig/target-i386/cpu.h
>>>>>> +++ qemu/target-i386/cpu.h
>>>>>> @@ -250,16 +250,32 @@
>>>>>>  #define PG_ERROR_RSVD_MASK 0x08
>>>>>>  #define PG_ERROR_I_D_MASK 0x10
>>>>>>
>>>>>> -#define MCG_CTL_P (1UL<<8) /* MCG_CAP register available */
>>>>>> +#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */
>>>>>> +#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */
>>>>>>
>>>>>> -#define MCE_CAP_DEF MCG_CTL_P
>>>>>> +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
>>>>>>  #define MCE_BANKS_DEF 10
>>>>>>
>>>>>
>>>>> It seems that current kvm doesn't support SER_P, so injecting SRAO
>>>>> to guest will mean that guest receives VAL|UC|!PCC and RIPV event
>>>>> from virtual processor that doesn't have SER_P.
>>>>
>>>> Dean also noted this. I don't think it was deliberate choice to not
>>>> expose SER_P. Huang?
>>>
>>> In my testing, I found that MCG_SER_P was not being set (and I was
>>> running on a Nehalem-EX system). Injecting a MCE resulted in the
>>> guest entering into panic() from mce_panic(). If crash_kexec()
>>> finds a kexec_crash_image the system ends up rebooting, otherwise,
>>> what happens next requires operator intervention.
>>
>> Good to know.
>> What I'm concerning is that if memory scrubbing SRAO event is
>> injected when !SER_P, linux guest with certain mce tolerant level
>> might grade it as "UC" severity and continue running with none of
>> panicking, killing and poisoning because of !PCC and RIPV.
>>
>> Could you provide the panic message of the guest in your test?
>> I think it can tell me why the mce handler decided to go panic.
>
> That is a bug that the SER_P is not in KVM_MCE_CAP_SUPPORTED in kernel.
> I will fix it as soon as possible. And SRAO MCE should not be sent
> when !SER_P, we should add that condition in qemu-kvm.

That makes sense.
I think deciding what follows the AO-SIGBUS is qemu's responsibility,
and the action to take depends on KVM's capability.

>>> When I applied a patch to the guest's kernel which forces mce_ser to be
>>> set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
>>> that when the memory page was 'owned' by a guest process, the process
>>> would be killed (if the page was dirty), and the guest would stay
>>> running. The HWPoisoned page would be sidelined and not cause any more
>>> issues.
>>
>> Excellent.
>> So while guest kernel knows which page is poisoned, guest processes
>> are controlled not to touch the page.
>>
>> ... Therefore rebooting the vm and renewing kernel will lost the
>> information where is poisoned.
>
> Yes. That is an issue. Dean suggests that make qemu-kvm to refuse reboot
> the guest if there is poisoned page and ask for user to intervention. I
> have another idea to replace the poison pages with good pages when
> reboot, that is, recover without user intervention.

Sounds good.
I think it may be worthwhile to reserve pages for the replacement
before a reboot is requested; at the least, we really don't want the
reboot to fail with 'no memory'.

>>>>> I think most OSes don't expect that it can receives MCE with !PCC
>>>>> on traditional x86 processor without SER_P.
>>>>>
>>>>> Q1: Is it safe to expect that guests can handle such !PCC event?
>>>
>>> This might be best answered by Huang, but as I mentioned above, without
>>> MCG_SER_P being set, the result was an orderly system panic on the
>>> guest.
>>
>> Though I'll wait Huang (I think he is on holiday), I believe that
>> system panic is just a possible option for AO (Action Optional)
>> event, no matter how the SER_P is.
>
> We should fix this as I said above.
>
>>>>> Q2: What is the expected behavior on the guest?
>>>
>>> I think I answered this above.
>>
>> Yeah, thanks.
>>
>>>
>>>>> Q3: What happen if guest reboots itself in response to the MCE?
>>>
>>> That depends...
>>>
>>> And the following issue also holds for a guest that is rebooted at
>>> some point having successfully sidelined the bad page.
>>>
>>> After the guest has panic'd, a system_reset of the guest or a restart
>>> initiated by crash_kexec() (called by panic() on the guest), usually
>>> results in the guest hanging because the bad page still belongs
>>> to qemu-kvm and is now being referenced by the new guest in some way.
>>
>> Yes. In other words my concern about reboot is that new guest kernel
>> including kdump kernel might try to read the bad page. If there is
>> no AR-SIGBUS etc., we need some tricks to inhibit such accesses.
>>
>>> (It actually may not hang, but successfully reboot and be runnable,
>>> with the bad page lurking in the background. It all seems to depend on
>>> where the bad page ends up, and whether it's ever referenced.)
>>
>> I know some tough guys using their PC with buggy DIMMs :-)
>>
>>>
>>> I believe there was an attempt to deal with this in kvm on the host.
>>> See kvm_handle_bad_page(). This function was suppose to result in the
>>> sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
>>> which in theory would result in the right thing happening. But commit
>>> 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
>>> sent. So this mechanism needs to be re-worked, and the issue remains.
>>
>> Definitely.
>> I guess Huang has some plan or hint for rework this point.
>
> Yes. This should be fixed.
> The SRAR SIGBUS should be sent directly
> instead of being sent via touching poisoned virtual address.

Good. It should work.

>>> I would think that if the the bad page can't be sidelined, such that
>>> the newly booting guest can't use it, then the new guest shouldn't be
>>> allowed to boot. But perhaps there is some merit in letting it try to
>>> boot and see if one gets 'lucky'.
>>
>> In case of booting a real machine in real world, hardware and firmware
>> usually (or often) do self-test before passing control to OS.
>> Some platform can boot OS with degraded configuration (for example,
>> fewer memory) if it has trouble on its component. Some BIOS may
>> stop booting and show messages like "please reseat [component]" on the
>> screen. So we could implement/request qemu to have such mechanism.
>>
>> I can understand the merit you mentioned here, in some degree. But I
>> think it is hard to say "unlucky" to customer in business...
>
> Because the contents of poisoned pages are not relevant after reboot.
> Qemu can replace the poisoned pages with good pages when reboot guest.
> Do you think that is good.

Sure.
Of course this trick will not be needed if the user has migrated or
saved/restored the guest before a reboot.

Thank you for answering!

Thanks,
H.Seto
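
For reference, the condition discussed in this thread (do not send an SRAO MCE to a guest that lacks SER_P) could be sketched roughly as below. This is only an illustration in plain C, not qemu-kvm code. The MCG_* values are taken from the quoted cpu.h patch, the MCI_STATUS_* bit positions are assumed from the Intel SDM layout of IA32_MCi_STATUS, and srao_status() and can_inject_srao() are hypothetical helper names that do not exist in qemu-kvm.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

/* Capability bits, as in the quoted cpu.h patch. */
#define MCG_CTL_P   (1ULL << 8)
#define MCG_SER_P   (1ULL << 24)  /* MCA recovery/new status bits */
#define MCE_CAP_DEF (MCG_CTL_P | MCG_SER_P)

/* IA32_MCi_STATUS bits (assumed from the Intel SDM). */
#define MCI_STATUS_VAL (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_UC  (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN  (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled via MCE (needs SER_P) */
#define MCI_STATUS_AR  (1ULL << 55)  /* action required (needs SER_P) */

/* With the patch applied, the default capability advertises SER_P. */
static_assert((MCE_CAP_DEF & MCG_SER_P) != 0,
              "MCE_CAP_DEF is expected to include MCG_SER_P");

/* Hypothetical helper: status value of an SRAO event,
 * i.e. VAL|UC|EN|S with PCC and AR both clear. */
static inline uint64_t srao_status(void)
{
    return MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_S;
}

/* Hypothetical helper: an SRAO event is only meaningful to a guest
 * whose virtual MCG_CAP advertises SER_P; otherwise the guest sees
 * an unexpected UC event with !PCC. */
static inline bool can_inject_srao(uint64_t guest_mcg_cap)
{
    return (guest_mcg_cap & MCG_SER_P) != 0;
}
```

A caller handling the AO-SIGBUS would check can_inject_srao() against the guest's MCG_CAP first. What to do when it returns false (for example, only recording the poisoned page) is a policy choice that this sketch leaves open.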