> > On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@xxxxxxx> wrote: > > > > Hi Dongjiu Geng, > > > > On 14/12/2018 10:15, Dongjiu Geng wrote: > > > When user space do memory recovery, it will check whether KVM and > > > guest support the error recovery, only when both of them support, > > > user space will do the error recovery. This patch exports this > > > capability of KVM to user space. > > > > I can understand user-space only wanting to do the work if host and > > guest support the feature. But 'error recovery' isn't a KVM feature, > > its a Linux kernel feature. > > > > KVM will send it's user-space a SIGBUS with MCEERR code whenever its > > trying to map a page at stage2 that the kernel-mm code refuses this because its poisoned. > > (e.g. check_user_page_hwpoison(), get_user_pages() returns > > -EHWPOISON) > > > > This is exactly the same as happens to a normal user-space process. > > > > I think you really want to know if the host kernel was built with > > CONFIG_MEMORY_FAILURE. > > Does userspace need to care about that? Presumably if the host kernel > wasn't built with that support then it will simply never deliver any memory failure events to QEMU, which is fine. > > The point I was trying to make in the email Dongjiu references > (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel" > does not imply "the guest is prepared to receive memory failure > notifications", and so the code path which handles the SIGBUS must do > some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to fill in. Thanks Peter's explanation. Frankly speaking, I agree Peter's suggestion. To James, I explain more to you, as peter said QEMU needs to check whether the guest CPU is a type which can handle the error though guest ACPI table. Let us see the X86's QEMU logic: 1. Before the vCPU created, it will set a default env->mcg_cap value with MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports RAS error recovery.[1] 2. when the vCPU initialize, it will check whether host kernel support this feature[2]. Only when host kernel and default env->mcg_cap value all expected this feature, then it will setup vCPU support RAS error recovery[3]. So I add this IOCTL "KVM_CAP_ARM_MEMORY_ERROR_RECOVERY" to Let QEMU check whether host/KVM support RAS error detection and recovery, only when this supports, QEMU will do the error recovery for the guest memory. [1] #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF | (cpu->enable_lmce ? MCG_LMCE_P : 0); [2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks); [3] env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK; ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap); -------------------------------------For James's comments--------------------------------------------------------------------- > KVM doesn't detect these errors. > The hardware detects them and notifies the OS via one of a number of mechanisms. > This gets plumbed into memory_failure(), which sets a flag that the mm > code uses to prevent the page being used again. > KVM is only involved when it tries to map a page at stage2 and the mm > code rejects it with -EHWPOISON. This is the same as the architectures > do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of > handle_mm_fault(). We don't have a KVM cap for this, nor do we need one. ------------------------------------------------------------------------------------------------------------------------------ James, for your above comments, I completed understand, but KVM also delivered the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space this capability. ---------------------------------------------- For James's comments --------------------------------------------------- > The CPU RAS Extensions are not at all relevant here. It is perfectly > possible to support memory-failure without them, AMD-Seattle and > APM-X-Gene do this. These systems would report not-supported here, but the kernel does support this stuff. > Just because the CPU supports this, doesn't mean the kernel was built > with CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL. -------------------------------------------------------------------------------------------------------------------------------------- James, for your above comments, if you think we should not check the "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", which do you prefer we should check? In the X86 KVM code, it uses hardcode to tell use space the host/KVM support RAS error software recovery[4]. If KVM does not check the " cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we have to check the hardcode as X86's method. [4]: u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P; long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { ............................................................................ case KVM_X86_GET_MCE_CAP_SUPPORTED: { r = -EFAULT; if (copy_to_user(argp, &kvm_mce_cap_supported, sizeof(kvm_mce_cap_supported))) goto out; r = 0; break; } ....................................................................... } > > thanks > -- PMM