Hi James/Peter, thanks for this discussion, and sorry for my late response due to vacation.

On 2018/12/22 2:17, James Morse wrote:
> Hi Peter,
>
> On 19/12/2018 19:02, Peter Maydell wrote:
>> On Mon, 17 Dec 2018 at 15:56, James Morse <james.morse@xxxxxxx> wrote:
>>> I don't think this really matters. It's only the NMI-like notifications that the
>>> guest doesn't have to register or poll. The ones we support today extend the
>>> architecture's existing behaviour: you would have taken an external-abort on a
>>> real system; whether you know about the additional metadata doesn't matter to Qemu.
>>
>> Consider the case where we booted the guest using a DTB and no ACPI
>> table at all -- we certainly can't just call QEMU code that tries to
>> add entries to a nonexistent table.
>
> Sure, because you know which of the two sets of firmware-table you're providing.
>
> I'm taking the behaviour of physical machines as the template for what we should
> do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did
> this; it will still take DRAM uncorrected-error IRQs in firmware, and generate
> CPER records in the POLLed areas. But the kernel will never look, because it
> booted with DT.
> What happens if the kernel goes on to access the corrupt location? It either
> gets corrupt values back, or an external abort, depending on the design of the
> memory-controller.
>
> X-gene uses an IRQ for its firmware-first notification. Booted with DT that
> interrupt can be asserted, but as the OS didn't know to register it, it's never
> taken. We eventually get the same corrupt-values/external-abort behaviour.
>
> KVM/Linux is acting as the memory controller using stage2. When an error is
> detected by the host it unmaps the page from stage2, and refuses to map it again
> until it's fixed up in Qemu's memory map (which can happen automatically). If the
> kernel can't fix it itself, the AO signal is like the DRAM-IRQ above, and the AR
> like the external abort.
> We don't have a parallel to the 'gets corrupt values back' behaviour, as Linux
> will always unmap hwpoison pages from user-space/guests.
>
> If the host kernel wasn't built with CONFIG_MEMORY_FAILURE, it's as if the memory
> controller doesn't support any of the above. I think knowing this is the closest
> to what you want.
>
>
>> My main point is that there
>> needs to be logic in Dongjiu's QEMU patches that checks more than
>> just "does this KVM feature exist". I'm not sufficiently familiar
>> with all this RAS stuff to be certain what those checks should
>> be and what the right choices are; I just know we need to check
>> *something*...
>
> I think this is the crux of where we don't see this the same way.
> The v8.2 RAS stuff is new; RAS support on arm64 is not. Kernel support arrived
> at roughly the same time, but not CPU support. There are v8.0 systems that
> support RAS. There are DT systems that can do the same with edac drivers.
> The physical v8.0 systems that do this are doing it without any extra CPU support.
>
> I think x86's behaviour here includes some history, which we don't have.
> From the order of the HEST entries, it looks like the machine-check stuff came
> first, then firmware-first using a 'GHES' entry in that table.
> I think Qemu on x86 only supports the emulated machine-check stuff, so it needs
> to check KVM has the widget to do this.
> If Qemu on x86 supported firmware-first, I don't think there would be anything
> to check. (details below)

Peter, to summarize James's main idea: he thinks QEMU does not need to check
*something* if QEMU supports firmware-first. What is your opinion on his comments?

>
>
>>>> Let us see the X86 QEMU logic:
>>>> 1. Before the vCPU is created, it will set a default env->mcg_cap value with
>>>> the MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support
>>>> RAS error recovery.[1]
>>>> 2. When the vCPU is initialized, it will check whether the host
>>>> kernel supports this feature[2].
>>>> Only when both the host kernel and the default env->mcg_cap value expect this
>>>> feature will it set up vCPU support for RAS error recovery[3].
>>>
>>> This looks like KVM exposing a CPU capability to Qemu, which then configures the
>>> behaviour KVM gives to the guest. This doesn't tell you anything about what the
>>> guest supports.
>>
>> It tells you what the *guest CPU* supports, which for x86 is a combination
>> of (a) what did the user/machine model ask for and (b) what can KVM
>> actually implement. I don't much care whether the guest OS supports
>> anything or not, that's its business... but it does seem odd to me
>> that the equivalent Arm code is not similarly saying "what were we
>> asked for, and what can we do?".
>
> The flow is something like:
> For AO, generate CPER records, and notify the OS via NOTIFY_POLL (which isn't
> really a notification) or some flavour of IRQ.
> To do this, Qemu needs to be able to write to its reserved area of guest memory,
> and possibly trigger an interrupt.
>
> For AR, generate CPER records and notify the OS via external abort. (The
> presence of the CPER records makes this NOTIFY_SEA or NOTIFY_SEI.)
> To do this, Qemu again needs to be able to write to guest memory and set guest
> registers (KVM_SET_ONE_REG()). If it wants to inject an
> SError-Interrupt/Asynchronous-external-abort while the guest has it masked, it
> needs KVM_SET_VCPU_EVENTS().
>
> Nothing here depends on the CPU or kernel configuration. This is all ACPI stuff,
> so it's the same on x86. (The only difference is external-abort becomes NMI,
> which is probably done through SET_VCPU_EVENTS().)
>
> What were we asked for? Qemu wants to know if it can write to guest memory and
> guest registers (for synchronous external abort) and trigger interrupts. It has
> always been able to do these things.
>
>
>> I think one question here which it would be good to answer is:
>> if we are modelling a guest and we haven't specifically provided
>> it an ACPI table to tell it about memory errors, what do we do
>> when we get a SIGBUS from the host? We have basically two choices:
>> (1) send the guest an SError (aka asynchronous external abort)
>> anyway (with no further info about what the memory error is)
>
> For an AR signal an external abort is valid. It's up to the implementation
> whether these are synchronous or asynchronous. Qemu can only take a signal for
> something that was synchronous, so you can choose between the two.
> Synchronous external abort is marginally better, as an unaware OS knows it
> affects this thread, and may be able to kill it.
> SError with an imp-def ESR is indistinguishable from 'part of the SoC fell out',
> and should always result in a panic().
>
>
>> (2) just stop QEMU (as we would for a memory error in QEMU's
>> own memory)
>
> This is also valid. A machine may take external-abort to EL3 and then
> reboot/crash/burn.
>
>
> Just in case this is the deeper issue: I keep picking on memory errors, what
> about CPU errors?
> Linux can't handle these at all, unless they are also memory errors. If we take
> an imprecise abort from a guest, KVM can't tell Qemu using signals. We don't have
> any mechanism to tell user-space about imprecise exceptions. In this case KVM
> throws an imp-def SError back at the affected vcpu; these are allowed to be
> imprecise, as this is the closest thing we have.
>
> This does mean that any AO/AR signal Qemu gets is a memory error.
>
>
> Happy New Year,
>
> James