Hey Marc, On Tue, Sep 21, 2021 at 10:45:22AM +0100, Marc Zyngier wrote: > > > > > - How do you define which interrupts are actual wake-up events? > > > > > Nothing in the GIC architecture defines what a wake-up is (let alone > > > > > a wake-up event). > > > > > > > > Good point. > > > > > > > > One possible implementation of suspend could just be a `WFI` in a > > > > higher EL. In this case, KVM could emulate WFI wake up events > > > > according to D1.16.2 in DDI 0487G.a. But I agree, it isn't entirely > > > > clear what constitutes a wakeup from powered down state. > > > > > > It isn't, and it is actually IMPDEF (there isn't much in the ARM ARM > > > in terms of what constitutes a low power state). And even if you > > > wanted to emulate a WFI in userspace, the problem of interrupts that > > > have their source in the kernel remains. How to you tell userspace > > > that such an event has occurred if the vcpu thread isn't in the > > > kernel? > > > > Well, are there any objections to saying for the KVM implementation we > > observe the WFI wake-up events per the cited section of the ARM ARM? > > These are fine. However, what of the GIC, for example? Can any GIC > interrupt wake-up the guest? I'm happy to say "yes" to this, but I > suspect others will have a different idea, and the thought of > introducing an IMPDEF wake-up interrupt controller doesn't fill me > with joy. > I'm planning to propose exactly this in the next series; any GIC interrupt will wake the guest. I'd argue that if someone wants to do anything else, their window of opportunity is the exit to userspace. [...] > > Just to check understanding for v2: > > > > We agree that an exit to userspace is fine so it has the opportunity > > to do something crazy when the guest attempts a suspend. If a VMM does > > nothing and immediately re-enters the kernel, we emulate the suspend > > there by waiting for some event to fire, which for our purposes we > > will say is an interrupt originating from userspace or the kernel > > (WFI). In all, the SUSPEND exit type does not indicate that emulation > > terminates with the VMM. It only indicates we are about to block in > > the kernel. > > > > If there is some IMPDEF event specific to the VMM, it should signal > > the vCPU thread to kick it out of the kernel, make it runnable, and > > re-enter. No need to do anything special from the kernel perspective > > for this. This is only for the case where we decide to block in the > > kernel. > > This looks sensible. One question though: I think there is an implicit > requirement that the guest should be "migratable" in that state. How > does the above handles it? If the "suspend state" is solely held in > the kernel, we need to be able to snapshot it, and I don't like the > sound of that... > > We could instead keep the "suspend state" in the VMM: > > On PSCI_SUSPEND, the guest exits to userspace. If the VMM wants to > honour the supend request, it reenters the guest with RUN+SUSPEND, > which results in a WFI. On each wake-up, the guest exits to userspace, > and it is the VMM responsibility to either perform the wake-up (RUN) > or stay in suspend (RUN+SUSPEND). > > This ensures that the guest never transitions out of suspend without > the VMM knowing, and the VMM can always force a resume by kicking the > thread back to userspace. > > Thoughts? Agreed. I was mulling on exactly how to clue in the VMM about the suspend state. What if we just encode it in KVM_{GET,SET}_MP_STATE? We'd avoid the need for new UAPI that way. We could introduce a new state, KVM_MP_STATE_SUSPENDED, which would clue KVM to do the suspend as we've discussed. We would exit to userspace with KVM_MP_STATE_RUNNABLE, meaning the VMM would need to set the MP state explicitly for the in-kernel suspend to work. [...] > > > > On the contrary, it is up to KVM's implementation to > > > > guarantee caches are clean when servicing the guest request. > > > > > > This last point is pretty unclear to me. If the guest doesn't clean to > > > the PoC (or even to one of the PoPs) when it calls into suspend, > > > that's a clear indication that it doesn't care about its data. Why > > > should KVM be more conservative here? It shouldn't be in the business > > > of working around guest bugs. > > > > PSCI is vague on this, sadly. DEN0022D.b, 5.4.8 "Implementation > > responsibilities: Cache and coherency management states" that for > > CPU_SUSPEND, the PSCI implementation must perform a cache clean > > operation before entering the powerdown state. I don't see any reason > > why SYSTEM_SUSPEND should be excluded from this requirement. > > I'm not sure that's the case. CPU_SUSPEND may not use the resume > entry-point if the suspend results is a shallower state than expected > (i.e. the call just returns instead of behaving like a CPU boot). > > However, a successful SYSTEM_SUSPEND always results in the deepest > possible state. The guest should know that. There is also the fact > that performing a full clean to the PoC is going to be pretty > expensive, and I'd like to avoid that. > > I'll try and reach out to some of the ARM folks for clarification on > the matter. That'd be very helpful! -- Thanks, Oliver