On Tue, May 04, 2021 at 03:26:24PM +1000, Nicholas Piggin wrote: > Excerpts from Paul Mackerras's message of May 4, 2021 2:28 pm: > > On Sat, May 01, 2021 at 11:58:36AM +1000, Nicholas Piggin wrote: > >> Excerpts from Fabiano Rosas's message of April 16, 2021 9:09 am: > >> > As one of the arguments of the H_ENTER_NESTED hypercall, the nested > >> > hypervisor (L1) prepares a structure containing the values of various > >> > hypervisor-privileged registers with which it wants the nested guest > >> > (L2) to run. Since the nested HV runs in supervisor mode it needs the > >> > host to write to these registers. > >> > > >> > To stop a nested HV manipulating this mechanism and using a nested > >> > guest as a proxy to access a facility that has been made unavailable > >> > to it, we have a routine that sanitises the values of the HV registers > >> > before copying them into the nested guest's vcpu struct. > >> > > >> > However, when coming out of the guest the values are copied as they > >> > were back into L1 memory, which means that any sanitisation we did > >> > during guest entry will be exposed to L1 after H_ENTER_NESTED returns. > >> > > >> > This patch alters this sanitisation to have effect on the vcpu->arch > >> > registers directly before entering and after exiting the guest, > >> > leaving the structure that is copied back into L1 unchanged (except > >> > when we really want L1 to access the value, e.g the Cause bits of > >> > HFSCR). > >> > > >> > Signed-off-by: Fabiano Rosas <farosas@xxxxxxxxxxxxx> > >> > --- > >> > arch/powerpc/kvm/book3s_hv_nested.c | 55 ++++++++++++++++++----------- > >> > 1 file changed, 34 insertions(+), 21 deletions(-) > >> > > >> > diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c > >> > index 0cd0e7aad588..270552dd42c5 100644 > >> > --- a/arch/powerpc/kvm/book3s_hv_nested.c > >> > +++ b/arch/powerpc/kvm/book3s_hv_nested.c > >> > @@ -102,8 +102,17 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap, > >> > { > >> > struct kvmppc_vcore *vc = vcpu->arch.vcore; > >> > > >> > + /* > >> > + * When loading the hypervisor-privileged registers to run L2, > >> > + * we might have used bits from L1 state to restrict what the > >> > + * L2 state is allowed to be. Since L1 is not allowed to read > >> > + * the HV registers, do not include these modifications in the > >> > + * return state. > >> > + */ > >> > + hr->hfscr = ((~HFSCR_INTR_CAUSE & hr->hfscr) | > >> > + (HFSCR_INTR_CAUSE & vcpu->arch.hfscr)); > >> > + > >> > hr->dpdes = vc->dpdes; > >> > - hr->hfscr = vcpu->arch.hfscr; > >> > hr->purr = vcpu->arch.purr; > >> > hr->spurr = vcpu->arch.spurr; > >> > hr->ic = vcpu->arch.ic; > >> > >> Do we still have the problem here that hfac interrupts due to bits cleared > >> by the hfscr sanitisation would have the cause bits returned to the L1, > >> so in theory it could probe hfscr directly that way? I don't see a good > >> solution to this except either have the L0 intercept these faults and do > >> "something" transparent, or return error from H_ENTER_NESTED (which would > >> also allow trivial probing of the facilities). > > > > It seems to me that there are various specific reasons why L0 would > > clear HFSCR bits, and if we think about the specific reasons, what we > > should do becomes clear. (I say "L0" but in fact the same reasoning > > applies to any hypervisor that lets its guest do hypervisor-ish > > things.) > > > > 1. Emulating a version of the architecture which doesn't have the > > feature in question - in that case the bit should appear to L1 as a > > reserved bit in HFSCR (i.e. always read 0), the associated facility > > code should never appear in the top 8 bits of any HFSCR value that L1 > > sees, and any HFU interrupt received by L0 for the facility should be > > changed into an illegal instruction interrupt (or HEAI) forwarded to > > L1. In this case the real HFSCR should always have the enable bit for > > the facility set to 0. > > > > 2. Lazy save/restore of the state associated with a facility - in this > > case, while the system is in the "lazy" state (i.e. the state is not > > that of the currently running guest), the real HFSCR bit for the > > facility should be 0. On an HFU interrupt for the facility, L0 looks > > at L1's HFSCR value: if it's 0, forward the HFU interrupt to L1; if > > it's 1, load up the facility state, set the facility's bit in HFSCR, > > and resume the guest. > > > > 3. Emulating a facility in software - in this case, the real HFSCR > > bit for the facility would always be 0. On an HFU interrupt, L0 reads > > the instruction and emulates it, then resumes the guest. > > > > One thing this all makes clear is that the IC field of the "virtual" > > HFSCR value seen by L1 should only ever be changed when L0 forwards a > > HFU interrupt to L1. > > > > In fact we currently never do (1) or (2), and we only do (3) for > > msgsndp etc., so this discussion is mostly theoretical. > > Yeah it's somewhat theoretical, and I guess I mostly agree with you. > > Missing is the case where the L0 does not implement a feature at all. > Let's say TM is broken so it disables it, or nobody uses TAR so it > doesn't bother to switch it. I think that's the same as my case (1), where L0 is presenting an architecture to L1 that doesn't implement the feature. (Unless you mean that L0 is running on a machine that has features that didn't exist at the time that L0's code was written; to handle that, L0 should clear all HFSCR bits that it doesn't know about, and map any unknown HFSCR[IC] code into illegal interrupt/HEAI.) > In those cases what do you tell the L1 if it enables a bit that you > don't support at all, and it takes a fault? Illegal instruction interrupt, or HEAI. > I guess the right thing to do is advertise that to the guest by some > other means, and expect it does the right thing. And you could have > the proviso in the nested HV specification that the returned IC field > might trip for a feature you enabled in the L1 HFSCR. I think that is confusing; better to return illegal instruction/HEAI. > > > >> Returning an hfac interrupt to a hypervisor that thought it enabled the > >> bit would be strange. But so does appearing to modify the register > >> underneath it and then returning a fault. > > > > I don't think we should ever do either of those things. The closest > > would be (1) above, but in that case the fault has to be either an > > illegal instruction type program interrupt, or a HEAI. > > > >> I think the sanest thing would actually be to return failure from the > >> hcall. > > > > I don't think we should do that either. > > I still think it's preferable for case 4. No point waiting for the > guest to boot and some user program eventually hits a bad instruction, > even if it was due to some host vs guest configuration problem. If you let it boot to userspace, the administrator has a better chance of recovering the situation. Paul.