Excerpts from Fabiano Rosas's message of April 7, 2021 7:46 am:
> As one of the arguments of the H_ENTER_NESTED hypercall, the nested
> hypervisor (L1) prepares a structure containing the values of various
> hypervisor-privileged registers with which it wants the nested guest
> (L2) to run. Since the nested HV runs in supervisor mode it needs the
> host to write to these registers.
>
> To stop a nested HV manipulating this mechanism and using a nested
> guest as a proxy to access a facility that has been made unavailable
> to it, we have a routine that sanitises the values of the HV registers
> before copying them into the nested guest's vcpu struct.
>
> However, when coming out of the guest the values are copied as they
> were back into L1 memory, which means that any sanitisation we did
> during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
>
> This patch alters this sanitisation to have effect on the vcpu->arch
> registers directly before entering and after exiting the guest,
> leaving the structure that is copied back into L1 unchanged (except
> when we really want L1 to access the value, e.g. the Cause bits of
> HFSCR).
>
> Signed-off-by: Fabiano Rosas <farosas@xxxxxxxxxxxxx>

I like this direction better. Now "sanitise" may be the wrong word,
because you aren't cleaning the data in place any more but copying a
cleaned version across. Which is fine, but I wouldn't call it
sanitise. Actually I would prefer it just be done as part of the copy,
rather than copying first and then applying this (explained below in
the code).

> ---
> I'm taking another shot at fixing this locally without resorting to
> more complex things such as error handling and feature
> advertisement/negotiation.

That's okay, I think those are orthogonal problems. This won't help if
a nested HV tries to use some HFSCR feature it believes should be
available to a (say) POWER10 processor but got secretly masked away.
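To make the difference concrete, here is a minimal user-space sketch of the two approaches (toy struct and mask values, not the kernel's real hv_guest_state or HFSCR definitions): the v1 approach masks the L1-supplied struct in place, so the masking leaks back to L1 when that struct is copied out; the v2 approach leaves the struct untouched and writes the cleaned value straight into the vcpu state.

```c
#include <stdint.h>

/* Toy model; mask value and struct layout are illustrative only. */
#define L0_ALLOWED_BITS	0x0fULL	/* bits L0 lets L1 enable for L2 */

struct hv_state { uint64_t hfscr; };	/* stand-in for hv_guest_state */
struct vcpu	{ uint64_t hfscr; };	/* stand-in for vcpu->arch */

/* v1 approach: sanitise the L1-supplied struct in place. The cleaned
 * value is later copied back into L1 memory, exposing the masking. */
static void sanitise_in_place(struct hv_state *l2_hv)
{
	l2_hv->hfscr &= L0_ALLOWED_BITS;
}

/* v2 approach: leave l2_hv unchanged and write the cleaned value
 * directly into the vcpu state that actually enters the guest. */
static void sanitise_into_vcpu(struct vcpu *vcpu,
			       const struct hv_state *l2_hv)
{
	vcpu->hfscr = l2_hv->hfscr & L0_ALLOWED_BITS;
}
```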
But also such negotiation doesn't help with trying to minimise L0 HV
data accessible to guests. (As before, a guest can easily find out many
of these things if it is determined to, but that does not mean I'm
strongly against what you are doing here.)

>
> Changes since v1:
>
> - made the change more generic, not only applies to hfscr anymore;
> - sanitisation is now done directly on the vcpu struct, l2_hv is left unchanged;
>
> v1:
>
> https://lkml.kernel.org/r/20210305231055.2913892-1-farosas@xxxxxxxxxxxxx
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 33 +++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 0cd0e7aad588..a60fccb2c4f2 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -132,21 +132,37 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
>  	}
>  }
>
> -static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> +static void sanitise_vcpu_entry_state(struct kvm_vcpu *vcpu,
> +				      const struct hv_guest_state *l2_hv,
> +				      const struct hv_guest_state *l1_hv)
>  {
>  	/*
>  	 * Don't let L1 enable features for L2 which we've disabled for L1,
>  	 * but preserve the interrupt cause field.
>  	 */
> -	hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
> +	vcpu->arch.hfscr = l2_hv->hfscr & (HFSCR_INTR_CAUSE | l1_hv->hfscr);
>
>  	/* Don't let data address watchpoint match in hypervisor state */
> -	hr->dawrx0 &= ~DAWRX_HYP;
> -	hr->dawrx1 &= ~DAWRX_HYP;
> +	vcpu->arch.dawrx0 = l2_hv->dawrx0 & ~DAWRX_HYP;
> +	vcpu->arch.dawrx1 = l2_hv->dawrx1 & ~DAWRX_HYP;
>
>  	/* Don't let completed instruction address breakpt match in HV state */
> -	if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> -		hr->ciabr &= ~CIABR_PRIV;
> +	if ((l2_hv->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> +		vcpu->arch.ciabr = l2_hv->ciabr & ~CIABR_PRIV;
> +}
> +
> +
> +/*
> + * During sanitise_vcpu_entry_state() we might have used bits from L1
> + * state to restrict what the L2 state is allowed to be. Since L1 is
> + * not allowed to read the HV registers, do not include these
> + * modifications in the return state.
> + */
> +static void sanitise_vcpu_return_state(struct kvm_vcpu *vcpu,
> +				       const struct hv_guest_state *l2_hv)
> +{
> +	vcpu->arch.hfscr = ((~HFSCR_INTR_CAUSE & l2_hv->hfscr) |
> +			    (HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
>  }
>
>  static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> @@ -324,9 +340,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
>  		LPCR_LPES | LPCR_MER;
>  	lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
> -	sanitise_hv_regs(vcpu, &l2_hv);
>  	restore_hv_regs(vcpu, &l2_hv);
>
> +	sanitise_vcpu_entry_state(vcpu, &l2_hv, &saved_l1_hv);

So instead of doing this, can we just have one function that does
load_hv_regs_for_l2()?

> +
>  	vcpu->arch.ret = RESUME_GUEST;
>  	vcpu->arch.trap = 0;
>  	do {
> @@ -338,6 +355,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  		r = kvmhv_run_single_vcpu(vcpu, hdec_exp, lpcr);
>  	} while (is_kvmppc_resume_guest(r));
>
> +	sanitise_vcpu_return_state(vcpu, &l2_hv);

And this could be done in save_hv_return_state(). I think?
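For reference, sanitise_vcpu_return_state() above is an instance of the usual masked-merge bit pattern (a & ~m) | (b & m): everything outside the mask comes from the L1-supplied value, and only the cause field comes from the vcpu state that actually ran the guest. A self-contained sketch (the mask value here is illustrative; the real HFSCR_INTR_CAUSE lives in the arch headers):

```c
#include <stdint.h>

/* Illustrative stand-in for HFSCR_INTR_CAUSE (the cause field is in
 * the high byte of the HFSCR; exact value per the arch headers). */
#define CAUSE_MASK 0xff00000000000000ULL

/* Merge as in sanitise_vcpu_return_state(): non-cause bits from the
 * L1-supplied hfscr, cause bits from the vcpu's post-run hfscr. */
static uint64_t merge_return_hfscr(uint64_t l2_hv_hfscr, uint64_t vcpu_hfscr)
{
	return (l2_hv_hfscr & ~CAUSE_MASK) | (vcpu_hfscr & CAUSE_MASK);
}
```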
Question about HFSCR: is it possible for an interrupt cause bit to
reach the nested hypervisor for a facility that we thought we had
enabled but was secretly masked off? I.e., do we have to filter HFSCR
causes according to the facilities we secretly disabled?

Thanks,
Nick
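A sketch of the kind of check that question implies, as a hypothetical helper (everything here is illustrative: the function name, and the assumption that cause code n maps to HFSCR facility bit n; the real cause encoding is per the ISA):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: did L1 ask for a facility that L0 secretly
 * masked away, so that a Facility Unavailable cause for it would
 * surprise L1? Assumes cause code n corresponds to HFSCR bit n,
 * which is illustrative only. */
static bool cause_is_secretly_disabled(unsigned int cause,
				       uint64_t l1_requested_hfscr,
				       uint64_t l0_allowed_hfscr)
{
	uint64_t bit = 1ULL << cause;

	/* true when L1 enabled the facility but L0 masked it off */
	return (l1_requested_hfscr & bit) && !(l0_allowed_hfscr & bit);
}
```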