Hi Marc, On Thu, Jul 11, 2024 at 01:16:42PM +0100, Marc Zyngier wrote: > On Thu, 11 Jul 2024 11:56:13 +0100, > Alexandru Elisei <alexandru.elisei@xxxxxxx> wrote: > > > > Hi, > > > > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote: > > > In order to plug the brokenness of our current AT implementation, > > > we need a SW walker that is going to... err.. walk the S1 tables > > > and tell us what it finds. > > > > > > Of course, it builds on top of our S2 walker, and share similar > > > concepts. The beauty of it is that since it uses kvm_read_guest(), > > > it is able to bring back pages that have been otherwise evicted. > > > > > > This is then plugged in the two AT S1 emulation functions as > > > a "slow path" fallback. I'm not sure it is that slow, but hey. > > > [..] > > > switch (op) { > > > case OP_AT_S1E1RP: > > > case OP_AT_S1E1WP: > > > + retry_slow = false; > > > fail = check_at_pan(vcpu, vaddr, &par); > > > break; > > > default: > > > goto nopan; > > > } > > > > For context, this is what check_at_pan() does: > > > > static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res) > > { > > u64 par_e0; > > int error; > > > > /* > > * For PAN-involved AT operations, perform the same translation, > > * using EL0 this time. Twice. Much fun. > > */ > > error = __kvm_at(OP_AT_S1E0R, vaddr); > > if (error) > > return error; > > > > par_e0 = read_sysreg_par(); > > if (!(par_e0 & SYS_PAR_EL1_F)) > > goto out; > > > > error = __kvm_at(OP_AT_S1E0W, vaddr); > > if (error) > > return error; > > > > par_e0 = read_sysreg_par(); > > out: > > *res = par_e0; > > return 0; > > } > > > > I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W > > regardless of the type of the access (read/write) in the PAN-aware AT > > instruction. Would you mind elaborating on that? > > Because that's the very definition of an AT S1E1{W,R}P instruction > when PAN is set. If *any* EL0 permission is set, then the translation > must equally fail. Just like a load or a store from EL1 would fail if > any EL0 permission is set when PSTATE.PAN is set. > > Since we cannot check for both permissions at once, we do it twice. > It is worth noting that we don't quite handle the PAN3 case correctly > (because we can't retrieve the *execution* property using AT). I'll > add that to the list of stuff to fix. > > > > > > + if (fail) { > > > + vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1); > > > + goto nopan; > > > + } > > > [..] > > > + if (par & SYS_PAR_EL1_F) { > > > + u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par); > > > + > > > + /* > > > + * If we get something other than a permission fault, we > > > + * need to retry, as we're likely to have missed in the PTs. > > > + */ > > > + if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM) > > > + retry_slow = true; > > > + } else { > > > + /* > > > + * The EL0 access succeded, but we don't have the full > > > + * syndrom information to synthetize the failure. Go slow. > > > + */ > > > + retry_slow = true; > > > + } > > > > This is what PSTATE.PAN controls: > > > > If the Effective value of PSTATE.PAN is 1, then a privileged data access from > > any of the following Exception levels to a virtual memory address that is > > accessible to data accesses at EL0 generates a stage 1 Permission fault: > > > > - A privileged data access from EL1. > > - If HCR_EL2.E2H is 1, then a privileged data access from EL2. > > > > With that in mind, I am really struggling to understand the logic. > > I don't quite see what you don't understand, you'll have to be more > precise. Are you worried about the page tables we're looking at, the > value of PSTATE.PAN, the permission fault, or something else? > > It also doesn't help that you're looking at the patch that contains > the integration with the slow-path, which is pretty hard to read (I > have a reworked version that's a bit better). You probably want to > look at the "fast" path alone. I was referring to checking both unprivileged read and write permissions. And you are right, sorry, I managed to get myself terribly confused. For completeness sake, this matches AArch64.S1DirectBasePermissions(), where if PAN && (UnprivRead || UnprivWrite) then PrivRead = False and PrivWrite = False. So you need to check that both UnprivRead and UnprivWrite are false for the PAN variants of AT to succeed. > > > > > If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual > > memory address is not accessible to EL0? Add that to the fact that the AT > > S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean > > that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be > > the one KVM got from AT S1E1{R,W}? > > There are plenty of ways for AT S1E0 to fail when AT S1E1 succeeded: > > - no EL0 permission: that's the best case, and the PAR_EL1 obtained > from the AT S1E1 is the correct one. That's what we return. Yes, that is correct, the place where VCPUs PAR_EL1 register is set is far enough from this code that I didn't make the connection. > > - The EL0 access failed, but for another reason than a permission > fault. This contradicts the EL1 walk, and is a sure sign that > someone is playing behind our back. We fail. > > - exception from AT S1E0: something went wrong (again the guest > playing with the PTs behind our back). We fail as well. > > Do you at least agree with these as goals? If you do, what in > the implementation does not satisfy these goals? If you don't, what in > these goals seem improper to you? I agree with the goals. In this patch, if I'm reading the code right (and I'm starting to doubt myself) if PAR_EL1.F is set and PAR_EL1 doesn't indicate a permissions fault, then KVM falls back to walking the S1 tables: if (par & SYS_PAR_EL1_F) { u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par); /* * If we get something other than a permission fault, we * need to retry, as we're likely to have missed in the PTs. */ if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM) retry_slow = true; } I suppose that's because KVM cannot distinguish between two very different reasons for AT failing: 1, because of something being wrong with the stage 1 tables when the AT S1E0* instruction was executed and 2, because of missing entries at stage 2, as per the comment. Is that correct? Thanks, Alex