> On Mar 11, 2022, at 12:59 PM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 3/11/22 12:38, Nadav Amit wrote: >>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > ... >>> Can any sane code trigger this? >> >> Well, regarding this question and the previous one: I do not think that >> this scenario is possible today since mprotect() holds the mmap_lock >> for write. There is no other code that I am aware of that toggles >> the NX bit on a present entry. >> >> But I will not bet my life on it. That’s the reason for the somewhat >> vague phrasing that I used. > > From the userspace perspective, mmap(MAP_FIXED) can do this too. But, > sane userspace can't rely on the syscall to have done any work and the > TLB flushing is currently done before the syscall returns. > > I'd put it this way: > > Today, it is possible for a thread to end up in access_error() > for a PF_INSN fault and observe a VM_EXEC VMA. If you are > generous, this could be considered a spurious fault. > > However, the faulting thread would have had to race with the > thread which was changing the PTE and the VMA and is currently > *in* mprotect() (or some other syscall). In other words, the > faulting thread can encounter this situation, but it never had > any assurance from the kernel that it wouldn't fault. This is > because the faulting thread never had a chance to observe the > syscall return. > > There is no evidence that the existing behavior can cause any > issues with sane userspace. Done. Thanks. > >>>> index d0074c6ed31a..ad0ef0a6087a 100644 >>>> --- a/arch/x86/mm/fault.c >>>> +++ b/arch/x86/mm/fault.c >>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) >>>> (error_code & X86_PF_INSTR), foreign)) >>>> return 1; >>>> >>>> - if (error_code & X86_PF_WRITE) { >>>> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { >>>> + /* >>>> + * CPUs are not expected to set the two error code bits >>>> + * together, but to ensure that hypervisors do not misbehave, >>>> + * run an additional sanity check. >>>> + */ >>>> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == >>>> + (X86_PF_WRITE|X86_PF_INSTR)) { >>>> + WARN_ON_ONCE(1); >>>> + return 1; >>>> + } >>> >>> access_error() is only used on the do_user_addr_fault() side of things. >>> Can we stick this check somewhere that also works for kernel address >>> faults? This is a generic sanity check. It can also be in a separate >>> patch. >> >> I can wrap it in a different function and also call it from >> do_kern_addr_fault() or spurious_kernel_fault(). >> >> Anyhow, spurious_kernel_fault() should handle spurious faults on >> executable code correctly. > > This is really about checking the sanity of the "hardware"-provided > error code. Let's just do it in handle_page_fault(), maybe hidden in a > function like: > > void check_error_code_sanity(unsigned long error_code) > { > WARN_ON_ONCE(...); > } > > You can leave the X86_PF_PK check in place for now. It's probably going > away soon anyway. Done. Thanks. But note that removing the check from access_error() means that if the assertion is broken, userspace might crash inadvertently (in contrast to the version I sent, which would have potentially led to infinite stream of page-faults). I don’t know which behavior is better, so let’s go with your version and just hope it doesn’t happen. > >>> Also, we should *probably* stop talking about CPUs here. If there's >>> ever something wonky with error code bits, I'd put my money on a weird >>> hypervisor before any kind of CPU issue. >> >> I thought I manage to convey exactly that in the comment. Can you provide >> a better phrasing? > > Maybe: > > /* > * X86_PF_INSTR for instruction _fetches_. Fetches never write. > * X86_PF_WRITE should never be set with X86_PF_INSTR. > * > * This is most likely due to a buggy hypervisor. > */ Done, thank you.