> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 3/11/22 11:07, Nadav Amit wrote: >> From: Nadav Amit <namit@xxxxxxxxxx> >> >> access_error() currently does not check for execution permission >> violation. As a result, spurious page-faults due to execution permission >> violation cause SIGSEGV. > > This is a bit muddy on the problem statement. I get that spurious > faults can theoretically cause this, but *do* they in practice on > current kernels? > >> It appears not to be an issue so far, but the next patches avoid TLB >> flushes on permission promotion, which can lead to this scenario. nodejs >> for instance crashes when TLB flush is avoided on permission promotion. > > By "it appears not to be an issue", do you mean that this suboptimal > behavior can not be triggered, period? Or, it can be triggered but > folks seem not to care that it can be triggered? > > I *think* these can be triggered today. I think it takes two threads > that do something like: > > Thread 1 Thread 2 > ======== ======== > ptr = malloc(); > memcpy(ptr, &code, len); > exec_now = 1; > while (!exec_now); > call(ptr); > // fault > mprotect(ptr, PROT_EXEC, len); > // fault sees VM_EXEC > > > But that has a bug: exec_now is set before the mprotect(). It's not > sane code. > > Can any sane code trigger this? Well, regarding this question and the previous one: I do not think that this scenario is possible today since mprotect() holds the mmap_lock for write. There is no other code that I am aware of that toggles the NX bit on a present entry. But I will not bet my life on it. That’s the reason for the somewhat vague phrasing that I used. >> >> index d0074c6ed31a..ad0ef0a6087a 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) >> (error_code & X86_PF_INSTR), foreign)) >> return 1; >> >> - if (error_code & X86_PF_WRITE) { >> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { >> + /* >> + * CPUs are not expected to set the two error code bits >> + * together, but to ensure that hypervisors do not misbehave, >> + * run an additional sanity check. >> + */ >> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == >> + (X86_PF_WRITE|X86_PF_INSTR)) { >> + WARN_ON_ONCE(1); >> + return 1; >> + } > > access_error() is only used on the do_user_addr_fault() side of things. > Can we stick this check somewhere that also works for kernel address > faults? This is a generic sanity check. It can also be in a separate > patch. I can wrap it in a different function and also call it from do_kern_addr_fault() or spurious_kernel_fault(). Anyhow, spurious_kernel_fault() should handle spurious faults on executable code correctly. > > Also, we should *probably* stop talking about CPUs here. If there's > ever something wonky with error code bits, I'd put my money on a weird > hypervisor before any kind of CPU issue. I thought I manage to convey exactly that in the comment. Can you provide a better phrasing? > >> /* write, present and write, not present: */ >> - if (unlikely(!(vma->vm_flags & VM_WRITE))) >> + if ((error_code & X86_PF_WRITE) && >> + unlikely(!(vma->vm_flags & VM_WRITE))) >> + return 1; >> + >> + /* exec, present and exec, not present: */ >> + if ((error_code & X86_PF_INSTR) && >> + unlikely(!(vma->vm_flags & VM_EXEC))) >> return 1; >> + >> return 0; >> } > > This is getting really ugly. I think we've gone over this before, but > it escapes me. Why do we need a common (X86_PF_WRITE | X86_PF_INSTR) > block of code? Why can't we just add a simple X86_PF_INSN if() that > mirrors the current X86_PF_WRITE one? > > > if (error_code & X86_PF_INSN) { > /* present and not exec: */ > if (unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > return 0; > } You are correct. My bad. I will fix it.