> On Oct 25, 2021, at 7:20 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 10/21/21 5:21 AM, Nadav Amit wrote: >> access_error() currently does not check for execution permission >> violation. > Ye > >> As a result, spurious page-faults due to execution permission >> violation cause SIGSEGV. > > While I could totally believe that something is goofy when VMAs are > being changed underneath a page fault, I'm having trouble figuring out > why the "if (error_code & X86_PF_WRITE)" code is being modified. In the scenario I mentioned the VMAs are not changed underneath the page-fault. They change *before* the page-fault, but there are residues of the old PTE in the TLB. > >> It appears not to be an issue so far, but the next patches avoid TLB >> flushes on permission promotion, which can lead to this scenario. nodejs >> for instance crashes when TLB flush is avoided on permission promotion. > > Just to be clear, "promotion" is going from something like: > > W=0->W=1 > or > NX=1->NX=0 > > right? I tend to call that "relaxing" permissions. I specifically talk about NX=1>NX=0. I can change the language to “relaxing”. > > Currently, X86_PF_WRITE faults are considered an access error unless the > VMA to which the write occurred allows writes. Returning "no access > error" permits continuing and handling the copy-on-write. > > It sounds like you want to expand that. You want to add a whole class > of new faults that can be ignored: not just that some COW handling might > be necessary, but that the PTE itself might be out of date. Just like > a "COW fault" may just result in setting the PTE.W=1 and moving on with > our day, an instruction fault might now just end up with setting > PTE.NX=0 and also moving on with our day. You raise an interesting idea (which can easily be implemented with uffd), but no - I had none of that in my mind. My only purpose is to deal with actual spurious page-faults that I encountered when I removed the TLB flush the happens after NX=1->NX=0. I am actually surprised that the kernel makes such a strong assumption that every change of NX=1->NX=0 would be followed by a TLB flush, and that during these changes the mm is locked for write. But that is the case. If you do not have this change and a PTE is changed from NX=1->NX=0 and *later* you access the page, you can have a page-fault due to stale PTE, and get a SIGSEGV since access_error() is wrong to assume that this is an invalid access. I did not change and there are no changes to the VMA during the page-fault. The page-fault handler would do pretty much nothing and return to user-space which would retry the instruction. [ page-fault triggers an implicit TLB flush of the offending PTE ] > > I'm really confused why the "error_code & X86_PF_WRITE" case is getting > modified. I would have expected it to be something like just adding: > > /* read, instruction fetch */ > if (error_code & X86_PF_INSN) { > /* Avoid enforcing access error if spurious: */ > if (unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > return 0; > } > > I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common > other than both being able to (now) be generated spuriously. That was my first version, but I was concerned that perhaps there is some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can be set. That is the reason that Peter asked you whether this is something that might happen. If you confirm they cannot be both set, I would the version you just mentioned.