Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

Nadav Amit <nadav.amit@xxxxxxxxx> · Mon, 25 Oct 2021 09:19:35 -0700

> On Oct 25, 2021, at 7:20 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> 
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> access_error() currently does not check for execution permission
>> violation. 
> Ye
> 
>> As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
> 
> While I could totally believe that something is goofy when VMAs are
> being changed underneath a page fault, I'm having trouble figuring out
> why the "if (error_code & X86_PF_WRITE)" code is being modified.

In the scenario I mentioned the VMAs are not changed underneath the
page-fault. They change *before* the page-fault, but there are
residues of the old PTE in the TLB. 

> 
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
> 
> Just to be clear, "promotion" is going from something like:
> 
> 	W=0->W=1
> or
> 	NX=1->NX=0
> 
> right?  I tend to call that "relaxing" permissions.

I specifically talk about NX=1>NX=0.

I can change the language to “relaxing”.

> 
> Currently, X86_PF_WRITE faults are considered an access error unless the
> VMA to which the write occurred allows writes.  Returning "no access
> error" permits continuing and handling the copy-on-write.
> 
> It sounds like you want to expand that.  You want to add a whole class
> of new faults that can be ignored: not just that some COW handling might
> be necessary, but that the PTE itself might be out of date.    Just like
> a "COW fault" may just result in setting the PTE.W=1 and moving on with
> our day, an instruction fault might now just end up with setting
> PTE.NX=0 and also moving on with our day.

You raise an interesting idea (which can easily be implemented with uffd),
but no - I had none of that in my mind.

My only purpose is to deal with actual spurious page-faults that I
encountered when I removed the TLB flush the happens after NX=1->NX=0.

I am actually surprised that the kernel makes such a strong assumption
that every change of NX=1->NX=0 would be followed by a TLB flush, and
that during these changes the mm is locked for write. But that is the
case. If you do not have this change and a PTE is changed from
NX=1->NX=0 and *later* you access the page, you can have a page-fault
due to stale PTE, and get a SIGSEGV since access_error() is wrong to
assume that this is an invalid access.

I did not change and there are no changes to the VMA during the
page-fault. The page-fault handler would do pretty much nothing and
return to user-space which would retry the instruction. [ page-fault
triggers an implicit TLB flush of the offending PTE ]

> 
> I'm really confused why the "error_code & X86_PF_WRITE" case is getting
> modified.  I would have expected it to be something like just adding:
> 
> 	/* read, instruction fetch */
> 	if (error_code & X86_PF_INSN) {
>                /* Avoid enforcing access error if spurious: */
>                if (unlikely(!(vma->vm_flags & VM_EXEC)))
>                        return 1;
>                return 0;
>        }
> 
> I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
> other than both being able to (now) be generated spuriously.

That was my first version, but I was concerned that perhaps there is
some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
be set. That is the reason that Peter asked you whether this is
something that might happen.

If you confirm they cannot be both set, I would the version you just
mentioned.