Re: [PATCH v2 0/5] implement lightweight guard pages

Dmitry Vyukov <dvyukov@xxxxxxxxxx> · Wed, 23 Oct 2024 10:56:33 +0200

On Wed, 23 Oct 2024 at 10:12, Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> +cc Linus as I reference a commit of his below...
>
> On Wed, Oct 23, 2024 at 09:19:03AM +0200, David Hildenbrand wrote:
> > On 23.10.24 08:24, Dmitry Vyukov wrote:
> > > Hi Florian, Lorenzo,
> > >
> > > This looks great!
>
> Thanks!
>
> > >
> > > What I am VERY interested in is if poisoned pages cause SIGSEGV even when
> > > the access happens in the kernel. Namely, the syscall still returns EFAULT,
> > > but also SIGSEGV is queued on return to user-space.
>
> Yeah we don't in any way.
>
> I think adding something like this would be a bit of its own project.

I can totally understand this.
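
For anyone following along, here is a minimal userspace sketch of the
current behaviour I am asking about. It is not taken from the series. It
uses a plain PROT_NONE mapping as a stand-in for a guard page (I am assuming
that the kernel-access path behaves the same for both), and the names
probe_kernel_access_to_guard() and on_sigsegv() are made up for
illustration. The kernel is asked to read one byte from the inaccessible
page via write(2) on a pipe:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Set by the handler if a SIGSEGV is ever delivered to this process. */
volatile sig_atomic_t sigsegv_seen;

void on_sigsegv(int sig)
{
	(void)sig;
	sigsegv_seen = 1;
}

/*
 * Hypothetical helper: let the kernel touch an inaccessible page on our
 * behalf. Returns the errno reported by write(2), 0 if the write
 * unexpectedly succeeded, or -1 if the setup itself failed.
 */
int probe_kernel_access_to_guard(void)
{
	struct sigaction sa = { 0 };
	long page = sysconf(_SC_PAGESIZE);
	int pipefd[2];
	int err = 0;
	void *guard;

	sa.sa_handler = on_sigsegv;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL))
		return -1;
	if (pipe(pipefd))
		return -1;

	/* PROT_NONE stand-in for a guard page (assumption, see above). */
	guard = mmap(NULL, page, PROT_NONE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (guard == MAP_FAILED) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	/* The kernel has to copy one byte out of the inaccessible page. */
	if (write(pipefd[1], guard, 1) < 0)
		err = errno;

	munmap(guard, page);
	close(pipefd[0]);
	close(pipefd[1]);
	return err;
}
```

On the kernels I would expect this to run on, the call returns EFAULT and
sigsegv_seen stays 0, so a tool that only installs a SIGSEGV handler never
learns about the bad access.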

> The fault handler for this is in handle_pte_marker() in mm/memory.c, where
> we do the following:
>
>         /* Hitting a guard page is always a fatal condition. */
>         if (marker & PTE_MARKER_GUARD)
>                 return VM_FAULT_SIGSEGV;
>
> So basically we pass this back to whoever invoked the fault. For uaccess we
> end up in arch-specific code that eventually checks exception tables
> etc. and for x86-64 that's kernelmode_fixup_or_oops().
>
> There used to be a sig_on_uaccess_err in the x86-specific thread_struct
> that let you propagate it but Linus pulled it out in commit 02b670c1f88e
> ("x86/mm: Remove broken vsyscall emulation code from the page fault code")
> where it was presumably used for vsyscall.
>
> Of course we could just get something much higher up the stack to send the
> signal, but we'd need to be careful we weren't breaking anything doing
> it...

Can setting TIF_NOTIFY_RESUME and then doing the rest when returning
to userspace help here?

> I address GUP below.
>
> > >
> > > Catching bad accesses in system calls is currently the weak spot for
> > > all user-space bug detection tools (GWP-ASan, libefence, libefency, etc).
> > > It's almost possible with userfaultfd, but catching faults in the kernel
> > > requires admin capability, so not really an option for generic bug
> > > detection tools (+inconvenience of userfaultfd setup/handler).
> > > Intercepting all EFAULT from syscalls is not generally possible
> > > (w/o ptrace, usually not an option as well), and EFAULT does not always
> > > mean a bug.
> > >
> > > Triggering SIGSEGV even in syscalls would be not just a performance
> > > optimization, but a new useful capability that would allow these tools
> > > to catch more bugs.
> >
> > Right, we discussed that offline also as a possible extension to the
> > userfaultfd SIGBUS mode.
> >
> > I did not look into that yet, but I was wondering if there could be cases where
> > a different process could trigger that SIGSEGV, and how to (and if to)
> > handle that.
> >
> > For example, ptrace (access_remote_vm()) -> GUP likely can trigger that. I
> > think with userfaultfd() we will currently return -EFAULT, because we call
> > get_user_page_vma_remote() that is not prepared for dropping the mmap lock.
> > Possibly that is the right thing to do, but not sure :)

That's a good corner case.
I guess also process_vm_readv/writev.
Not triggering the signal in these cases looks like the right thing to do.

> > These "remote" faults set FOLL_REMOTE -> FAULT_FLAG_REMOTE, so we might be
> > able to distinguish them and perform different handling.
>
> So all GUP will return -EFAULT when hitting guard pages unless we change
> something.
>
> In GUP we handle this in faultin_page():
>
>         if (ret & VM_FAULT_ERROR) {
>                 int err = vm_fault_to_errno(ret, flags);
>
>                 if (err)
>                         return err;
>                 BUG();
>         }
>
> And vm_fault_to_errno() is:
>
> static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> {
>         if (vm_fault & VM_FAULT_OOM)
>                 return -ENOMEM;
>         if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
>                 return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
>         if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
>                 return -EFAULT;
>         return 0;
> }
>
> Again, I think if we wanted special handling here we'd probably need to
> propagate that fault from higher up. For one, we'd definitely have to avoid
> doing so if it's remote, but I worry about other cases.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Overall while I sympathise with this, it feels dangerous and a pretty major
> change, because there'll be something somewhere that will break because it
> expects faults to be swallowed that we no longer do swallow.
>
> So I'd say it'd be something we should defer, but of course it's a highly
> user-facing change so how easy that would be I don't know.
>
> But I definitely don't think an 'introduce the ability to do cheap PROT_NONE
> guards' series is the place to also fundamentally change how user access
> page faults are handled within the kernel :)

Will delivering signals on kernel access be a backwards compatible
change? Or will we need a different API? MADV_GUARD_POISON_KERNEL?
It's just somewhat painful to detect/update all userspace if we add
this feature in the future. Can we say signal delivery on kernel accesses
is unspecified?