Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 3 Apr 2024 19:00:07 -0700

On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Wed, 3 Apr 2024 09:58:12 -0700
> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > Hi Jiri,
> > > > >
> > > > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > > > Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > > > >
> > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > >
> > > > > This is interesting approach. But I doubt we need to add additional
> > > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > > >
> > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > for that one as well
> > > >
> > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > > possible to do that, the trampoline will just have to save one or
> > > > more additional registers, but adding new syscall seems cleaner to me
> > >
> > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> >
> > I think both ptrace and prctl are for completely different use cases
> > and it would be an abuse of existing API to reuse them for uretprobe
> > tracing. Also, keep in mind, that any extra argument that has to be
> > passed into this syscall means that we need to complicate and slow
> > generated assembly code that is injected into user process (to
> > save/restore registers) and also kernel-side (again, to deal with all
> > the extra registers that would be stored/restored on stack).
> >
> > Given syscalls are not some kind of scarce resources, what's the
> > downside to have a dedicated and simple syscall?
>
> Syscalls are explicitly exposed to user space, thus, even if it is used
> ONLY for a very specific situation, it is an official kernel interface,
> and need to care about the compatibility. (If it causes SIGILL unless
> a specific use case, I don't know there is a "compatibility".)

Check rt_sigreturn syscall (manpage at [0], for example).

       sigreturn() exists only to allow the implementation of signal
       handlers.  It should never be called directly.  (Indeed, a simple
       sigreturn() wrapper in the GNU C library simply returns -1, with
       errno set to ENOSYS.)  Details of the arguments (if any) passed
       to sigreturn() vary depending on the architecture.  (On some
       architectures, such as x86-64, sigreturn() takes no arguments,
       since all of the information that it requires is available in the
       stack frame that was previously created by the kernel on the
       user-space stack.)

This is a very similar use case. Also, check its source code in
arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
on any sign of something not being right. It's exactly the same with
sys_uretprobe.

  [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html

> And the number of syscalls are limited resource.

We have almost 500 of them, it didn't seems like adding 1-2 for good
reasons would be a problem. Can you please point to where the limits
on syscalls as a resource are described? I'm curious to learn.

>
> I'm actually not sure how much we need to care of it, but adding a new
> syscall is worth to be discussed carefully because all of them are
> user-space compatibility.

Absolutely, it's a good discussion to have.

>
> > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > >
> > > > right, I'll check on syzkaller
> > > >
> > > > > set in the user function, what happen if the user function directly
> > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > >
> > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > the current task, or it will trigger uretprobe if there's one pending
> > >
> > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > user program, it should return appropriate error code instead of SIGILL.
> > >
> >
> > This is the way it is today with uretprobes even through interrupt.
>
> I doubt that the interrupt (exception) and syscall should be handled
> differently. Especially, this exception is injected by uprobes but
> syscall will be caused by itself. But syscall can be called from user
> program (of couse this works as sys_kill(self, SIGILL)).

Yep, I'd keep the behavior the same between uretprobes implemented
through int3 and sys_uretprobe.

>
> > E.g., it could happen that user process is using fibers and is
> > replacing stack pointer without kernel realizing this, which will
> > trigger some defensive checks in uretprobe handling code and kernel
> > will send SIGILL because it can't support such cases. This is
> > happening today already, and it works fine in practice (except for
> > applications that manually change stack pointer, too bad, you can't
> > trace them with uretprobes, unfortunately).
>
> OK, we at least need to document it.

+1, yep

>
> >
> > So I think it's absolutely adequate to have this behavior if the user
> > process is *intentionally* abusing this API.
>
> Of course user expected that it is abusing. So at least we need to
> add a document that this syscall number is reserved to uprobes and
> user program must not use it.
>

Totally agree about documenting this.

> >
> > > >
> > > > but we could limit the syscall to be executed just from the trampoline,
> > > > that should prevent all the user space use cases, I'll do that in next
> > > > version and add more tests for that
> > >
> > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > only from the trampoline, so it is natural to check the caller address.
> > > (and uprobe should know where is the trampoline)
> > >
> > > Since the syscall is always exposed to the user program, it should
> > > - Do nothing and return an error unless it is properly called.
> > > - check the prerequisites for operation strictly.
> > > I concern that new system calls introduce vulnerabilities.
> > >
> >
> > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > processes, only the process that is abusing the API. So any extra
> > checks that would slow down this approach is an unnecessary overhead
> > and complication that will never be useful in practice.
>
> I think at least it should check the caller address to ensure the
> address is in the trampoline.
> But anyway, uprobes itself can break the target process, so no one
> might care if this system call breaks the process now.

If we already have an expected range of addresses, then I think it's
fine to do a quick unlikely() check. I'd be more concerned if we need
to do another lookup or search to just validate this. I'm sure Jiri
will figure it out.

>
> >
> > Also note that sys_uretprobe is a kind of internal and unstable API
> > and it is explicitly called out that its contract can change at any
> > time and user space shouldn't rely on it. It's purely for the kernel's
> > own usage.
>
> Is that OK to use a syscall as "internal" and "unstable" API?

See above about rt_sigreturn. It seems like yes, for some highly
specialized syscalls it is the case already.
>
> >
> > So let's please keep it fast and simple.
> >
> >
> > > Thank you,
> > >
> > >
> > > >
> > > > thanks,
> > > > jirka
> > > >
> > > >
> > > > >
> >
> > [...]
>
>
> ([OT] If we can add syscall so casually, I would like to add sys_traceevent
> for recording user space events :-) .)

Have you proposed this upstream? :) I have no clue and no opinion about it...

>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>