On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote: > > On Wed, 3 Apr 2024 09:58:12 -0700 > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote: > > > > > > On Wed, 3 Apr 2024 11:47:41 +0200 > > > Jiri Olsa <olsajiri@xxxxxxxxx> wrote: > > > > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > > > Hi Jiri, > > > > > > > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > > > > Jiri Olsa <jolsa@xxxxxxxxxx> wrote: > > > > > > > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > > > > > > > This is interesting approach. But I doubt we need to add additional > > > > > syscall just for this purpose. Can't we use another syscall or ioctl? > > > > > > > > so the plan is to optimize entry uprobe in a similar way and given > > > > the syscall is not a scarce resource I wanted to add another syscall > > > > for that one as well > > > > > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > > > > possible to do that, the trampoline will just have to save one or > > > > more additional registers, but adding new syscall seems cleaner to me > > > > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. > > > > I think both ptrace and prctl are for completely different use cases > > and it would be an abuse of existing API to reuse them for uretprobe > > tracing. Also, keep in mind, that any extra argument that has to be > > passed into this syscall means that we need to complicate and slow > > generated assembly code that is injected into user process (to > > save/restore registers) and also kernel-side (again, to deal with all > > the extra registers that would be stored/restored on stack). > > > > Given syscalls are not some kind of scarce resources, what's the > > downside to have a dedicated and simple syscall? > > Syscalls are explicitly exposed to user space, thus, even if it is used > ONLY for a very specific situation, it is an official kernel interface, > and need to care about the compatibility. (If it causes SIGILL unless > a specific use case, I don't know there is a "compatibility".) Check rt_sigreturn syscall (manpage at [0], for example). sigreturn() exists only to allow the implementation of signal handlers. It should never be called directly. (Indeed, a simple sigreturn() wrapper in the GNU C library simply returns -1, with errno set to ENOSYS.) Details of the arguments (if any) passed to sigreturn() vary depending on the architecture. (On some architectures, such as x86-64, sigreturn() takes no arguments, since all of the information that it requires is available in the stack frame that was previously created by the kernel on the user-space stack.) This is a very similar use case. Also, check its source code in arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process on any sign of something not being right. It's exactly the same with sys_uretprobe. [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html > And the number of syscalls are limited resource. We have almost 500 of them, it didn't seems like adding 1-2 for good reasons would be a problem. Can you please point to where the limits on syscalls as a resource are described? I'm curious to learn. > > I'm actually not sure how much we need to care of it, but adding a new > syscall is worth to be discussed carefully because all of them are > user-space compatibility. Absolutely, it's a good discussion to have. > > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is > > > > > > > > right, I'll check on syzkaller > > > > > > > > > set in the user function, what happen if the user function directly > > > > > calls this syscall? (maybe it consumes shadow stack?) > > > > > > > > the process should receive SIGILL if there's no pending uretprobe for > > > > the current task, or it will trigger uretprobe if there's one pending > > > > > > No, that is too aggressive and not safe. Since the syscall is exposed to > > > user program, it should return appropriate error code instead of SIGILL. > > > > > > > This is the way it is today with uretprobes even through interrupt. > > I doubt that the interrupt (exception) and syscall should be handled > differently. Especially, this exception is injected by uprobes but > syscall will be caused by itself. But syscall can be called from user > program (of couse this works as sys_kill(self, SIGILL)). Yep, I'd keep the behavior the same between uretprobes implemented through int3 and sys_uretprobe. > > > E.g., it could happen that user process is using fibers and is > > replacing stack pointer without kernel realizing this, which will > > trigger some defensive checks in uretprobe handling code and kernel > > will send SIGILL because it can't support such cases. This is > > happening today already, and it works fine in practice (except for > > applications that manually change stack pointer, too bad, you can't > > trace them with uretprobes, unfortunately). > > OK, we at least need to document it. +1, yep > > > > > So I think it's absolutely adequate to have this behavior if the user > > process is *intentionally* abusing this API. > > Of course user expected that it is abusing. So at least we need to > add a document that this syscall number is reserved to uprobes and > user program must not use it. > Totally agree about documenting this. > > > > > > > > > > but we could limit the syscall to be executed just from the trampoline, > > > > that should prevent all the user space use cases, I'll do that in next > > > > version and add more tests for that > > > > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called > > > only from the trampoline, so it is natural to check the caller address. > > > (and uprobe should know where is the trampoline) > > > > > > Since the syscall is always exposed to the user program, it should > > > - Do nothing and return an error unless it is properly called. > > > - check the prerequisites for operation strictly. > > > I concern that new system calls introduce vulnerabilities. > > > > > > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other > > processes, only the process that is abusing the API. So any extra > > checks that would slow down this approach is an unnecessary overhead > > and complication that will never be useful in practice. > > I think at least it should check the caller address to ensure the > address is in the trampoline. > But anyway, uprobes itself can break the target process, so no one > might care if this system call breaks the process now. If we already have an expected range of addresses, then I think it's fine to do a quick unlikely() check. I'd be more concerned if we need to do another lookup or search to just validate this. I'm sure Jiri will figure it out. > > > > > Also note that sys_uretprobe is a kind of internal and unstable API > > and it is explicitly called out that its contract can change at any > > time and user space shouldn't rely on it. It's purely for the kernel's > > own usage. > > Is that OK to use a syscall as "internal" and "unstable" API? See above about rt_sigreturn. It seems like yes, for some highly specialized syscalls it is the case already. > > > > > So let's please keep it fast and simple. > > > > > > > Thank you, > > > > > > > > > > > > > > thanks, > > > > jirka > > > > > > > > > > > > > > > > > [...] > > > ([OT] If we can add syscall so casually, I would like to add sys_traceevent > for recording user space events :-) .) Have you proposed this upstream? :) I have no clue and no opinion about it... > > -- > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>