Hello Gabriel

This is looking much better. Thank you! I have a few
more comments still.

On 12/28/20 6:38 PM, Gabriel Krisman Bertazi wrote:
> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx>
>
> ---
> Changes since v5:
> (suggested by Michael Kerrisk)
>   - Change () punctuation
>   - fix grammar
>   - Add information about interception, return and return value
>
> Changes since v4:
> (suggested by Michael Kerrisk)
>   - Modify explanation of what dispatch to user space means.
>   - Drop references to emulation.
>   - Document suggestion about placing libc in allowed-region.
>   - Comment about avoiding syscall cost.
> Changes since v3:
> (suggested by Michael Kerrisk)
>   - Explain what dispatch to user space means.
>   - Document the fact that the memory region is a single consecutive
>     range.
>   - Explain failure if *arg5 is set to a bad value.
>   - fix english typo.
>   - Define what 'invalid memory region' means.
>
> Changes since v2:
> (suggested by Alejandro Colomar)
>   - selective -> selectively
>   - Add missing oxford comma.
>
> Changes since v1:
> (suggested by Alejandro Colomar)
>   - Use semantic lines
>   - Fix usage of .{B|I}R and .{B|I}
>   - Don't format literals
>   - Fix preferred spelling of userspace
>   - Fix case of word
> ---
>  man2/prctl.2 | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 159 insertions(+)
>
> diff --git a/man2/prctl.2 b/man2/prctl.2
> index f25f05fdb593..0a0abfb78055 100644
> --- a/man2/prctl.2
> +++ b/man2/prctl.2
> @@ -1533,6 +1533,135 @@ For more information, see the kernel source file
>  (or
>  .I Documentation/arm64/sve.txt
>  before Linux 5.3).
> +.TP
> +.\" prctl PR_SET_SYSCALL_USER_DISPATCH
> +.\" commit 1446e1df9eb183fdf81c3f0715402f1d7595d4
> +.BR PR_SET_SYSCALL_USER_DISPATCH " (since Linux 5.11, x86 only)"
> +.IP
> +Configure the Syscall User Dispatch mechanism
> +for the calling thread.
> +This mechanism allows an application
> +to selectively intercept system calls
> +so that they can be handled within the application itself.
> +Interception takes the form of a thread-directed
> +.B SIGSYS
> +signal that is delivered to the thread
> +when it makes a system call.
> +If intercepted,
> +the system call is not executed by the kernel.
> +.IP
> +The current Syscall User Dispatch mode is selected via
> +.IR arg2 ,
> +which can either be set to
> +.B PR_SYS_DISPATCH_ON
> +to enable the feature,
> +or to
> +.B PR_SYS_DISPATCH_OFF
> +to turn it off.

So, I realize now that I'm slightly confused.

The value of arg2 can be either PR_SYS_DISPATCH_ON or
PR_SYS_DISPATCH_OFF. The value of the selector pointed to by
arg5 can likewise be PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.
What is the relationship between these two attributes? For example,
what does it mean if arg2 is PR_SYS_DISPATCH_ON and, at the time of
the prctl() call, the selector has the value PR_SYS_DISPATCH_OFF?

> +.IP
> +When
> +.I arg2
> +is set to
> +.BR PR_SYS_DISPATCH_ON ,
> +.I arg3
> +and
> +.I arg4
> +respectively identify the
> +.I offset
> +and
> +.I length
> +of a single contiguous memory region in the process map

Better: s/map/address space/ ?

> +from where system calls are always allowed to be executed,
> +regardless of the switch variable

s/variable/variable./

> +(Typically, this area would include the area of memory
> +containing the C library.)

I think just to ease readability (smaller paragraphs), insert
.IP
here.

> +.I arg5
> +points to a char-sized variable
> +that is a fast switch to enable/disable the mechanism
> +without the overhead of doing a system call.
> +The variable pointed by
> +.I arg5
> +can either be set to
> +.B PR_SYS_DISPATCH_ON
> +to enable the mechanism
> +or to
> +.B PR_SYS_DISPATCH_OFF
> +to temporarily disable it.
> +This value is checked by the kernel
> +on every system call entry,
> +and any unexpected value will raise
> +an uncatchable
> +.B SIGSYS
> +at that time,
> +killing the application.
> +.IP
> +When a system call is intercepted,
> +the kernel sends a thread-directed
> +.B SIGSYS
> +signal to the triggering thread.
> +Various fields will be set in the
> +.I siginfo_t
> +structure (see
> +.BR sigaction (2))
> +associated with the signal:
> +.RS
> +.IP * 3
> +.I si_signo
> +will contain
> +.BR SIGSYS .
> +.IP *
> +.IR si_call_addr
> +will show the address of the system call instruction.
> +.IP *
> +.IR si_syscall
> +and
> +.IR si_arch
> +will indicate which system call was attempted.
> +.IP *
> +.I si_code
> +will contain
> +.BR SYS_USER_DISPATCH .
> +.IP *
> +.I si_errno
> +will be set to 0.
> +.RE
> +.IP
> +The program counter will be as though the system call happened
> +(i.e., the program counter will not point to the system call instruction).
> +.IP
> +When the signal handler returns to the kernel,
> +the system call completes immediately
> +and returns to the calling thread,
> +without actually being executed.
> +If necessary
> +(i.e., when emulating the system call on user space.),
> +the signal handler should set the system call return value
> +to a sane value,
> +by modifying the register context stored in the
> +.I ucontext
> +argument of the signal handler.

Just for my own education, do you have any example code somewhere
that demonstrates setting the syscall return value?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
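
On the question of setting the syscall return value: the following is
only a minimal sketch, not code from the patch or from the kernel tree.
It assumes x86-64 with a glibc-style ucontext_t, where the value a
system call returns lives in RAX, so the handler writes the desired
result into the saved RAX slot of the register context. The names
set_syscall_return(), sigsys_handler() and SUD_RAX_INDEX are
hypothetical, and the fallback index 13 is an assumption based on
glibc's x86-64 headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* asks the C library to expose REG_RAX */
#endif
#include <errno.h>
#include <signal.h>
#include <ucontext.h>

/* Index of RAX in gregs[]. Use the C library's REG_RAX when it is
 * visible; the fallback value 13 is an assumption (glibc, x86-64). */
#if defined(__x86_64__)
# ifdef REG_RAX
#  define SUD_RAX_INDEX REG_RAX
# else
#  define SUD_RAX_INDEX 13
# endif
#endif

/* Hypothetical helper: store 'ret' where the interrupted thread will
 * read the system call's return value once the handler returns.
 * Returns 0 on success, -1 on architectures this sketch does not cover. */
int set_syscall_return(void *ucontext_arg, long ret)
{
#if defined(__x86_64__)
    ucontext_t *uc = ucontext_arg;

    uc->uc_mcontext.gregs[SUD_RAX_INDEX] = ret;
    return 0;
#else
    (void)ucontext_arg;
    (void)ret;
    return -1;
#endif
}

/* Hypothetical SIGSYS handler, meant to be installed with SA_SIGINFO.
 * It makes every intercepted call appear to fail with ENOSYS, using
 * the raw kernel convention of returning a negated errno value. */
void sigsys_handler(int sig, siginfo_t *info, void *ucontext_arg)
{
    (void)sig;
    (void)info;
    set_syscall_return(ucontext_arg, -ENOSYS);
}
```

A handler that really emulates the call would presumably look at
si_syscall to decide what to do, and pass the emulated result to the
helper instead of -ENOSYS.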