On Fri, Dec 17, 2021 at 11:57:52AM +0100, Nicolas Saenz Julienne wrote:
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> The entry/exit handling for exceptions, interrupts, syscalls and KVM is
> not really documented except for some comments.
>
> Fill the gaps.
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Co-developed-by: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
> Reviewed-by: Mark Rutland <mark.rutland@xxxxxxx>

Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

> ----
>
> Changes since v2:
> - No big content changes, just style corrections, so it should be
>   pretty clean at this stage. In the light of this, I kept Mark's
>   Reviewed-by.
> - Paul's style and paragraph re-writes
> - Randy's style comments
> - Add links to transition type sections
>
> Documentation/core-api/entry.rst | 261 +++++++++++++++++++++++++++++++
> Documentation/core-api/index.rst |   8 +
> 2 files changed, 269 insertions(+)
> create mode 100644 Documentation/core-api/entry.rst
>
> diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst
> new file mode 100644
> index 000000000000..3f80537f2826
> --- /dev/null
> +++ b/Documentation/core-api/entry.rst
> @@ -0,0 +1,261 @@
> +Entry/exit handling for exceptions, interrupts, syscalls and KVM
> +================================================================
> +
> +All transitions between execution domains require state updates which are
> +subject to strict ordering constraints. State updates are required for the
> +following:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Preemption counter
> +  * Tracing
> +  * Time accounting
> +
> +The update order depends on the transition type and is explained below in
> +the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
> +exceptions`_, `NMI and NMI-like exceptions`_.
> +
> +Non-instrumentable code - noinstr
> +---------------------------------
> +
> +Most instrumentation facilities depend on RCU, so instrumentation is prohibited
> +for entry code before RCU starts watching and exit code after RCU stops
> +watching. In addition, many architectures must save and restore register state,
> +which means that (for example) a breakpoint in the breakpoint entry code would
> +overwrite the debug registers of the initial breakpoint.
> +
> +Such code must be marked with the 'noinstr' attribute, placing that code into a
> +special section inaccessible to instrumentation and debug facilities. Some
> +functions are partially instrumentable, which is handled by marking them noinstr
> +and using instrumentation_begin() and instrumentation_end() to flag the
> +instrumentable ranges of code:
> +
> +.. code-block:: c
> +
> +  noinstr void entry(void)
> +  {
> +        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
> +        ...
> +
> +        instrumentation_begin();
> +        handle_context();   // <-- instrumentable code
> +        instrumentation_end();
> +
> +        ...
> +        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
> +  }
> +
> +This allows verification of the 'noinstr' restrictions via objtool on
> +supported architectures.
> +
> +Invoking non-instrumentable functions from instrumentable context has no
> +restrictions and is useful to protect e.g. state switching which would
> +cause malfunction if instrumented.
> +
> +All non-instrumentable entry/exit code sections before and after the RCU
> +state transitions must run with interrupts disabled.
> +
> +Syscalls
> +--------
> +
> +Syscall-entry code starts in assembly code and calls out into low-level C code
> +after establishing low-level architecture-specific state and stack frames. This
> +low-level C code must not be instrumented. A typical syscall handling function
> +invoked from low-level assembly code looks like this:
> +
> +..
code-block:: c
> +
> +  noinstr void syscall(struct pt_regs *regs, int nr)
> +  {
> +        arch_syscall_enter(regs);
> +        nr = syscall_enter_from_user_mode(regs, nr);
> +
> +        instrumentation_begin();
> +        if (!invoke_syscall(regs, nr) && nr != -1)
> +                result_reg(regs) = __sys_ni_syscall(regs);
> +        instrumentation_end();
> +
> +        syscall_exit_to_user_mode(regs);
> +  }
> +
> +syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
> +establishes state in the following order:
> +
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +and then invokes the various entry work functions like ptrace, seccomp, audit,
> +syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
> +function can be invoked. The instrumentable code section then ends, after which
> +syscall_exit_to_user_mode() is invoked.
> +
> +syscall_exit_to_user_mode() handles all work which needs to be done before
> +returning to user space like tracing, audit, signals, task work etc. After
> +that it invokes exit_to_user_mode() which again handles the state
> +transition in the reverse order:
> +
> +  * Tracing
> +  * RCU / Context tracking
> +  * Lockdep
> +
> +syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
> +available as fine-grained subfunctions in cases where the architecture code
> +has to do extra work between the various steps. In such cases it has to
> +ensure that enter_from_user_mode() is called first on entry and
> +exit_to_user_mode() is called last on exit.
> +
> +
> +KVM
> +---
> +
> +Entering or exiting guest mode is very similar to syscalls. From the host
> +kernel point of view the CPU goes off into user space when entering the
> +guest and returns to the kernel on exit.
> +
> +kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
> +and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
> +The state operations have the same ordering.
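
The mirrored entry/exit ordering quoted above for syscalls (and, per the
text, for the KVM guest transitions) can be illustrated with a small
userspace model. This is only a sketch and not kernel code: the toy_*
names, the enum and the log structure are all hypothetical and exist
purely to record the order in which the three pieces of state are touched.

```c
#include <stddef.h>

/* Hypothetical labels for the state touched on a transition. */
enum toy_state { TOY_LOCKDEP, TOY_RCU, TOY_TRACING };

#define TOY_LOG_MAX 16

/* Records the order in which state updates happened. */
struct toy_log {
	enum toy_state step[TOY_LOG_MAX];
	size_t len;
};

static void toy_record(struct toy_log *log, enum toy_state s)
{
	if (log->len < TOY_LOG_MAX)
		log->step[log->len++] = s;
}

/* Entry: lockdep, then RCU / context tracking, then tracing. */
static void toy_enter_from_user_mode(struct toy_log *log)
{
	toy_record(log, TOY_LOCKDEP);
	toy_record(log, TOY_RCU);
	toy_record(log, TOY_TRACING);
}

/* Exit: the same steps in reverse order. */
static void toy_exit_to_user_mode(struct toy_log *log)
{
	toy_record(log, TOY_TRACING);
	toy_record(log, TOY_RCU);
	toy_record(log, TOY_LOCKDEP);
}

/* Returns 1 if exit_log is exactly entry_log reversed, 0 otherwise. */
static int toy_is_mirror(const struct toy_log *entry_log,
			 const struct toy_log *exit_log)
{
	size_t i;

	if (entry_log->len != exit_log->len)
		return 0;
	for (i = 0; i < entry_log->len; i++)
		if (entry_log->step[i] != exit_log->step[exit_log->len - 1 - i])
			return 0;
	return 1;
}
```

Filling one log with toy_enter_from_user_mode() and another with
toy_exit_to_user_mode() and passing both to toy_is_mirror() checks the
"reverse order" property. In this model a guest entry would use the
exit-side ordering and a guest exit the entry-side ordering, matching the
mapping described for kvm_guest_enter_irqoff() and kvm_guest_exit_irqoff().
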
> +
> +Task work handling is done separately for guest at the boundary of the
> +vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
> +the work handled on return to user space.
> +
> +Interrupts and regular exceptions
> +---------------------------------
> +
> +Interrupt entry and exit handling is slightly more complex than syscalls
> +and KVM transitions.
> +
> +If an interrupt is raised while the CPU executes in user space, the entry
> +and exit handling is exactly the same as for syscalls.
> +
> +If the interrupt is raised while the CPU executes in kernel space the entry and
> +exit handling is slightly different. RCU state is only updated when the
> +interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
> +already be watching. Lockdep and tracing have to be updated unconditionally.
> +
> +irqentry_enter() and irqentry_exit() provide the implementation for this.
> +
> +The architecture-specific part looks similar to syscall handling:
> +
> +.. code-block:: c
> +
> +  noinstr void interrupt(struct pt_regs *regs, int nr)
> +  {
> +        arch_interrupt_enter(regs);
> +        state = irqentry_enter(regs);
> +
> +        instrumentation_begin();
> +
> +        irq_enter_rcu();
> +        invoke_irq_handler(regs, nr);
> +        irq_exit_rcu();
> +
> +        instrumentation_end();
> +
> +        irqentry_exit(regs, state);
> +  }
> +
> +Note that the invocation of the actual interrupt handler is within an
> +irq_enter_rcu() and irq_exit_rcu() pair.
> +
> +irq_enter_rcu() updates the preemption count which makes in_hardirq()
> +return true, handles NOHZ tick state and interrupt time accounting. This
> +means that up to the point where irq_enter_rcu() is invoked in_hardirq()
> +returns false.
> +
> +irq_exit_rcu() handles interrupt time accounting, undoes the preemption
> +count update and eventually handles soft interrupts and NOHZ tick state.
> +
> +In theory, the preemption count could be updated in irqentry_enter().
In
> +practice, deferring this update to irq_enter_rcu() allows the preemption-count
> +code to be traced, while also maintaining symmetry with irq_exit_rcu() and
> +irqentry_exit(), which are described in the next paragraph. The only downside
> +is that the early entry code up to irq_enter_rcu() must be aware that the
> +preemption count has not yet been updated with the HARDIRQ_OFFSET state.
> +
> +Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
> +before it handles soft interrupts, whose handlers must run in BH context rather
> +than irq-disabled context. In addition, irqentry_exit() might schedule, which
> +also requires that HARDIRQ_OFFSET has been removed from the preemption count.
> +
> +NMI and NMI-like exceptions
> +---------------------------
> +
> +NMIs and NMI-like exceptions (machine checks, double faults, debug
> +interrupts, etc.) can hit any context and must be extra careful with
> +the state.
> +
> +State changes for debug exceptions and machine-check exceptions depend on
> +whether these exceptions happened in user-space (breakpoints or watchpoints) or
> +in kernel mode (code patching). From user-space, they are treated like
> +interrupts, while from kernel mode they are treated like NMIs.
> +
> +NMIs and other NMI-like exceptions handle state transitions without
> +distinguishing between user-mode and kernel-mode origin.
> +
> +The state update on entry is handled in irqentry_nmi_enter() which updates
> +state in the following order:
> +
> +  * Preemption counter
> +  * Lockdep
> +  * RCU / Context tracking
> +  * Tracing
> +
> +The exit counterpart irqentry_nmi_exit() does the reverse operation in the
> +reverse order.
> +
> +Note that the update of the preemption counter has to be the first
> +operation on enter and the last operation on exit. The reason is that both
> +lockdep and RCU rely on in_nmi() returning true in this case. The
> +preemption count modification in the NMI entry/exit case must not be
> +traced.
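
To make the "preemption counter first on enter, last on exit" rule quoted
above concrete, here is a rough userspace model. It is a sketch under
stated assumptions, not the kernel implementation: struct toy_cpu and all
toy_* functions are hypothetical, and the "violations" counter merely
counts lockdep or RCU updates that ran while the toy in_nmi() check was
false.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; none of these names exist in the kernel. */
struct toy_cpu {
	unsigned int nmi_count;		/* stand-in for the NMI part of the preemption counter */
	bool lockdep;
	bool rcu;
	bool tracing;
	unsigned int violations;	/* lockdep/RCU updates seen while not "in NMI" */
};

static bool toy_in_nmi(const struct toy_cpu *cpu)
{
	return cpu->nmi_count > 0;
}

/* Lockdep and RCU updates go through here so the in_nmi rule is checked. */
static void toy_update(struct toy_cpu *cpu, bool *flag, bool value)
{
	if (!toy_in_nmi(cpu))
		cpu->violations++;
	*flag = value;
}

/* Entry order: preemption counter, lockdep, RCU, tracing. */
static void toy_nmi_enter(struct toy_cpu *cpu)
{
	cpu->nmi_count++;
	toy_update(cpu, &cpu->lockdep, true);
	toy_update(cpu, &cpu->rcu, true);
	cpu->tracing = true;
}

/* Exit order: tracing, RCU, lockdep, preemption counter. */
static void toy_nmi_exit(struct toy_cpu *cpu)
{
	cpu->tracing = false;
	toy_update(cpu, &cpu->rcu, false);
	toy_update(cpu, &cpu->lockdep, false);
	cpu->nmi_count--;
}

/* Deliberately wrong ordering: lockdep is touched before the counter. */
static void toy_nmi_enter_misordered(struct toy_cpu *cpu)
{
	toy_update(cpu, &cpu->lockdep, true);
	cpu->nmi_count++;
	toy_update(cpu, &cpu->rcu, true);
	cpu->tracing = true;
}
```

With a zero-initialized struct toy_cpu, a toy_nmi_enter()/toy_nmi_exit()
pair leaves violations at zero, while toy_nmi_enter_misordered() records
one violation because its lockdep update runs before the counter is
raised.
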
> +
> +Architecture-specific code looks like this:
> +
> +.. code-block:: c
> +
> +  noinstr void nmi(struct pt_regs *regs)
> +  {
> +        arch_nmi_enter(regs);
> +        state = irqentry_nmi_enter(regs);
> +
> +        instrumentation_begin();
> +        nmi_handler(regs);
> +        instrumentation_end();
> +
> +        irqentry_nmi_exit(regs, state);
> +  }
> +
> +and for e.g. a debug exception it can look like this:
> +
> +.. code-block:: c
> +
> +  noinstr void debug(struct pt_regs *regs)
> +  {
> +        arch_nmi_enter(regs);
> +
> +        debug_regs = save_debug_regs();
> +
> +        if (user_mode(regs)) {
> +                state = irqentry_enter(regs);
> +
> +                instrumentation_begin();
> +                user_mode_debug_handler(regs, debug_regs);
> +                instrumentation_end();
> +
> +                irqentry_exit(regs, state);
> +        } else {
> +                state = irqentry_nmi_enter(regs);
> +
> +                instrumentation_begin();
> +                kernel_mode_debug_handler(regs, debug_regs);
> +                instrumentation_end();
> +
> +                irqentry_nmi_exit(regs, state);
> +        }
> +  }
> +
> +There is no combined irqentry_nmi_if_kernel() function available as the
> +above cannot be handled in an exception-agnostic way.
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 5de2c7a4b1b3..972d46a5ddf6 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -44,6 +44,14 @@ Library functionality that is used throughout the kernel.
>     timekeeping
>     errseq
>
> +Low level entry and exit
> +========================
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   entry
> +
>  Concurrency primitives
>  ======================
>
> --
> 2.33.1
>