----- On Mar 4, 2019, at 1:02 PM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote:
>
>> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>> patch which adds rseq documentation to the man-pages project ? ]
>> Hi Mathieu
>>
>> Sorry for the long delay. I've merged this page into a private
>> branch and have done quite a lot of editing. I have many
>> questions :-).
>
> No worries, thanks for looking into it!
>
>>
>> In the first instance, I think it is probably best to have
>> a free-form text discussion rather than firing patches
>> back and forward. Could you take a look at the questions below
>> and respond?
>
> Sure,

Hi Michael,

Gentle bump of this email in your inbox, since I suspect you might have
forgotten about it altogether. A year ago you had a heavily edited
man page for rseq(2). I provided the requested feedback, but I have not
heard back from you since then.

We are now close to integrating rseq into glibc, and having an official
man page would be useful.

Thanks,

Mathieu

>
>>
>> Thanks,
>>
>> Michael
>>
>>
>> RSEQ(2)                  Linux Programmer's Manual                  RSEQ(2)
>>
>> NAME
>>        rseq - Restartable sequences and CPU number cache
>>
>> SYNOPSIS
>>        #include <linux/rseq.h>
>>
>>        int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>>
>> DESCRIPTION
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Imagine you are someone who is pretty new to this    │
>>        │idea... What is notably lacking from this page is    │
>>        │an overview explaining:                              │
>>        │                                                     │
>>        │  * What a restartable sequence actually is.         │
>>        │                                                     │
>>        │  * An outline of the steps to perform when using    │
>>        │    restartable sequences / rseq(2).                 │
>>        │                                                     │
>>        │I.e., something along the lines of Jon Corbet's      │
>>        │https://lwn.net/Articles/697979/. Can you come up    │
>>        │with something? (Part of it might be at the start of │
>>        │this page, and the rest in NOTES; it need not be all │
>>        │in one place.)                                       │
>>        └─────────────────────────────────────────────────────┘
>
> We recently published a blog post about rseq, which might contain just the
> right level of information we are looking for here:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Could something along the following lines work ?
>
> "A restartable sequence is a sequence of instructions guaranteed to be
> executed atomically with respect to other threads and signal handlers on the
> current CPU. If its execution does not complete atomically, the kernel changes
> the execution flow by jumping to an abort handler defined by user-space for
> that restartable sequence.
>
> Using restartable sequences requires registering a __rseq_abi thread-local
> storage data structure (struct rseq) through the rseq(2) system call. Only
> one __rseq_abi can be registered per thread, so user-space libraries and
> applications must follow a user-space ABI defining how to share this
> resource. The ABI defining how to share this resource between applications
> and libraries is defined by the C library.
>
> The __rseq_abi contains a rseq_cs field which points to the currently executing
> critical section. For each thread, a single rseq critical section can run at any
> given point. Each critical section needs to be implemented in assembly."
>
>
>>        The rseq() ABI accelerates user-space operations on per-CPU data by
>>        defining a shared data structure ABI between each user-space thread and
>>        the kernel.
>>
>>        It allows user-space to perform update operations on per-CPU data with‐
>>        out requiring heavy-weight atomic operations.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the following para: "a hardware execution con‐    │
>>        │text"? What is the contrast being drawn here? It     │
>>        │would be good to state it more explicitly.           │
>>        └─────────────────────────────────────────────────────┘
>
> Here I'm trying to clarify what we mean by "CPU" in this document. We define
> a CPU as having its own number returned by sched_getcpu(), which I think is
> sometimes referred to as "logical cpu". This is the current hyperthread on
> the current core, on the current "physical CPU", in the current socket.
>
>
>>        The term CPU used in this documentation refers to a hardware execution
>>        context.
>>
>>        Restartable sequences are atomic with respect to preemption (making it
>>        atomic with respect to other threads running on the same CPU), as well
>>        as signal delivery (user-space execution contexts nested over the same
>>        thread). They either complete atomically with respect to preemption on
>>        the current CPU and signal delivery, or they are aborted.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the preceding sentence, we need a definition of   │
>>        │"current CPU".                                       │
>>        └─────────────────────────────────────────────────────┘
>
> Not sure how to word it. If a thread or signal handler execution context can
> possibly run and issue, for instance, "sched_getcpu()" between the beginning
> and the end of the critical section and get the same logical CPU number as the
> current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
> just one way to get the CPU number, considering that we can also read it
> from the __rseq_abi cpu_id and cpu_id_start fields.
>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the following, does "It is" mean "Restartable     │
>>        │sequences are"?                                      │
>>        └─────────────────────────────────────────────────────┘
>>        It is suited for update operations on per-CPU data.
>
> Yes.
> > >> >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │In the following, does "It is" means "Restartable │ >> │sequences are"? │ >> └─────────────────────────────────────────────────────┘ > > "Restartable sequences can be..." > >> It can be used on data structures shared between threads within a >> process, and on data structures shared between threads across different >> processes. >> >> Some examples of operations that can be accelerated or improved by this >> ABI: >> >> · Memory allocator per-CPU free-lists >> >> · Querying the current CPU number >> >> · Incrementing per-CPU counters >> >> · Modifying data protected by per-CPU spinlocks >> >> · Inserting/removing elements in per-CPU linked-lists >> >> · Writing/reading per-CPU ring buffers content >> >> · Accurately reading performance monitoring unit counters with respect >> to thread migration >> >> Restartable sequences must not perform system calls. Doing so may >> result in termination of the process by a segmentation fault. >> >> The rseq argument is a pointer to the thread-local rseq structure to be >> shared between kernel and user-space. The layout of this structure is >> shown below. >> >> The rseq_len argument is the size of the struct rseq to register. >> >> The flags argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for >> unregistration. >> >> The sig argument is the 32-bit signature to be expected before the >> abort handler code. >> >> The rseq structure >> The struct rseq is aligned on a 32-byte boundary. This structure is >> extensible. Its size is passed as parameter to the rseq() system call. >> >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │Below, I added the structure definition (in abbrevi‐ │ >> │ated form). Is there any reason not to do this? 
│ >> └─────────────────────────────────────────────────────┘ > > It seems appropriate. > >> >> struct rseq { >> __u32 cpu_id_start; >> __u32 cpu_id; >> union { >> __u64 ptr64; >> #ifdef __LP64__ >> __u64 ptr; >> #else >> .... >> #endif >> } rseq_cs; >> __u32 flags; >> } __attribute__((aligned(4 * sizeof(__u64)))); >> >> ┌─────────────────────────────────────────────────────┐ >> │FIXME │ >> ├─────────────────────────────────────────────────────┤ >> │In the text below, I think it would be helpful to │ >> │explicitly note which of these fields are set by the │ >> │kernel (on return from the reseq() call) and which │ >> │are set by the caller (before calling rseq()). Is │ >> │the following correct: │ >> │ │ >> │ cpu_id_start - initialized by caller to possible │ >> │ CPU number (e.g., 0), updated by kernel │ >> │ on return │ > > "initialized by caller to possible CPU number (e.g., 0), updated > by the kernel on return, and updated by the kernel on return after > thread migration to a different CPU" > >> │ │ >> │ cpu_id - initialized to -1 by caller, │ >> │ updated by kernel on return │ > > "initialized to -1 by caller, updated by the kernel on return, and > updated by the kernel on return after thread migration to a different > CPU" > >> │ │ >> │ rseq_cs - initialized by caller, either to NULL │ >> │ or a pointer to an 'rseq_cs' structure │ >> │ that is initialized by the caller │ > > "initialized by caller to NULL, then, after returning from successful > registration, updated to a pointer to an "rseq_cs" structure by user-space. > Set to NULL by the kernel when it restarts a rseq critical section, > when it preempts or deliver a signal outside of the range targeted by the > rseq_cs. Set to NULL by user-space before reclaiming memory that > contains the targeted struct rseq_cs." 
>
>
>>        │                                                     │
>>        │  flags - initialized by caller, used by kernel      │
>>        └─────────────────────────────────────────────────────┘
>>
>>        The structure fields are as follows:
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the following paragraph, and in later places, I   │
>>        │changed "current thread" to "calling thread". Okay?  │
>>        └─────────────────────────────────────────────────────┘
>
> Yes.
>
>>
>>        cpu_id_start
>>               Optimistic cache of the CPU number on which the calling thread
>>               is running. The value in this field is guaranteed to always be
>>               a possible CPU number, even when rseq is not initialized. The
>>               value it contains should always be confirmed by reading the
>>               cpu_id field.
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │What does the last sentence mean?                    │
>>               └─────────────────────────────────────────────────────┘
>
> It means the caller thread can always use __rseq_abi.cpu_id_start to index an
> array of per-cpu data and this won't cause an out-of-bound access on load, but
> it does not mean it really contains the current CPU number. For instance, if
> rseq registration failed, it will contain "0".
>
> Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the cpu_id
> field should be used to check the cpu_id_start value, so the case where rseq
> is not registered is caught. In that case, cpu_id_start=0, but cpu_id=-1 or -2,
> which differ, and therefore the critical section needs to jump to the abort
> handler.
>
>>
>>               This field is an optimistic cache in the sense that it is always
>>               guaranteed to hold a valid CPU number in the range [0..(nr_pos‐
>>               sible_cpus - 1)]. It can therefore be loaded by user-space and
>>               used as an offset in per-CPU data structures without having to
>>               check whether its value is within the valid bounds compared to
>>               the number of possible CPUs in the system.
>>
>>               For user-space applications executed on a kernel without rseq
>>               support, the cpu_id_start field stays initialized at 0, which is
>>               indeed a valid CPU number. It is therefore valid to use it as
>>               an offset in per-CPU data structures, and only validate whether
>>               it's actually the current CPU number by comparing it with the
>>               cpu_id field within the rseq critical section.
>>
>>               If the kernel does not provide rseq support, that cpu_id field
>>               stays initialized at -1, so the comparison always fails, as
>>               intended. It is then up to user-space to use a fall-back mecha‐
>>               nism, considering that rseq is not available.
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │The last sentence is rather difficult to grok. Can   │
>>               │we say some more here?                               │
>>               └─────────────────────────────────────────────────────┘
>
> Perhaps we could use the explanation I've written above in my reply ?
>
>>
>>        cpu_id Cache of the CPU number on which the calling thread is running.
>>               -1 if uninitialized.
>>
>>        rseq_cs
>>               The rseq_cs field is a pointer to a struct rseq_cs (described
>>               below). It is NULL when no rseq assembly block critical section
>>               is active for the calling thread. Setting it to point to a
>>               critical section descriptor (struct rseq_cs) marks the beginning
>>               of the critical section.
>>
>>        flags  Flags indicating the restart behavior for the calling thread.
>>               This is mainly used for debugging purposes. Can be either:
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │Each of the above values needs an explanation.       │
>>               │                                                     │
>>               │Is it correct that only one of the values may be     │
>>               │specified in 'flags'? I ask because in the 'rseq_cs' │
>>               │structure below, the 'flags' field is a bit mask     │
>>               │where any combination of these flags may be ORed     │
>>               │together.                                            │
>>               │                                                     │
>>               └─────────────────────────────────────────────────────┘
>
> Those are also masks and can be ORed.
>
>
>>
>>    The rseq_cs structure
>>        The struct rseq_cs is aligned on a 32-byte boundary and has a fixed
>>        size of 32 bytes.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Below, I added the structure definition (in abbrevi‐ │
>>        │ated form). Is there any reason not to do this?      │
>>        └─────────────────────────────────────────────────────┘
>
> It's fine.
>
>>
>>            struct rseq_cs {
>>                __u32 version;
>>                __u32 flags;
>>                __u64 start_ip;
>>                __u64 post_commit_offset;
>>                __u64 abort_ip;
>>            } __attribute__((aligned(4 * sizeof(__u64))));
>>
>>        The structure fields are as follows:
>>
>>        version
>>               Version of this structure.
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │What does 'version' need to be initialized to?       │
>>               └─────────────────────────────────────────────────────┘
>
> Currently version needs to be 0. Eventually, if we implement support for new
> flags to rseq(), we could add feature flags which register support for newer
> versions of struct rseq_cs.
>
>>
>>        flags  Flags indicating the restart behavior of this structure. Can be
>>               a combination of:
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>>               RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │Each of the above values needs an explanation.       │
>>               └─────────────────────────────────────────────────────┘
>>
>>        start_ip
>>               Instruction pointer address of the first instruction of the
>>               sequence of consecutive assembly instructions.
>>
>>        post_commit_offset
>>               Offset (from start_ip address) of the address after the last
>>               instruction of the sequence of consecutive assembly instruc‐
>>               tions.
>>
>>        abort_ip
>>               Instruction pointer address where to move the execution flow in
>>               case of abort of the sequence of consecutive assembly instruc‐
>>               tions.
>>
>> NOTES
>>        A single library per process should keep the rseq structure in a
>>        thread-local storage variable. The cpu_id field should be initialized
>>        to -1, and the cpu_id_start field should be initialized to a possible
>>        CPU value (typically 0).
>
> The part above is not quite right. All applications/libraries wishing to
> register rseq must follow the ABI specified by the C library. It can be
> defined within more than a single application/library, but in the end only
> one symbol will be chosen for the process's global symbol table.
>
>>
>>        Each thread is responsible for registering and unregistering its rseq
>>        structure. No more than one rseq structure address can be registered
>>        per thread at a given time.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the following paragraph, what is the difference   │
>>        │between "freed" and "reclaim"? I'm supposing they    │
>>        │mean the same thing, but it's not clear. And if they │
>>        │do mean the same thing, then the first two sentences │
>>        │appear to contain contradictory information.         │
>>        └─────────────────────────────────────────────────────┘
>
> They mean the same thing, and they are subtly not contradictory.
>
> The first states that memory of a _registered_ rseq object must not
> be freed before the thread exits.
>
> The second states that memory of a rseq object must not be freed before
> it is unregistered or the thread exits.
>
> Do you have an alternative wording in mind to make this clearer ?
>
>>
>>        Memory of a registered rseq object must not be freed before the thread
>>        exits. Reclaim of rseq object's memory must only be done after either
>>        an explicit rseq unregistration is performed or after the thread exits.
>>        Keep in mind that the implementation of the Thread-Local Storage (C
>>        language __thread) lifetime does not guarantee existence of the TLS
>>        area up until the thread exits.
>>
>>        In a typical usage scenario, the thread registering the rseq structure
>>        will be performing loads and stores from/to that structure. It is how‐
>>        ever also allowed to read that structure from other threads. The rseq
>>        field updates performed by the kernel provide relaxed atomicity seman‐
>>        tics, which guarantee that other threads performing relaxed atomic
>>        reads of the CPU number cache will always observe a consistent value.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │In the preceding paragraph, can we reasonably add    │
>>        │some words to explain "relaxed atomicity semantics"  │
>>        │and "relaxed atomic reads"?                          │
>>        └─────────────────────────────────────────────────────┘
>
> Not sure how to word this exactly, but here it means the stores and loads need
> to be done atomically, but neither require nor provide any ordering guarantees
> with respect to other loads/stores (no memory barriers).
>
>>
>> RETURN VALUE
>>        A return value of 0 indicates success. On error, -1 is returned, and
>>        errno is set appropriately.
>>
>> ERRORS
>>        EBUSY  Restartable sequence is already registered for this thread.
>>
>>        EFAULT rseq is an invalid address.
>>
>>        EINVAL Either flags contains an invalid value, or rseq contains an
>>               address which is not appropriately aligned, or rseq_len contains
>>               a size that does not match the size received on registration.
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │The last case "rseq_len contains a size that does    │
>>               │not match the size received on registration" can     │
>>               │occur only on RSEQ_FLAG_UNREGISTER, right?           │
>>               └─────────────────────────────────────────────────────┘
>>
>>        ENOSYS The rseq() system call is not implemented by this kernel.
>>
>>        EPERM  The sig argument on unregistration does not match the signature
>>               received on registration.
>>
>> VERSIONS
>>        The rseq() system call was added in Linux 4.18.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │What is the current state of library support?        │
>>        └─────────────────────────────────────────────────────┘
>
> After going through a few RFC rounds, it was posted as non-RFC a
> few weeks ago. It is pending review from glibc maintainers. I currently
> aim for inclusion of the rseq TLS registration by glibc for glibc 2.30:
>
> https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html
>
> Note that the C library will define a user-space ABI which states how
> applications/libraries wishing to register the rseq TLS need to behave so they
> are compatible with the C library when it gets updated to a new version
> providing rseq registration support. It seems like an important point to
> document, perhaps even here in the rseq(2) man page.
>
>
>>
>> CONFORMING TO
>>        rseq() is Linux-specific.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Is there any example code that can reasonably be     │
>>        │included in this manual page? Or some example code   │
>>        │that can be referred to?                             │
>>        └─────────────────────────────────────────────────────┘
>>
>
> The per-cpu counter example we have here seems compact enough:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Thanks,
>
> Mathieu
>
>
>> SEE ALSO
>>        sched_getcpu(3), membarrier(2)
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com