Re: [PATCH man-pages] Add rseq manpage

----- On Mar 4, 2019, at 1:02 PM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote:
> 
>> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>>   patch which adds rseq documentation to the man-pages project ? ]
>> Hi Mathieu
>> 
>> Sorry for the long delay. I've merged this page into a private
>> branch and have done quite a lot of editing. I have many
>> questions :-).
> 
> No worries, thanks for looking into it!
> 
>> 
>> In the first instance, I think it is probably best to have
>> a free-form text discussion rather than firing patches
>> back and forward. Could you take a look at the questions below
>> and respond?
> 
> Sure,

Hi Michael,

Gentle bump of this email in your inbox, since I suspect you might have
forgotten about it altogether. A year ago, you had a heavily edited
man page for rseq(2). I provided the requested feedback, but I have not
heard back from you since then.

We are now close to integrating rseq into glibc, and having an official
man page would be useful.

Thanks,

Mathieu


> 
>> 
>> Thanks,
>> 
>> Michael
>> 
>> 
>> RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)
>> 
>> NAME
>>       rseq - Restartable sequences and CPU number cache
>> 
>> SYNOPSIS
>>       #include <linux/rseq.h>
>> 
>>       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>> 
>> DESCRIPTION
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Imagine  you  are  someone who is pretty new to this │
>>       │idea...  What is notably lacking from this  page  is │
>>       │an overview explaining:                              │
>>       │                                                     │
>>       │    * What a restartable sequence actually is.       │
>>       │                                                     │
>>       │    * An outline of the steps to perform when using  │
>>       │    restartable sequences / rseq(2).                 │
>>       │                                                     │
>>       │I.e.,  something  along  the  lines  of Jon Corbet's │
>>       │https://lwn.net/Articles/697979/.  Can you  come  up │
>>       │with something? (Part of it might be at the start of │
>>       │this page, and the rest in NOTES; it need not be all │
>>       │in one place.)                                       │
>>       └─────────────────────────────────────────────────────┘
> 
> We recently published a blog post about rseq, which might contain just the
> right level of information we are looking for here:
> 
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
> 
> Could something along the following lines work ?
> 
> "A restartable sequence is a sequence of instructions guaranteed to be
> executed atomically with respect to other threads and signal handlers on the
> current CPU. If its execution does not complete atomically, the kernel changes
> the execution flow by jumping to an abort handler defined by user-space for
> that restartable sequence.
> 
> Using restartable sequences requires registering a __rseq_abi thread-local
> storage data structure (struct rseq) through the rseq(2) system call. Only
> one __rseq_abi can be registered per thread, so user-space libraries and
> applications must follow a user-space ABI defining how to share this
> resource. That ABI is defined by the C library.
> 
> The __rseq_abi contains a rseq_cs field which points to the currently executing
> critical section. For each thread, a single rseq critical section can run at any
> given point. Each critical section needs to be implemented in assembly."
> 
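For illustration, the registration step described above might look like the
following sketch. It is not taken from the patch: the struct is a local mirror
of the layout quoted later in this thread, rseq_area and RSEQ_SIG_SKETCH are
made-up names, and the signature value is an arbitrary assumption. On a system
whose C library already registers rseq for each thread, the call is expected
to fail rather than succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the struct rseq layout quoted in this thread.
 * Assumption: real code would use the definition from <linux/rseq.h>. */
struct rseq_sketch {
    uint32_t cpu_id_start;  /* initialized to a possible CPU number */
    uint32_t cpu_id;        /* initialized to -1 (not registered) */
    uint64_t rseq_cs;       /* NULL: no critical section active */
    uint32_t flags;
} __attribute__((aligned(32)));

static_assert(sizeof(struct rseq_sketch) == 32, "unexpected struct size");

/* Hypothetical signature; it has to match the 32-bit value placed
 * before each abort handler. */
#define RSEQ_SIG_SKETCH 0x53053053U

/* One area per thread, kept in thread-local storage. */
static __thread struct rseq_sketch rseq_area = {
    .cpu_id_start = 0,
    .cpu_id = (uint32_t)-1,
};

/* Returns 0 on success, or a negative errno value on failure. */
int rseq_register_sketch(void)
{
#ifdef __NR_rseq
    long ret = syscall(__NR_rseq, &rseq_area,
                       (uint32_t)sizeof(rseq_area), 0, RSEQ_SIG_SKETCH);
    return ret == 0 ? 0 : -errno;
#else
    return -ENOSYS;  /* headers predate the rseq system call */
#endif
}
```

A caller would treat any negative return as "rseq not available through this
area" and use a fall-back mechanism.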
> 
>>       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
>>       defining a shared data structure ABI between each user-space thread and
>>       the kernel.
>> 
>>       It allows user-space to perform update operations on per-CPU data with‐
>>       out requiring heavy-weight atomic operations.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In the following para: "a  hardware  execution  con‐ │
>>       │text"?   What  is  the contrast being drawn here? It │
>>       │would be good to state it more explicitly.           │
>>       └─────────────────────────────────────────────────────┘
> 
> Here I'm trying to clarify what we mean by "CPU" in this document. We define
> a CPU as having its own number returned by sched_getcpu(), which I think is
> sometimes referred to as "logical cpu". This is the current hyperthread on
> the current core, on the current "physical CPU", in the current socket.
> 
> 
>>       The term CPU used in this documentation refers to a hardware  execution
>>       context.
>> 
>>       Restartable sequences are atomic with respect to preemption (making them
>>       atomic with respect to other threads running on the same CPU), as  well
>>       as  signal delivery (user-space execution contexts nested over the same
>>       thread).  They either complete atomically with respect to preemption on
>>       the current CPU and signal delivery, or they are aborted.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  preceding sentence, we need a definition of │
>>       │"current CPU".                                       │
>>       └─────────────────────────────────────────────────────┘
> 
> Not sure how to word it. If a thread or signal handler execution context can
> possibly run and issue, for instance, "sched_getcpu()" between the beginning
> and the end of the critical section and get the same logical CPU number as the
> current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
> just one way to get the CPU number, considering that we can also read it
> from the __rseq_abi cpu_id and cpu_id_start fields.
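To make the "logical CPU number" notion concrete, here is a small sketch that
samples sched_getcpu() twice. The helper name sample_cpu_twice is made up for
this mail. The two values usually match, but nothing guarantees it: the thread
can migrate between the calls, which is the situation a restartable sequence
detects and aborts on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Stores two consecutive samples of the calling thread's logical CPU
 * number. Returns 0 if both samples are valid, -1 otherwise.
 * The samples may differ if the thread migrated in between. */
int sample_cpu_twice(int *first, int *second)
{
    assert(first && second);
    *first = sched_getcpu();
    *second = sched_getcpu();
    return (*first >= 0 && *second >= 0) ? 0 : -1;
}
```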
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In the following, does "It  is"  mean   "Restartable │
>>       │sequences are"?                                      │
>>       └─────────────────────────────────────────────────────┘
>>       It is suited for update operations on per-CPU data.
> 
> Yes.
> 
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following,  does "It is" mean  "Restartable │
>>       │sequences are"?                                      │
>>       └─────────────────────────────────────────────────────┘
> 
> "Restartable sequences can be..."
> 
>>       It can be used on data  structures  shared  between  threads  within  a
>>       process, and on data structures shared between threads across different
>>       processes.
>> 
>>       Some examples of operations that can be accelerated or improved by this
>>       ABI:
>> 
>>       · Memory allocator per-CPU free-lists
>> 
>>       · Querying the current CPU number
>> 
>>       · Incrementing per-CPU counters
>> 
>>       · Modifying data protected by per-CPU spinlocks
>> 
>>       · Inserting/removing elements in per-CPU linked-lists
>> 
>>       · Writing/reading per-CPU ring buffers content
>> 
>>       · Accurately  reading performance monitoring unit counters with respect
>>         to thread migration
>> 
>>       Restartable sequences must not perform  system  calls.   Doing  so  may
>>       result in termination of the process by a segmentation fault.
>> 
>>       The rseq argument is a pointer to the thread-local rseq structure to be
>>       shared between kernel and user-space.  The layout of this structure  is
>>       shown below.
>> 
>>       The rseq_len argument is the size of the struct rseq to register.
>> 
>>       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>>       unregistration.
>> 
>>       The sig argument is the 32-bit signature  to  be  expected  before  the
>>       abort handler code.
>> 
>>   The rseq structure
>>       The  struct  rseq  is aligned on a 32-byte boundary.  This structure is
>>       extensible.  Its size is passed as parameter to the rseq() system call.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Below, I added the structure definition (in abbrevi‐ │
>>       │ated form).  Is there any reason not to do this?     │
>>       └─────────────────────────────────────────────────────┘
> 
> It seems appropriate.
> 
>> 
>>           struct rseq {
>>               __u32             cpu_id_start;
>>               __u32             cpu_id;
>>               union {
>>                   __u64 ptr64;
>>           #ifdef __LP64__
>>                   __u64 ptr;
>>           #else
>>                   ....
>>           #endif
>>               }                 rseq_cs;
>>               __u32             flags;
>>           } __attribute__((aligned(4 * sizeof(__u64))));
>> 
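As a quick sanity check of the layout above, the following sketch restates the
structure with fixed-width types and verifies its size, field offsets and
alignment at compile time. The names rseq_layout_sketch and
rseq_area_is_aligned are invented here, and the union is flattened to a single
64-bit field, which is an assumption about the in-memory layout rather than a
copy of the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flattened restatement of the quoted layout (union reduced to a u64). */
struct rseq_layout_sketch {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
} __attribute__((aligned(4 * sizeof(uint64_t))));

static_assert(sizeof(struct rseq_layout_sketch) == 32, "size");
static_assert(_Alignof(struct rseq_layout_sketch) == 32, "alignment");
static_assert(offsetof(struct rseq_layout_sketch, rseq_cs) == 8, "rseq_cs");
static_assert(offsetof(struct rseq_layout_sketch, flags) == 16, "flags");

/* Returns nonzero if p satisfies the 32-byte alignment requirement. */
int rseq_area_is_aligned(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```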
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  text  below, I think it would be helpful to │
>>       │explicitly note which of these fields are set by the │
>>       │kernel  (on  return  from the rseq() call) and which │
>>       │are set by the caller (before  calling  rseq()).  Is │
>>       │the following correct:                               │
>>       │                                                     │
>>       │    cpu_id_start - initialized by caller to possible │
>>       │    CPU number (e.g., 0), updated by kernel          │
>>       │    on return                                        │
> 
> "initialized by caller to possible CPU number (e.g., 0), updated
> by the kernel on return, and updated by the kernel on return after
> thread migration to a different CPU"
> 
>>       │                                                     │
>>       │    cpu_id - initialized to -1 by caller,            │
>>       │    updated by kernel on return                      │
> 
> "initialized to -1 by caller, updated by the kernel on return, and
> updated by the kernel on return after thread migration to a different
> CPU"
> 
>>       │                                                     │
>>       │    rseq_cs - initialized by caller, either to NULL  │
>>       │    or a pointer to an 'rseq_cs' structure           │
>>       │    that is initialized by the caller                │
> 
> "initialized by caller to NULL, then, after returning from successful
> registration, updated to a pointer to an "rseq_cs" structure by user-space.
> Set to NULL by the kernel when it restarts a rseq critical section, or
> when it preempts or delivers a signal outside of the range targeted by the
> rseq_cs. Set to NULL by user-space before reclaiming memory that
> contains the targeted struct rseq_cs."
> 
> 
>>       │                                                     │
>>       │    flags - initialized by caller, used by kernel    │
>>       └─────────────────────────────────────────────────────┘
>> 
>>       The structure fields are as follows:
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following paragraph, and in later places, I │
>>       │changed "current thread" to "calling thread". Okay?  │
>>       └─────────────────────────────────────────────────────┘
> 
> Yes.
> 
>> 
>>       cpu_id_start
>>              Optimistic cache of the CPU number on which the  calling  thread
>>              is  running.  The value in this field is guaranteed to always be
>>              a possible CPU number, even when rseq is not  initialized.   The
>>              value  it  contains  should  always  be confirmed by reading the
>>              cpu_id field.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │What does the last sentence mean?                    │
>>              └─────────────────────────────────────────────────────┘
> 
> It means the calling thread can always use __rseq_abi.cpu_id_start to index an
> array of per-cpu data and this won't cause an out-of-bounds access on load, but
> it does not mean it really contains the current CPU number. For instance, if
> rseq registration failed, it will contain "0".
> 
> Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the cpu_id
> field should be compared against the cpu_id_start value, so the case where rseq
> is not registered is caught. In that case, cpu_id_start=0, but cpu_id=-1 or -2,
> which differ, and therefore the critical section needs to jump to the abort
> handler.
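The index-then-validate logic described above can be sketched in plain C. This
only illustrates the comparison; a real critical section has to do the
comparison in assembly inside the rseq-protected range. The struct, the array
size and the function name below are all made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_POSSIBLE_CPUS_SKETCH 4  /* hypothetical; query the system in real code */

/* Only the two CPU number fields matter for this illustration. */
struct rseq_view_sketch {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
};

static long percpu_counter[NR_POSSIBLE_CPUS_SKETCH];

/* Picks the per-CPU slot using cpu_id_start, then validates it against
 * cpu_id. Returns the CPU number if the fields agree, or -1 if they
 * differ (rseq not registered, or the value is stale), in which case
 * the caller should abort and use a fall-back. */
int percpu_slot_or_fallback(const struct rseq_view_sketch *r, long **slot)
{
    uint32_t cpu = r->cpu_id_start;

    /* Always a possible CPU number, so indexing is in bounds. */
    assert(cpu < NR_POSSIBLE_CPUS_SKETCH);
    *slot = &percpu_counter[cpu];

    if (cpu != r->cpu_id)
        return -1;
    return (int)cpu;
}
```

With an unregistered area (cpu_id_start=0, cpu_id=-1) the load of the slot is
still safe, but the comparison fails and the fall-back path is taken.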
> 
>> 
>>              This field is an optimistic cache in the sense that it is always
>>              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
>>              sible_cpus - 1)].  It can therefore be loaded by user-space  and
>>              used  as  an offset in per-CPU data structures without having to
>>              check whether its value is within the valid bounds  compared  to
>>              the number of possible CPUs in the system.
>> 
>>              For  user-space  applications  executed on a kernel without rseq
>>              support, the cpu_id_start field stays initialized at 0, which is
>>              indeed  a  valid CPU number.  It is therefore valid to use it as
>>              an offset in per-CPU data structures, and only validate  whether
>>              it's  actually  the  current CPU number by comparing it with the
>>              cpu_id field within the rseq critical section.
>> 
>>              If the kernel does not provide rseq support, that  cpu_id  field
>>              stays  initialized  at  -1,  so  the comparison always fails, as
>>              intended.  It is then up to user-space to use a fall-back mecha‐
>>              nism, considering that rseq is not available.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │The  last  sentence is rather difficult to grok. Can │
>>              │we say some more here?                               │
>>              └─────────────────────────────────────────────────────┘
> 
> Perhaps we could use the explanation I've written above in my reply ?
> 
>> 
>>       cpu_id Cache of the CPU number on which the calling thread is  running.
>>              -1 if uninitialized.
>> 
>>       rseq_cs
>>              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
>>              below).  It is NULL when no rseq assembly block critical section
>>              is  active  for  the  calling  thread.  Setting it to point to a
>>              critical section descriptor (struct rseq_cs) marks the beginning
>>              of the critical section.
>> 
>>       flags  Flags  indicating  the  restart behavior for the calling thread.
>>              This is mainly used for debugging purposes.  Can be either:
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> 
> Inhibit instruction sequence block restart on preemption for this thread.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> 
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> 
> Inhibit instruction sequence block restart on migration for this thread.
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Each of the above values needs an explanation.       │
>>       │                                                     │
>>       │Is it correct that only one of  the  values  may  be │
>>       │specified in 'flags'? I ask because in the 'rseq_cs' │
>>       │structure below, the 'flags' field  is  a  bit  mask │
>>       │where  any  combination  of  these flags may be ORed │
>>       │together.                                            │
>>       │                                                     │
>>       └─────────────────────────────────────────────────────┘
> 
> Those are also masks and can be ORed.
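Since the flags are masks, the constraint on the signal flag can be expressed
as a simple check. The sketch below uses locally defined constants; their bit
values are what I believe the UAPI header uses, but treat them as an
assumption, and the function name is invented for this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the RSEQ_CS_FLAG_NO_RESTART_ON_* masks
 * (bit positions are an assumption, see the note above). */
enum {
    SKETCH_NO_RESTART_ON_PREEMPT = 1U << 0,
    SKETCH_NO_RESTART_ON_SIGNAL  = 1U << 1,
    SKETCH_NO_RESTART_ON_MIGRATE = 1U << 2,
};

static_assert((SKETCH_NO_RESTART_ON_PREEMPT & SKETCH_NO_RESTART_ON_SIGNAL) == 0,
              "masks must not overlap");

/* Returns true if the ORed combination respects the rule that restart
 * on signal may only be inhibited together with preempt and migrate. */
bool rseq_flags_combination_ok(uint32_t flags)
{
    const uint32_t required = SKETCH_NO_RESTART_ON_PREEMPT |
                              SKETCH_NO_RESTART_ON_MIGRATE;

    if (flags & SKETCH_NO_RESTART_ON_SIGNAL)
        return (flags & required) == required;
    return true;
}
```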
> 
> 
>> 
>>   The rseq_cs structure
>>       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
>>       size of 32 bytes.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Below, I added the structure definition (in abbrevi‐ │
>>       │ated form).  Is there any reason not to do this?     │
>>       └─────────────────────────────────────────────────────┘
> 
> It's fine.
> 
>> 
>>           struct rseq_cs {
>>               __u32   version;
>>               __u32   flags;
>>               __u64   start_ip;
>>               __u64   post_commit_offset;
>>               __u64   abort_ip;
>>           } __attribute__((aligned(4 * sizeof(__u64))));
>> 
>>       The structure fields are as follows:
>> 
>>       version
>>              Version of this structure.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │What does 'version' need to be initialized to?       │
>>              └─────────────────────────────────────────────────────┘
> 
> Currently version needs to be 0. Eventually, if we implement support for new
> flags to rseq(), we could add feature flags which register support for newer
> versions of struct rseq_cs.
> 
>> 
>>       flags  Flags indicating the restart behavior of this structure.  Can be
>>              a combination of:
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> 
> Inhibit instruction sequence block restart on preemption for this thread.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> 
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> 
> Inhibit instruction sequence block restart on migration for this thread.
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Each of the above values needs an explanation.       │
>>       └─────────────────────────────────────────────────────┘
>> 
>>       start_ip
>>              Instruction  pointer  address  of  the  first instruction of the
>>              sequence of consecutive assembly instructions.
>> 
>>       post_commit_offset
>>              Offset (from start_ip address) of the  address  after  the  last
>>              instruction  of  the  sequence  of consecutive assembly instruc‐
>>              tions.
>> 
>>       abort_ip
>>              Instruction pointer address where to move the execution flow  in
>>              case  of  abort of the sequence of consecutive assembly instruc‐
>>              tions.
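To show how the three address fields relate, here is a sketch that fills a
descriptor from a start address, the address just past the last instruction,
and an abort address, and then tests whether an instruction pointer falls
inside the sequence. The struct mirrors the layout quoted above under a
different name; make_rseq_cs and ip_in_critical_section are hypothetical
helpers, and the range test is my reading of the field descriptions rather
than a statement of what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the struct rseq_cs layout quoted above. */
struct rseq_cs_sketch {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
} __attribute__((aligned(32)));

static_assert(sizeof(struct rseq_cs_sketch) == 32, "fixed size of 32 bytes");

/* Builds a descriptor; post_commit_ip is the address right after the
 * last instruction of the sequence. */
struct rseq_cs_sketch make_rseq_cs(uint64_t start_ip, uint64_t post_commit_ip,
                                   uint64_t abort_ip)
{
    struct rseq_cs_sketch cs = {
        .version = 0,  /* currently the only accepted version */
        .flags = 0,
        .start_ip = start_ip,
        .post_commit_offset = post_commit_ip - start_ip,
        .abort_ip = abort_ip,
    };
    return cs;
}

/* True if ip lies in [start_ip, start_ip + post_commit_offset).
 * Unsigned wrap-around makes addresses below start_ip compare false. */
bool ip_in_critical_section(const struct rseq_cs_sketch *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}
```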
>> 
>> NOTES
>>       A single library per process  should  keep  the  rseq  structure  in  a
>>       thread-local  storage variable.  The cpu_id field should be initialized
>>       to -1, and the cpu_id_start field should be initialized to  a  possible
>>       CPU value (typically 0).
> 
> The part above is not quite right. All applications/libraries wishing to
> register rseq must follow the ABI specified by the C library. It can be
> defined within more than a single application/library, but in the end only
> one symbol will be chosen for the process's global symbol table.
> 
>> 
>>       Each  thread  is responsible for registering and unregistering its rseq
>>       structure.  No more than one rseq structure address can  be  registered
>>       per thread at a given time.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following paragraph, what is the difference │
>>       │between "freed" and "reclaim"?  I'm  supposing  they │
>>       │mean the same thing, but it's not clear. And if they │
>>       │do mean the same thing, then the first two sentences │
>>       │appear to contain contradictory information.         │
>>       └─────────────────────────────────────────────────────┘
> 
> They mean the same thing, and they are subtly not contradictory.
> 
> The first states that memory of a _registered_ rseq object must not
> be freed before the thread exits.
> 
> The second states that memory of a rseq object must not be freed before
> it is unregistered or the thread exits.
> 
> Do you have an alternative wording in mind to make this clearer ?
> 
>> 
>>       Memory  of a registered rseq object must not be freed before the thread
>>       exits.  Reclaim of rseq object's memory must only be done after  either
>>       an explicit rseq unregistration is performed or after the thread exits.
>>       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
>>       language  __thread)  lifetime  does  not guarantee existence of the TLS
>>       area up until the thread exits.
>> 
>>       In a typical usage scenario, the thread registering the rseq  structure
>>       will be performing loads and stores from/to that structure.  It is how‐
>>       ever also allowed to read that structure from other threads.  The  rseq
>>       field  updates performed by the kernel provide relaxed atomicity seman‐
>>       tics, which guarantee that  other  threads  performing  relaxed  atomic
>>       reads of the CPU number cache will always observe a consistent value.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  preceding  paragraph, can we reasonably add │
>>       │some words to explain "relaxed atomicity  semantics" │
>>       │and "relaxed atomic reads"?                          │
>>       └─────────────────────────────────────────────────────┘
> 
> Not sure how to word this exactly, but here it means the stores and loads need
> to be done atomically, but neither require nor provide any ordering guarantees
> with respect to other loads/stores (no memory barriers).
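In C11 terms, this corresponds to memory_order_relaxed. The sketch below
stands in for the CPU number cache with an ordinary atomic variable; the
variable and function names are invented, and the writer function merely
plays the role the kernel has in the real ABI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the cpu_id field; starts out as "not registered". */
static _Atomic uint32_t cpu_id_cache_sketch = (uint32_t)-1;

/* Writer side (the kernel's role in the real ABI): an atomic store
 * with no ordering guarantees relative to other memory accesses. */
void publish_cpu_id(uint32_t cpu)
{
    /* A relaxed access only avoids torn values if it is lock-free. */
    assert(atomic_is_lock_free(&cpu_id_cache_sketch));
    atomic_store_explicit(&cpu_id_cache_sketch, cpu, memory_order_relaxed);
}

/* Reader side (possibly another thread): observes either the old or
 * the new value, never a mix of both, and implies no memory barrier. */
uint32_t read_cpu_id_relaxed(void)
{
    return atomic_load_explicit(&cpu_id_cache_sketch, memory_order_relaxed);
}
```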
> 
>> 
>> RETURN VALUE
>>       A  return  value of 0 indicates success.  On error, -1 is returned, and
>>       errno is set appropriately.
>> 
>> ERRORS
>>       EBUSY  Restartable sequence is already registered for this thread.
>> 
>>       EFAULT rseq is an invalid address.
>> 
>>       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
>>              address which is not appropriately aligned, or rseq_len contains
>>              a size that does not match the size received on registration.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │The last case "rseq_len contains a  size  that  does │
>>              │not  match  the  size  received on registration" can │
>>              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
>>              └─────────────────────────────────────────────────────┘
>> 
>>       ENOSYS The rseq() system call is not implemented by this kernel.
>> 
>>       EPERM  The sig argument on unregistration does not match the  signature
>>              received on registration.
>> 
>> VERSIONS
>>       The rseq() system call was added in Linux 4.18.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │What is the current state of library support?        │
>>       └─────────────────────────────────────────────────────┘
> 
> After going through a few RFC rounds, it was posted as non-RFC a
> few weeks ago. It is pending review from glibc maintainers. I currently
> aim for inclusion of the rseq TLS registration in glibc 2.30:
> 
> https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html
> 
> Note that the C library will define a user-space ABI which states how
> applications/libraries wishing to register the rseq TLS need to behave so they
> are compatible with the C library when it gets updated to a new version
> providing rseq registration support. It seems like an important point to
> document, perhaps even here in the rseq(2) man page.
> 
> 
>> 
>> CONFORMING TO
>>       rseq() is Linux-specific.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Is  there  any  example  code that can reasonably be │
>>       │included in this manual page? Or some  example  code │
>>       │that can be referred to?                             │
>>       └─────────────────────────────────────────────────────┘
>> 
> 
> The per-cpu counter example we have here seems compact enough:
> 
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
> 
> Thanks,
> 
> Mathieu
> 
> 
>> SEE ALSO
>>       sched_getcpu(3), membarrier(2)
>> 
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com




