Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Fri, 11 Dec 2015 12:05:36 +0000 (UTC)

----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly@xxxxxxxxx wrote:

> This is an update to the previously posted series at:
>  https://lkml.org/lkml/2015/6/24/665
> 
> Dave Watson has posted a similar follow-up which allows additional critical
> regions to be registered as well as single-step support at:
>  https://lkml.org/lkml/2015/10/22/588
> 
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.

Hi Paul,

Thanks for sending this updated series! A few questions:

When you say that the preparatory work can be written in a higher
level language, does it mean it still has to be inlined, or function
calls are explicitly supported ? It would be good to clarify this in
the changelog/documentation.

Another point: in this changelog, it's unclear to me what the [*] refers
to (after "preparatory"). Below there is a [*] before "A sequence essentially
has 3 steps", and and plain '*' before "0. Userspace stores current event+cpu
counter values".

> 
> This new ABI has also been written to support debugger interaction in a way
> that the previous ABI could not.
> 
> [*] A sequence essentially has 3 steps:
>  1) Determine which cpu the sequence is being run on
>  2) Preparatory work specific to the state read in 1)
>  3) A single commit instruction which finalizes any state updates.

The link between those 3 steps and the following "progress" items
below seems a bit confusing, because those are two numbered lists.
Perhaps using the numbers above for the overall steps, and sub-numbers
after a dot could help exposing the relation, e.g.:

1.1 Userspace stores current event+cpu counter values
2.1 Userspace loads the rip to move to at failure into cx
2.2 Userspace loads the rip of the instruction following
    the critical section into a registered TLS address.
....

> 
> We require a single instruction for (3) so that if it is interrupted in any
> way, we can proceed from (1) once more [or potentially bail].
> 
> This new ABI can be described as:
> Progress is ordered as follows:
>  *0. Userspace stores current event+cpu counter values

This does not explain where those "current" values are
read from ? Moreover, step [3] below seems to refer to values
read at [0], but [0] is about storing values, not reading them.
There is probably a very simple explanation to this, but in its
current form, this is puzzling. Moreover, where is it stored ?

I would suggest to change event+cpu -> (event, cpu), since they
are not summed, but rather combined as a tuple.

>   1. Userspace loads the rip to move to at failure into cx

Is it really cx (16-bit), or ecx/rcx ?

"the rip to move to at failure" could be defined as
"failure address" or "restart address", and then referred
to afterward. This would clarify point [4] below rather than
saying "to the address from [1]".

>   2. Userspace loads the rip of the instruction following
>      the critical section into a registered TLS address.

The critical section is not clearly defined. What is the
first instruction included in that c.s., and what is the
first instruction after the c.s. ?

>   3. Userspace loads the values read at [0] into a known
>      location.

How can we "load the value ... into" ? We usually load from,
and store to a location. What I think you mean here is that
Userspace loads the values read at [0] into a known register.
(I would expect location target a memory address).

>   4. Userspace tests to see whether the current event and
>      cpu counter values match those stored at 0.  Manually
>      jumping to the address from [1] in the case of a
>      mismatch.

address from [1] -> "failure/restart address"

> 
>      Note that if we are preempted or otherwise interrupted
>      then the kernel can also now perform this comparison
>      and conditionally jump us to [1].

Does it jump us to [1] or to the failure/restart address ?

>   5. Our final instruction before [2] is then our commit.
>      The critical section is self-terminating.  [2] must
>      also be cleared at this point.

Is it the step [2] or the instruction following the critical
section to you refer to here ?

> 
> For x86_64:
>   [3] uses rdx to represent cpu and event counter as a
>       single 64-bit value.
> 
> For i386:
>   [3] uses ax for cpu and dx for the event_counter.

Having 16 bit only for CPU id limits us to 65536 CPUs.
Can this be an issue ?

Having 16 bit only for the event counter may start to become
tricky for counter overflows.

Are we strictly required to put both cpu id and seq counter
into the same register, or those could be put into two regs ?

> 
>  Both:
>   Instruction after commit: rseq_state->post_commit_instr
>   Current event and cpu state: rseq_state->event_and_cpu
> 
> Exactly, for x86_64 this looks like:
>  movq <failed>, rcx [1]
>  movq $1f, <commit_instr> [2]
>  cmpq <start value>, <current value> [3] (start is in rcx)

As you stated yourself in a reply, rcx -> rdx.

>  jnz <failed> (4)
>  movq <to_write>, (<target>) (5)
>  1: movq $0, <commit_instr>
> 
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS.  This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).

It would be important to keep allowing other architectures to use TLS-based
approaches though.

> 
> On debugger interactions:
> 
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
> a) The event counter is a per-cpu value.  This means that we can not advance
>    it if no threads from the same process execute on that cpu.  This
>    naturally allows basic single step support with thread-isolation.
> b) Single-step can be augmented to evalute the ABI without incrementing the
>    event count.
> c) A debugger can also be augmented to evaluate this ABI and push restarts
>    on the kernel's behalf.
> 
> This is also compatible with David's approach of not single stepping between
> 2-4 above.  However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available.  Which would be
> nice to allow actual debugging of sequences.
> 
> (Note that I haven't bothered implementing these in the current patchset as we
> are still winnowing down on the ABI and they just add complexity.  It's
> important to note that they are possible however.)

Ensuring these restartable critical sections can be used on shared memory
across processes seems rather important. I wonder if the debuggability of the
critical section really justifies to introduce a "PRIVATE" and a "SHARED" mode
for the rseq (similar to sys_futex), or if we could just enforce that no
breakpoint should be inserted within the critical section (except the very
first instruction).

Thoughts ?

Thanks,

Mathieu

> 
> Thanks,
> 
> - Paul

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html