Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Fri, 1 Oct 2021 11:28:58 -0400 (EDT)

----- On Sep 29, 2021, at 7:57 PM, paulmck paulmck@xxxxxxxxxx wrote:

> On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
>> Hi!
>> 
>> On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
>> > C99 describes that accessing volatile objects are side-effects, and that
>> > "at certain specified points in the execution sequence called sequence
>> > points, all side effects of previous evaluations shall be complete
>> > and no side effects of subsequent evaluations shall have taken
>> > place". [2]
>> 
>> But note that the kernel explicitly uses C89 (with GNU extensions).
>> Side effects are largely equal there though.
>> 
>> Also note that there may no place in the generated machine code that
>> corresponds exactly to some sequence point.  Sequence points are a
>> concept that applies to the source program and how that executes on the
>> abstract machine.
> 
> Plus the "as if" rule rears its ugly head in many of these situations.
> 
>> > +Because ctrl_dep emits distinct asm volatile within each leg of the if
>> > +statement, the compiler cannot transform the two writes to 'b' into a
>> > +conditional-move (cmov) instruction, thus ensuring the presence of a
>> > +conditional branch.  Also because the ctrl_dep emits asm volatile within
>> > +each leg of the if statement, the compiler cannot move the write to 'c'
>> > +before the conditional branch.
>> 
>> I think your reasoning here misses some things.  So many that I don't
>> know where to start to list them, every "because" and "thus" here does
>> not follow, and even the statements of fact are not a given.
>> 
>> Why do you want a conditional branch insn at all, anyway?  You really
>> want something else as far as I can see.
> 
> Because at the assembly language level on some architectures, a
> conditional branch instruction provides weak but very real and very
> useful memory-ordering properties.  Such a branch orders all loads
> whose return values feed into the branch condition before any stores
> that execute after the branch does (regardless of whether or not the
> branch was taken).  And this is all the ordering that is required for
> the use cases that Mathieu is worried about.
> 
> Yes, you can use explicit memory-barrier or acquire-load instructions,
> but those incur more overhead on some types of hardware.  The code in
> question is on a hotpath and is thus performance-critical.
> 
> It would be nice to be able to somehow tell the compiler exactly
> what the ordering constraints are ("this particular load must be
> ordered before these particular stores") and then let it (1) figure
> out that a conditional branch will do the trick and (2) generate the
> code accordingly.  But last I checked, this was not going to happen any
> time soon.  So for the time being, we have to live within the current
> capability of the tools that are available to us.
> 
> Linus points out that in all the actual control-dependent code in
> the Linux kernel, the compiler is going to be hard-pressed to fail
> to emit the required branch.  (Or in the case of ARMv8, the required
> conditional-move instruction.)
> 
> Mathieu, for his part, recently read the relevant portions of
> memory-barriers.txt (reproduced below) and would like to simplify these
> coding guidlines, which, speaking as the author of those guidelines,
> would be an extremely good thing.  His patches are attempting to move
> us in that direction.
> 
> Alternatives include: (1) Using acquire loads or memory barriers
> and accepting the loss in performance, but giving the compiler much
> less leeway, (2) Ripping all of the two-legged "if" examples from
> memory-barriers.txt and restricting control dependencies to else-less
> "if" statements, again giving the compiler less leeway, and (3) Your
> ideas here.
> 
> Does that help, or am I missing your point?

Thanks Paul, it does help explaining the motivation for relying on
control dependencies for some fast-path memory ordering in the kernel.

And yes, my main goal is to simplify the coding guide lines, but I have
not found any example of bad generated code in the tree kernel at this
point. In some cases (e.g. uses of smp_acquire__after_ctrl_dep()) it's
mainly thanks to luck though.

There is another alternative we could list here: implement ctrl_dep_true(),
ctrl_dep_false() and ctrl_dep(), which would respectively ensure a
control dependency on the then leg, on the else leg, or on both legs of
a conditional expression evaluation.

> 
>> It is essential here that there is a READ_ONCE and the WRITE_ONCE.
>> Those things might make it work the way you want, but as Linus says this
>> is all way too subtle.  Can you include the *_ONCE into the primitive
>> itself somehow?
> 
> Actually, if the store is not involved in a data race, the WRITE_ONCE()
> is not needed.  And in that case, the compiler is much less able to
> fail to provide the needed ordering.  (No, the current documentation
> does not reflect this.)  But if there is a data race, then your point
> is right on the mark -- that WRITE_ONCE() cannot be safely omitted.
> 
> But you are absolutely right that the READ_ONCE() or equivalent is not
> at all optional.  An example of an acceptable equivalent is an atomic
> read-modify-write operation such as atomic_xchg_relaxed().
> 
> The question about whether the READ_ONCE() and WRITE_ONCE() can be
> incorporated into the macro I leave to Mathieu.  I can certainly see
> serious benefits from this approach, at least from a compiler viewpoint.
> I must reserve judgment on usability until I see a proposal.

[...]

After having audited thoroughly all obviously documented control dependencies
in the kernel tree, I'm not sure that including the READ_ONCE() and WRITE_ONCE()
with the ctrl_dep() macro is a good idea, because in some cases there is
calculation to be done on the result of the READ_ONCE() (e.g. through
a static inline) before handing it over to the conditional expression.
In other cases many stores are being done after the control dependency, e.g.:

kernel/events/ring_buffer.c:__perf_output_begin()

        do {
                tail = READ_ONCE(rb->user_page->data_tail);
                offset = head = local_read(&rb->head);
                if (!rb->overwrite) {
                        if (unlikely(!ring_buffer_has_space(head, tail,
                                                            perf_data_size(rb),
                                                            size, backward)))
                                goto fail;
                }

                /*
                 * The above forms a control dependency barrier separating the
                 * @tail load above from the data stores below. Since the @tail
                 * load is required to compute the branch to fail below.
                 *
                 * A, matches D; the full memory barrier userspace SHOULD issue
                 * after reading the data and before storing the new tail
                 * position.
                 *
                 * See perf_output_put_handle().
                 */

                if (!backward)
                        head += size;
                else
                        head -= size;
        } while (local_cmpxchg(&rb->head, offset, head) != offset);

        if (backward) {
                offset = head;
                head = (u64)(-head);
        }

        /*
         * We rely on the implied barrier() by local_cmpxchg() to ensure
         * none of the data stores below can be lifted up by the compiler.
         */

[...]

Note that I suspect that this control dependency documentation could be
improved to state that the first control dependency is with the following
local_cmpxchg store, which itself has a control dependency (when it evaluates
to false) with the following stores to the ring buffer. Those are not volatile
stores, but the "memory" clobber with the local_cmpxchg should ensure that
following stores are after the local_cmpxchg in program order.

One other thing we could do to improve things slightly would be to turn
smp_acquire__after_ctrl_dep() into something which really is an acquire
in all cases, which may not currently be true if the compiler finds a
matching barrier()/smp_rmb() in the other leg after the conditional
expression.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com