Re: [PATCH 1/2] compiler.h: Introduce ptr_eq() to preserve address dependency

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Sun, 29 Sep 2024 06:36:16 -0400

On 2024-09-29 01:24, Gary Guo wrote:
Cc: Nikita Popov <github@xxxxxxxxxx>
Cc: llvm@xxxxxxxxxxxxxxx

On Sat, 28 Sep 2024 09:51:27 -0400
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

Compiler CSE and SSA GVN optimizations can cause the address dependency
of addresses returned by rcu_dereference to be lost when comparing those
pointers with either constants or previously loaded pointers.

Introduce ptr_eq() to compare two addresses while preserving the address
dependencies for later use of the address. It should be used when
comparing an address returned by rcu_dereference().

This is needed to prevent the compiler CSE and SSA GVN optimizations
from replacing the registers holding @a or @b based on their
equality, which does not preserve address dependencies and allows the
following misordering speculations:

- If @b is a constant, the compiler can issue the loads which depend
   on @a before loading @a.
- If @b is a register populated by a prior load, weakly-ordered
   CPUs can speculate loads which depend on @a before loading @a.

The same logic applies with @a and @b swapped.

The compiler barrier() is ineffective at fixing this issue.
It does not prevent the compiler CSE from losing the address dependency:

int fct_2_volatile_barriers(void)
{
     int *a, *b;

     do {
         a = READ_ONCE(p);
         asm volatile ("" : : : "memory");
         b = READ_ONCE(p);
     } while (a != b);
     asm volatile ("" : : : "memory");  <----- barrier()
     return *b;
}

With gcc 14.2 (arm64):

fct_2_volatile_barriers:
         adrp    x0, .LANCHOR0
         add     x0, x0, :lo12:.LANCHOR0
.L2:
         ldr     x1, [x0]    <------ x1 populated by first load.
         ldr     x2, [x0]
         cmp     x1, x2
         bne     .L2
         ldr     w0, [x1]    <------ x1 is used for access which should depend on b.
         ret

On weakly-ordered architectures, this lets CPU speculation use the
result from the first load to speculate "ldr w0, [x1]" before
"ldr x2, [x0]".
Based on the RCU documentation, the control dependency does not prevent
the CPU from speculating loads.

I recall seeing Nikita Popov (nikic) doing work related to this to LLVM
so it respects pointer provenances much better and doesn't randomly
replace pointers with others if they compare equal. So I tried to
reproduce this example on clang, which seems to generate the correct
code, loading from *b instead of *a.

The generated code with "ptr_eq" however produces one extra move
instruction which is not necessary.

I digged into the LLVM source code to see if this behaviour is that we
can rely on, and found that the GVN in use is very careful with
replacing pointers [1].

Essentially:
* null can be replaced
* constant addresses can be replaced <-- bad for this use case
* pointers originate from the same value (getUnderlyingObject).

So it appears to me that if we can ensure that neither a or b come
from a constant address then the OPTIMIZER_HIDE_VAR might be
unnecessary for clang? This should be testable with __builtin_constant_p.

Not necessary worth additional complexity handling clang specially, but
I think this is GCC/clang difference is worth pointing out.

Thanks for the thorough analysis of the clang GVN behavior. It confirms
my observations.

AFAIU, your proposal is to add a clang-specific #ifdef to eliminate one
mov from register to register (and thus free one register) when ptr_eq()
is used.

I'm not sure the gain (removing this extra mov) is worth it if what we
lose is robustness.

This would make the code dependent on current clang GVN optimization
design choices which are really specific to the compiler implementation
rather than guaranteed by the C standard. How can we be sure it won't
subtly break with a future clang version ?

If we think about it purely from a compiler optimization perspective,
using the content of the earliest loaded register allows weakly-ordered
CPUs to speculate following loads sooner. It's only when address
dependencies are needed (e.g. RCU) that this is unwanted. Am I missing
other cases where it is preferable to preserve address dependencies ?

Thanks,

Mathieu


I cc'ed nikic and clang-built-linux mailing list, please correct me if
I'm wrong.

[1]: https://github.com/llvm/llvm-project/blob/6558e5615ae9e6af6168b0a363808854fd66663f/llvm/lib/Analysis/Loads.cpp#L777-L788

Best,
Gary


Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Suggested-by: Boqun Feng <boqun.feng@xxxxxxxxx>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Reviewed-by: Boqun Feng <boqun.feng@xxxxxxxxx>
Acked-by: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
Cc: John Stultz <jstultz@xxxxxxxxxx>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@xxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Frederic Weisbecker <frederic@xxxxxxxxxx>
Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
Cc: Uladzislau Rezki <urezki@xxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
Cc: Zqiang <qiang.zhang1211@xxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: maged.michael@xxxxxxxxx
Cc: Mateusz Guzik <mjguzik@xxxxxxxxx>
Cc: Gary Guo <gary@xxxxxxxxxxx>
Cc: Jonas Oberhauser <jonas.oberhauser@xxxxxxxxxxxxxxx>
Cc: rcu@xxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: lkmm@xxxxxxxxxxxxxxx
---
  include/linux/compiler.h | 62 ++++++++++++++++++++++++++++++++++++++++
  1 file changed, 62 insertions(+)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 2df665fa2964..f26705c267e8 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -186,6 +186,68 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  	__asm__ ("" : "=r" (var) : "0" (var))
  #endif
  
+/*
+ * Compare two addresses while preserving the address dependencies for
+ * later use of the address. It should be used when comparing an address
+ * returned by rcu_dereference().
+ *
+ * This is needed to prevent the compiler CSE and SSA GVN optimizations
+ * from replacing the registers holding @a or @b based on their
+ * equality, which does not preserve address dependencies and allows the
+ * following misordering speculations:
+ *
+ * - If @b is a constant, the compiler can issue the loads which depend
+ *   on @a before loading @a.
+ * - If @b is a register populated by a prior load, weakly-ordered
+ *   CPUs can speculate loads which depend on @a before loading @a.
+ *
+ * The same logic applies with @a and @b swapped.
+ *
+ * Return value: true if pointers are equal, false otherwise.
+ *
+ * The compiler barrier() is ineffective at fixing this issue. It does
+ * not prevent the compiler CSE from losing the address dependency:
+ *
+ * int fct_2_volatile_barriers(void)
+ * {
+ *     int *a, *b;
+ *
+ *     do {
+ *         a = READ_ONCE(p);
+ *         asm volatile ("" : : : "memory");
+ *         b = READ_ONCE(p);
+ *     } while (a != b);
+ *     asm volatile ("" : : : "memory");  <-- barrier()
+ *     return *b;
+ * }
+ *
+ * With gcc 14.2 (arm64):
+ *
+ * fct_2_volatile_barriers:
+ *         adrp    x0, .LANCHOR0
+ *         add     x0, x0, :lo12:.LANCHOR0
+ * .L2:
+ *         ldr     x1, [x0]  <-- x1 populated by first load.
+ *         ldr     x2, [x0]
+ *         cmp     x1, x2
+ *         bne     .L2
+ *         ldr     w0, [x1]  <-- x1 is used for access which should depend on b.
+ *         ret
+ *

+ * On weakly-ordered architectures, this lets CPU speculation use the
+ * result from the first load to speculate "ldr w0, [x1]" before
+ * "ldr x2, [x0]".
+ * Based on the RCU documentation, the control dependency does not
+ * prevent the CPU from speculating loads.
+ */
+static __always_inline
+int ptr_eq(const volatile void *a, const volatile void *b)
+{
+	OPTIMIZER_HIDE_VAR(a);
+	OPTIMIZER_HIDE_VAR(b);
+	return a == b;
+}
+
  #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
  
  /**


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com