[RFC PATCH 1/7] advsync: Import 'TRANSITIVITY' section from memory-barriers.txt

Akira Yokosawa <akiyks@xxxxxxxxx> · Sat, 25 Mar 2017 18:35:50 +0900

>From 9df6a5d3eb65c67b79f8d15ee7928d617eae2c47 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@xxxxxxxxx>
Date: Sat, 25 Mar 2017 17:42:06 +0900
Subject: [RFC PATCH 1/7] advsync: Import 'TRANSITIVITY' section from memory-barriers.txt

Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
---
 advsync/memorybarriers.tex | 175 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 175 insertions(+)

diff --git a/advsync/memorybarriers.tex b/advsync/memorybarriers.tex
index 15ed1f6..efc2cbd 100644
--- a/advsync/memorybarriers.tex
+++ b/advsync/memorybarriers.tex
@@ -2536,6 +2536,181 @@ as shown in Figure~\ref{fig:advsync:Speculative Loads Cancelled by Barrier}.
 \ContributedBy{Figure}{fig:advsync:Speculative Loads Cancelled by Barrier}{David Howells}
 \end{figure*}
 
+\subsubsection{Transitivity}
+\label{sec:advsync:Transitivity}
+
+``Transitivity'' is a deeply intuitive notion about ordering that is not
+always provided by real computer systems.  The following example
+demonstrates transitivity with initial values of
+{\tt \{X = 0, Y = 0\}}:
+
+\vspace{5pt}
+\begin{minipage}[t]{\columnwidth}
+\tt
+\scriptsize
+\begin{tabular}{l|l|l}
+	\nf{CPU 1}	& \nf{CPU 2}		& \nf{CPU 3} \\
+	\hline
+	X = 1;		& LOAD X		& Y = 1; \\
+			& <general barrier>	& <general barrier> \\
+			& LOAD Y		& LOAD X \\
+\end{tabular}
+\end{minipage}
+\vspace{5pt}
+
+Suppose that CPU~2's load from~\co{X} returns~1 and its load from~\co{Y} returns~0.
+This indicates that CPU~2's load from~\co{X} in some sense follows CPU~1's
+store to~\co{X} and that CPU~2's load from~\co{Y} in some sense preceded CPU~3's
+store to~\co{Y}.  The question is then ``Can CPU~3's load from~\co{X} return~0?''
+
+Because CPU~2's load from~\co{X} in some sense came after CPU~1's store, it
+is natural to expect that CPU~3's load from~\co{X} must therefore return~1.
+This expectation is an example of transitivity: if a load executing on
+CPU~A follows a load from the same variable executing on CPU~B, then
+CPU~A's load must either return the same value that CPU~B's load did,
+or must return some later value.
+
+In the Linux kernel, use of general memory barriers guarantees
+transitivity.  Therefore, in the above example, if CPU~2's load from~\co{X}
+returns~1 and its load from~\co{Y} returns~0, then CPU~3's load from~\co{X} must
+also return~1.
+
+However, transitivity is {\em not} guaranteed for read or write barriers.
+For example, suppose that CPU~2's general barrier in the above example
+is changed to a read barrier as shown below with the same initial values of
+{\tt \{X = 0, Y = 0\}}:
+
+\vspace{5pt}
+\begin{minipage}[t]{\columnwidth}
+\tt
+\scriptsize
+\begin{tabular}{l|l|l}
+	\nf{CPU 1}	& \nf{CPU 2}		& \nf{CPU 3} \\
+	\hline
+	X = 1;		& LOAD X		& Y = 1; \\
+			& <read barrier>	& <general barrier> \\
+			& LOAD Y		& LOAD X \\
+\end{tabular}
+\end{minipage}
+\vspace{5pt}
+
+This substitution destroys transitivity: in this example, it is perfectly
+legal for CPU~2's load from~\co{X} to return~1, its load from~\co{Y} to return~0,
+and CPU~3's load from~\co{X} to return~0.
+
+The key point is that although CPU~2's read barrier orders its pair
+of loads, it does not guarantee to order CPU~1's store.  Therefore, if
+this example runs on a system where CPUs~1 and~2 share a store buffer
+or a level of cache, CPU~2 might have early access to CPU~1's writes.
+General barriers are therefore required to ensure that all CPUs agree
+on the combined order of CPU~1's and CPU~2's accesses.
+
+General barriers provide ``global transitivity'', so that all CPUs will
+agree on the order of operations.  In contrast, a chain of release-acquire
+pairs provides only ``local transitivity'', so that only those CPUs on
+the chain are guaranteed to agree on the combined order of the accesses.
+For example, switching to C code in deference to Herman Hollerith:
+
+{\scriptsize
+\begin{verbatim}
+   1 int u, v, x, y, z;
+   2
+   3 void cpu0(void)
+   4 {
+   5   r0 = smp_load_acquire(&x);
+   6   WRITE_ONCE(u, 1);
+   7   smp_store_release(&y, 1);
+   8 }
+   9
+  10 void cpu1(void)
+  11 {
+  12   r1 = smp_load_acquire(&y);
+  13   r4 = READ_ONCE(v);
+  14   r5 = READ_ONCE(u);
+  15   smp_store_release(&z, 1);
+  16 }
+  17
+  18 void cpu2(void)
+  19 {
+  20   r2 = smp_load_acquire(&z);
+  21   smp_store_release(&x, 1);
+  22 }
+  23
+  24 void cpu3(void)
+  25 {
+  26   WRITE_ONCE(v, 1);
+  27   smp_mb();
+  28   r3 = READ_ONCE(u);
+  29 }
+\end{verbatim}
+}
+
+Because \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} participate in a local transitive
+chain of \co{smp_store_release()}-\co{smp_load_acquire()} pairs, the following
+outcome is prohibited:
+
+{\scriptsize
+\begin{verbatim}
+  r0 == 1 && r1 == 1 && r2 == 1
+\end{verbatim}
+}
+
+Furthermore, because of the release-acquire relationship between \co{cpu0()}
+and \co{cpu1()}, \co{cpu1()} must see \co{cpu0()}'s writes, so that the following
+outcome is prohibited:
+
+{\scriptsize
+\begin{verbatim}
+  r1 == 1 && r5 == 0
+\end{verbatim}
+}
+
+However, the transitivity of release-acquire is local to the participating
+CPUs and does not apply to \co{cpu3()}.  Therefore, the following outcome
+is possible:
+
+{\scriptsize
+\begin{verbatim}
+  r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
+\end{verbatim}
+}
+
+As an aside, the following outcome is also possible:
+
+{\scriptsize
+\begin{verbatim}
+  r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
+          && r5 == 1
+\end{verbatim}
+}
+
+Although \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} will see their respective reads and
+writes in order, CPUs not involved in the release-acquire chain might
+well disagree on the order.  This disagreement stems from the fact that
+the weak memory-barrier instructions used to implement \co{smp_load_acquire()}
+and \co{smp_store_release()} are not required to order prior stores against
+subsequent loads in all cases.  This means that \co{cpu3()} can see \co{cpu0()}'s
+store to~\co{u} as happening {\em after} \co{cpu1()}'s load from~\co{v}, even though
+both \co{cpu0()} and \co{cpu1()} agree that these two operations occurred in the
+intended order.
+
+However, please keep in mind that \co{smp_load_acquire()} is not magic.
+In particular, it simply reads from its argument with ordering.  It does
+{\em not} ensure that any particular value will be read.  Therefore, the
+following outcome is possible:
+
+{\scriptsize
+\begin{verbatim}
+  r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
+\end{verbatim}
+}
+
+Note that this outcome can happen even on a mythical sequentially
+consistent system where nothing is ever reordered.
+
+To reiterate, if your code requires global transitivity, use general
+barriers throughout.
+
 \subsection{Locking Constraints}
 \label{sec:advsync:Locking Constraints}
 
-- 
2.7.4


--
To unsubscribe from this list: send the line "unsubscribe perfbook" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html