From eb80c7c23661824d63bb8e97c951dce03e158769 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@xxxxxxxxx>
Date: Sun, 26 Mar 2017 16:39:51 +0900
Subject: [RFC PATCH 08/12] advsync: Properly use nbsp

These nbsps prevent undesirable line breaks such as

    ... variable
    ------------ page/column break ------------
    A ...

That "A" might give the wrong impression that it is an indefinite article.
Single-letter variable names and short numbers at the beginning of a line
should be avoided, in the same sense that we prevent line breaks between
"Figure" and "\ref{}".

This commit also adds nbsps in lists of initial values, as was done in
commit 09f380606ca7 ("advsync: Properly use nbsp in initial values").

Some short conditionals are enclosed in \nbco{} to avoid line breaks.

Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
---
 advsync/memorybarriers.tex | 336 ++++++++++++++++++++++-----------------------
 1 file changed, 168 insertions(+), 168 deletions(-)

diff --git a/advsync/memorybarriers.tex b/advsync/memorybarriers.tex
index 64d78f4..173d7ad 100644
--- a/advsync/memorybarriers.tex
+++ b/advsync/memorybarriers.tex
@@ -32,13 +32,13 @@ in shared memory. For example, the litmus test in Table~\ref{tab:advsync:Memory Misordering: Dekker} appears to guarantee that the assertion never fires. -After all, if \co{r1 != 1}, we might hope that Thread~1's load from \co{y} -must have happened before Thread~2's store to \co{y}, which might raise -further hopes that Thread~2's load from \co{x} must happen after -Thread~1's store to \co{x}, so that \co{r2 == 1}, as required by the +After all, if \nbco{r1 != 1}, we might hope that Thread~1's load from~\co{y} +must have happened before Thread~2's store to~\co{y}, which might raise +further hopes that Thread~2's load from~\co{x} must happen after +Thread~1's store to~\co{x}, so that \nbco{r2 == 1}, as required by the assertion. The example is symmetric, so similar hopeful reasoning might lead -us to hope that \co{r2 != 1} guarantees that \co{r1 == 1}. +us to hope that \nbco{r2 != 1} guarantees that \nbco{r1 == 1}. Unfortunately, the lack of memory barriers in Table~\ref{tab:advsync:Memory Misordering: Dekker} dashes these hopes. @@ -138,7 +138,7 @@ Memory ordering and memory barriers can be extremely counter-intuitive. For example, consider the functions shown in Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} executing in parallel -where variables A, B, and C are initially zero: +where variables~A, B, and~C are initially zero: \begin{figure}[htbp] { \scriptsize @@ -173,10 +173,10 @@ where variables A, B, and C are initially zero: \label{fig:advsync:Parallel Hardware is Non-Causal} \end{figure} -Intuitively, \co{thread0()} assigns to B after it assigns to A, -\co{thread1()} waits until \co{thread0()} has assigned to B before -assigning to C, and \co{thread2()} waits until \co{thread1()} has -assigned to C before referencing A. +Intuitively, \co{thread0()} assigns to~B after it assigns to~A, +\co{thread1()} waits until \co{thread0()} has assigned to~B before +assigning to~C, and \co{thread2()} waits until \co{thread1()} has +assigned to~C before referencing~A. Therefore, again intuitively, the assertion on line~21 cannot possibly fire. @@ -185,7 +185,7 @@ and utterly incorrect. Please note that this is \emph{not} a theoretical assertion: actually running this code on real-world weakly-ordered hardware (a 1.5GHz 16-CPU POWER 5 system) resulted in the assertion firing -16 times out of 10 million runs. +16~times out of 10~million runs.
Clearly, anyone who produces code with explicit memory barriers should do some extreme testing---although a proof of correctness might be helpful, the strongly counter-intuitive nature of the behavior of @@ -201,8 +201,8 @@ greatly \emph{increase} the probability of failure in this run. \emph{possibly} fail? \QuickQuizAnswer{ The key point is that the intuitive analysis missed is that - there is nothing preventing the assignment to C from overtaking - the assignment to A as both race to reach {\tt thread2()}. + there is nothing preventing the assignment to~C from overtaking + the assignment to~A as both race to reach \co{thread2()}. This is explained in the remainder of this section. } \QuickQuizEnd @@ -309,11 +309,11 @@ with the black regions to the left indicating the time before the corresponding CPU's first measurement. During the first 5ns, only CPU~3 has an opinion about the value of the variable. -During the next 10ns, CPUs~2 and 3 disagree on the value of the variable, -but thereafter agree that the value is ``2'', which is in fact +During the next 10ns, CPUs~2 and~3 disagree on the value of the variable, +but thereafter agree that the value is~``2'', which is in fact the final agreed-upon value. -However, CPU~1 believes that the value is ``1'' for almost 300ns, and -CPU~4 believes that the value is ``4'' for almost 500ns. +However, CPU~1 believes that the value is~``1'' for almost 300ns, and +CPU~4 believes that the value is~``4'' for almost 500ns. \QuickQuiz{} How could CPUs possibly have different views of the @@ -331,15 +331,15 @@ CPU~4 believes that the value is ``4'' for almost 500ns. } \QuickQuizEnd \QuickQuiz{} - Why do CPUs~2 and 3 come to agreement so quickly, when it - takes so long for CPUs~1 and 4 to come to the party? + Why do CPUs~2 and~3 come to agreement so quickly, when it + takes so long for CPUs~1 and~4 to come to the party? \QuickQuizAnswer{ - CPUs~2 and 3 are a pair of hardware threads on the same + CPUs~2 and~3 are a pair of hardware threads on the same core, sharing the same cache hierarchy, and therefore have very low communications latencies. This is a NUMA, or, more accurately, a NUCA effect. - This leads to the question of why CPUs~2 and 3 ever disagree + This leads to the question of why CPUs~2 and~3 ever disagree at all. One possible reason is that they each might have a small amount of private cache in addition to a larger shared cache. @@ -350,17 +350,17 @@ CPU~4 believes that the value is ``4'' for almost 500ns. And if you think that the situation with four CPUs was intriguing, consider Figure~\ref{fig:advsync:A Variable With More Simultaneous Values}, -which shows the same situation, but with 15 CPUs each assigning their -number to a single shared variable at time $t=0$. Both diagrams in the +which shows the same situation, but with 15~CPUs each assigning their +number to a single shared variable at time~$t=0$. Both diagrams in the figure are drawn in the same way as Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}. The only difference is that the unit of horizontal axis is timebase ticks, -with each tick lasting about 5.3 nanoseconds. +with each tick lasting about 5.3~nanoseconds. The entire sequence therefore lasts a bit longer than the events recorded in Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}, consistent with the increase in number of CPUs. The upper diagram shows the overall picture, while the lower one shows -the zoom-up of first 50 timebase ticks. +the zoom-up of first 50~timebase ticks. 
Again, CPU~0 coordinates the test, so does not record any values. @@ -371,10 +371,10 @@ Again, CPU~0 coordinates the test, so does not record any values. \ContributedBy{Figure}{fig:advsync:A Variable With More Simultaneous Values}{Akira Yokosawa} \end{figure*} -All CPUs eventually agree on the final value of 9, but not before -the values 15 and 12 take early leads. +All CPUs eventually agree on the final value of~9, but not before +the values~15 and~12 take early leads. Note that there are fourteen different opinions on the variable's value -at time 21 indicated by the vertical line in the lower diagram. +at time~21 indicated by the vertical line in the lower diagram. Note also that all CPUs see sequences whose orderings are consistent with the directed graph shown in Figure~\ref{fig:advsync:Possible Global Orders With More Simultaneous Values}. @@ -468,7 +468,7 @@ and CPU~4 sees the sequence {\tt \{4,2\}}. This is consistent with the global sequence {\tt \{3,1,4,2\}}, but also with all five of the other sequences of these four numbers that end -in ``2''. +in~``2''. Thus, there will be agreement on the sequence of values taken on by a single variable, but there might be ambiguity. @@ -477,7 +477,7 @@ In contrast, had the CPUs used atomic operations (such as the Linux kernel's unique values, their observations would be guaranteed to determine a single globally consistent sequence of values. One of the \co{atomic_inc_return()} invocations would happen first, -and would change the value from 0 to 1, the second from 1 to 2, and +and would change the value from~0 to~1, the second from~1 to~2, and so on. The CPUs could compare notes afterwards and come to agreement on the exact ordering of the sequence of \co{atomic_inc_return()} invocations. @@ -500,7 +500,7 @@ commercially available computer systems. Pair-wise memory barriers provide conditional ordering semantics. For example, in the following set of operations, CPU~1's access to -A does not unconditionally precede its access to B from the viewpoint +A does not unconditionally precede its access to~B from the viewpoint of an external logic analyzer \IfInBook{(see Appendix~\ref{chp:app:whymb:Why Memory Barriers?} for examples). @@ -508,9 +508,9 @@ of an external logic analyzer {(the system is only to act \emph{as if} the accesses are in order; it is not necessarily required to actually force them to be in order).} -However, if CPU~2's access to B sees the result of CPU~1's access -to B, then CPU~2's access to A is guaranteed to see the result of -CPU~1's access to A. +However, if CPU~2's access to~B sees the result of CPU~1's access +to~B, then CPU~2's access to~A is guaranteed to see the result of +CPU~1's access to~A. Although some CPU families' memory barriers do in fact provide stronger, unconditional ordering guarantees, portable code may rely only on this weaker if-then conditional ordering guarantee. @@ -608,7 +608,7 @@ pairings that portable software may depend on. In this pairing, one CPU executes a pair of loads separated by a memory barrier, while a second CPU executes a pair of stores also separated by a memory barrier, as follows - (both A and B are initially equal to zero): + (both~A and~B are initially equal to zero): \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -635,12 +635,12 @@ pairings that portable software may depend on. \co{X==1}. On the other hand, if \co{Y==0}, the memory-barrier condition - does not hold, and so in this case, X could be either 0 or 1. + does not hold, and so in this case, X~could be either~0 or~1. 
\paragraph{Pairing 2.} In this pairing, each CPU executes a load followed by a memory barrier followed by a store, as follows - (both A and B are initially equal to zero): + (both~A and~B are initially equal to zero): \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -667,7 +667,7 @@ pairings that portable software may depend on. so that \co{Y==0}. On the other hand, if \co{X==0}, the memory-barrier condition - does not hold, and so in this case, Y could be either 0 or 1. + does not hold, and so in this case, Y~could be either~0 or~1. The two CPUs' code sequences are symmetric, so if \co{Y==1} after both CPUs have finished executing these code sequences, @@ -677,7 +677,7 @@ pairings that portable software may depend on. In this pairing, one CPU executes a load followed by a memory barrier followed by a store, while the other CPU executes a pair of stores separated by a memory barrier, - as follows (both A and B are initially equal to zero): + as follows (both~A and~B are initially equal to zero): \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -701,11 +701,11 @@ pairings that portable software may depend on. Due to the pairwise nature of memory barriers, CPU~1's store following its memory barrier must therefore see the results of CPU~2's store preceding its memory barrier. - This means that CPU~1's store to B will overwrite CPU~2's - store to B, resulting in \co{B==1}. + This means that CPU~1's store to~B will overwrite CPU~2's + store to~B, resulting in \co{B==1}. On the other hand, if \co{X==0}, the memory-barrier condition - does not hold, and so in this case, B could be either 1 or 2. + does not hold, and so in this case, B~could be either~1 or~2. \subsubsection{Pair-Wise Memory Barriers: Semi-Portable Combinations} @@ -733,7 +733,7 @@ keep in mind that they used to be a \emph{lot} harder on some systems! one of the loads will see the value stored by the other thread in the ears-to-mouths scenario? \QuickQuizAnswer{ - The scenario is as follows, with A and B both initially zero: + The scenario is as follows, with~A and~B both initially zero: CPU~0: A=1; \co{smp_mb()}; r1=B; @@ -742,12 +742,12 @@ keep in mind that they used to be a \emph{lot} harder on some systems! If neither of the loads see the corresponding store, when both CPUs finish, both \co{r1} and \co{r2} will be equal to zero. Let's suppose that \co{r1} is equal to zero. - Then we know that CPU~0's load from B happened before CPU~1's - store to B: After all, we would have had \co{r1} equal to one + Then we know that CPU~0's load from~B happened before CPU~1's + store to~B: After all, we would have had \co{r1} equal to one otherwise. - But given that CPU~0's load from B happened before CPU~1's store - to B, memory-barrier pairing guarantees that CPU~0's store to A - happens before CPU~1's load from A, which in turn guarantees that + But given that CPU~0's load from~B happened before CPU~1's store + to~B, memory-barrier pairing guarantees that CPU~0's store to~A + happens before CPU~1's load from~A, which in turn guarantees that \co{r2} will be equal to one, not zero. Therefore, at least one of \co{r1} and \co{r2} must be nonzero, @@ -777,8 +777,8 @@ keep in mind that they used to be a \emph{lot} harder on some systems! Unfortunately, although this conclusion is correct on 21\textsuperscript{st}-century systems, it does not necessarily hold on all antique 20\textsuperscript{th}-century systems. - Suppose that the cache line containing A is initially owned - by CPU~2, and that containing B is initially owned by CPU~1. 
+ Suppose that the cache line containing~A is initially owned + by CPU~2, and that containing~B is initially owned by CPU~1. Then, in systems that have invalidation queues and store buffers, it is possible for the first assignments to ``pass in the night'', so that the second assignments actually @@ -812,10 +812,10 @@ these combinations in order to fully understand how this works. (ignoring MMIO registers for the moment), it is not possible for one of the loads to see the results of the other load. - However, if we know that CPU~2's load from B returned a - newer value than CPU~1's load from B, then we also know - that CPU~2's load from A returned either the same value - as CPU~1's load from A or some later value. + However, if we know that CPU~2's load from~B returned a + newer value than CPU~1's load from~B, then we also know + that CPU~2's load from~A returned either the same value + as CPU~1's load from~A or some later value. \paragraph{Mouth to Mouth, Ear to Ear.} One of the variables is only loaded from, and the other @@ -826,12 +826,12 @@ these combinations in order to fully understand how this works. provided by the memory barrier. However, it is possible to determine which store happened - last, but this requires an additional load from B. - If this additional load from B is executed after both + last, but this requires an additional load from~B. + If this additional load from~B is executed after both CPUs~1 and~2 complete, and if it turns out that CPU~2's - store to B happened last, then we know - that CPU~2's load from A returned either the same value - as CPU~1's load from A or some later value. + store to~B happened last, then we know + that CPU~2's load from~A returned either the same value + as CPU~1's load from~A or some later value. \paragraph{Only One Store.} Because there is only one store, only one of the variables @@ -843,28 +843,28 @@ these combinations in order to fully understand how this works. At least not straightforwardly. But suppose that in combination~1 from Table~\ref{tab:advsync:Memory-Barrier Combinations}, - CPU~1's load from A returns the value that CPU~2 stored - to A. Then we know that CPU~1's load from B returned - either the same value as CPU~2's load from A or some later value. + CPU~1's load from~A returns the value that CPU~2 stored + to~A. Then we know that CPU~1's load from~B returned + either the same value as CPU~2's load from~A or some later value. \QuickQuiz{} How can the other ``Only one store'' entries in Table~\ref{tab:advsync:Memory-Barrier Combinations} be used? \QuickQuizAnswer{ - For combination~2, if CPU~1's load from B sees a value prior - to CPU~2's store to B, then we know that CPU~2's load from A - will return the same value as CPU~1's load from A, or some later + For combination~2, if CPU~1's load from~B sees a value prior + to CPU~2's store to~B, then we know that CPU~2's load from~A + will return the same value as CPU~1's load from~A, or some later value. - For combination~4, if CPU~2's load from B sees the value from - CPU~1's store to B, then we know that CPU~2's load from A - will return the same value as CPU~1's load from A, or some later + For combination~4, if CPU~2's load from~B sees the value from + CPU~1's store to~B, then we know that CPU~2's load from~A + will return the same value as CPU~1's load from~A, or some later value. - For combination~8, if CPU~2's load from A sees CPU~1's store - to A, then we know that CPU~1's load from B will return the same - value as CPU~2's load from A, or some later value. 
+ For combination~8, if CPU~2's load from~A sees CPU~1's store + to~A, then we know that CPU~1's load from~B will return the same + value as CPU~2's load from~A, or some later value. } \QuickQuizEnd \subsubsection{Semantics Sufficient to Implement Locking} @@ -921,7 +921,7 @@ assert(b == 2); \QuickQuizAnswer{ If the CPU is not required to see all of its loads and stores in order, then the {\tt b=1+a} might well see an - old version of the variable ``a''. + old version of the variable~``a''. This is why it is so very important that each CPU or thread see all of its own loads and stores in program order. @@ -1090,28 +1090,28 @@ a few simple rules: for example, if a given CPU never loaded or stored the shared variable, then it can have no opinion about that variable's value.} -\item If one CPU does ordered stores to variables A and B,\footnote{ - For example, by executing the store to A, a - memory barrier, and then the store to B.} - and if a second CPU does ordered loads from B and A,\footnote{ - For example, by executing the load from B, a - memory barrier, and then the load from A.} - then if the second CPU's load from B gives the value stored - by the first CPU, then the second CPU's load from A must +\item If one CPU does ordered stores to variables~A and~B,\footnote{ + For example, by executing the store to~A, a + memory barrier, and then the store to~B.} + and if a second CPU does ordered loads from~B and~A,\footnote{ + For example, by executing the load from~B, a + memory barrier, and then the load from~A.} + then if the second CPU's load from~B gives the value stored + by the first CPU, then the second CPU's load from~A must give the value stored by the first CPU. -\item If one CPU does a load from A ordered before a store to B, - and if a second CPU does a load from B ordered before a store to A, - and if the second CPU's load from B gives the value stored by - the first CPU, then the first CPU's load from A must \emph{not} +\item If one CPU does a load from~A ordered before a store to~B, + and if a second CPU does a load from~B ordered before a store to~A, + and if the second CPU's load from~B gives the value stored by + the first CPU, then the first CPU's load from~A must \emph{not} give the value stored by the second CPU. -\item If one CPU does a load from A ordered before a store to B, - and if a second CPU does a store to B ordered before a - store to A, and if the first CPU's load from A gives +\item If one CPU does a load from~A ordered before a store to~B, + and if a second CPU does a store to~B ordered before a + store to~A, and if the first CPU's load from~A gives the value stored by the second CPU, then the first CPU's - store to B must happen after the second CPU's store to B, + store to~B must happen after the second CPU's store to~B, hence the value stored by the first CPU persists.\footnote{ Or, for the more competitively oriented, the first - CPU's store to B ``wins''.} + CPU's store to~B ``wins''.} \end{enumerate} The next section takes a more operational view of these rules. @@ -1142,7 +1142,7 @@ interface between the CPU and rest of the system (the dotted lines). For example, consider the following sequence of events given the -initial values {\tt \{A = 1, B = 2\}}: +initial values {\tt \{A~=~1, B~=~2\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1199,7 +1199,7 @@ perceived by the loads made by another CPU in the same order as the stores were committed. 
As a further example, consider this sequence of events given the -initial values {\tt \{A = 1, B = 2, C = 3, P = \&A, Q = \&C\}}: +initial values {\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1215,8 +1215,8 @@ initial values {\tt \{A = 1, B = 2, C = 3, P = \&A, Q = \&C\}}: \vspace{5pt} There is an obvious data dependency here, -as the value loaded into \co{D} depends on -the address retrieved from \co{P} by CPU~2. +as the value loaded into~\co{D} depends on +the address retrieved from~\co{P} by CPU~2. At the end of the sequence, any of the following results are possible: @@ -1232,8 +1232,8 @@ following results are possible: \end{minipage} \vspace{5pt} -Note that CPU~2 will never try and load C into D because the CPU will load P -into Q before issuing the load of *Q. +Note that CPU~2 will never try and load~C into~D because the CPU will load~P +into~Q before issuing the load of~*Q. \subsection{Device Operations} \label{sec:advsync:Device Operations} @@ -1241,8 +1241,8 @@ into Q before issuing the load of *Q. Some devices present their control interfaces as collections of memory locations, but the order in which the control registers are accessed is very important. For instance, imagine an Ethernet card with a set of internal -registers that are accessed through an address port register (A) and a data -port register (D). To read internal register 5, the following code might then +registers that are accessed through an address port register~(A) and a data +port register~(D). To read internal register~5, the following code might then be used: \vspace{5pt} @@ -1594,7 +1594,7 @@ Section~\ref{sec:advsync:Device Operations}). \QuickQuiz{} What effect does the following sequence have on the - order of stores to variables ``a'' and ``b''? + order of stores to variables~``a'' and~``b''? \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1607,9 +1607,9 @@ Section~\ref{sec:advsync:Device Operations}). \end{minipage} \QuickQuizAnswer{ Absolutely none. This barrier {\em would} ensure that the - assignments to ``a'' and ``b'' happened before any subsequent + assignments to~``a'' and~``b'' happened before any subsequent assignments, but it does nothing to enforce any order of - assignments to ``a'' and ``b'' themselves. + assignments to~``a'' and~``b'' themselves. } \QuickQuizEnd \subsubsection{What May Not Be Assumed About Memory Barriers?} @@ -1652,7 +1652,7 @@ of the confines of a given architecture: The usage requirements of data dependency barriers are a little subtle, and it's not always obvious that they're needed. 
To illustrate, consider the following sequence of events, with initial values -{\tt \{A = 1, B = 2, C = 3, P = \&A, Q = \&C\}}: +{\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1671,8 +1671,8 @@ following sequence of events, with initial values \vspace{5pt} There's a clear data dependency here, and it would seem intuitively -obvious that by the end of the sequence, \co{Q} must be either \co{&A} -or \co{&B}, and that: +obvious that by the end of the sequence, \co{Q}~must be either~\co{&A} +or~\co{&B}, and that: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1686,8 +1686,8 @@ or \co{&B}, and that: \vspace{5pt} Counter-intuitive though it might be, it is quite possible that -CPU~2's perception of \co{P} might be updated \emph{before} its perception -of \co{B}, thus leading to the following situation: +CPU~2's perception of~\co{P} might be updated \emph{before} its perception +of~\co{B}, thus leading to the following situation: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1705,7 +1705,7 @@ Alpha). To deal with this, a data dependency barrier must be inserted between the address load and the data load (again with initial values of -{\tt \{A = 1, B = 2, C = 3, P = \&A, Q = \&C\}}): +{\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}): \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1731,17 +1731,17 @@ Note that this extremely counterintuitive situation arises most easily on machines with split caches, so that, for example, one cache bank processes even-numbered cache lines and the other bank processes odd-numbered cache lines. -The pointer \co{P} might be stored in an odd-numbered cache line, and the -variable \co{B} might be stored in an even-numbered cache line. Then, if the +The pointer~\co{P} might be stored in an odd-numbered cache line, and the +variable~\co{B} might be stored in an even-numbered cache line. Then, if the even-numbered bank of the reading CPU's cache is extremely busy while the odd-numbered bank is idle, one can see the new value of the -pointer \co{P} (which is \co{&B}), -but the old value of the variable \co{B} (which is 2). +pointer~\co{P} (which is~\co{&B}), +but the old value of the variable~\co{B} (which is~2). Another example of where data dependency barriers might by required is where a number is read from memory and then used to calculate the index for an array access with initial values -{\tt \{M[0] = 1, M[1] = 2, M[3] = 3, P = 0, Q = 3\}}: +{\tt \{M[0]~=~1, M[1]~=~2, M[3]~=~3, P~=~0, Q~=~3\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -1799,7 +1799,7 @@ Consider the following bit of code: This will not have the desired effect because there is no actual data dependency, but rather a control dependency that the CPU may short-circuit by attempting to predict the outcome in advance, so that other CPUs see -the load from \co{y} as having happened before the load from \co{x}. +the load from~\co{y} as having happened before the load from~\co{x}. In such a case what's actually required is: \vspace{5pt} @@ -1834,13 +1834,13 @@ Control dependencies pair normally with other types of barriers. That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()} are optional! Without the \co{READ_ONCE()}, the compiler might combine the load -from \co{x} with other loads from \co{x}. -Without the \co{WRITE_ONCE()}, the compiler might combine the store to -\co{y} with other stores to \co{y}. +from~\co{x} with other loads from~\co{x}. 
+Without the \co{WRITE_ONCE()}, the compiler might combine the store +to~\co{y} with other stores to~\co{y}. Either can result in highly counterintuitive effects on ordering. Worse yet, if the compiler is able to prove (say) that the value of -variable \co{x} is always non-zero, it would be well within its rights +variable~\co{x} is always non-zero, it would be well within its rights to optimize the original example by eliminating the ``\co{if}'' statement as follows: @@ -1894,8 +1894,8 @@ optimization levels: \end{minipage} \vspace{5pt} -Now there is no conditional between the load from \co{x} and the store to -\co{y}, which means that the CPU is within its rights to reorder them: +Now there is no conditional between the load from~\co{x} and the store +to~\co{y}, which means that the CPU is within its rights to reorder them: The conditional is absolutely required, and must be present in the assembly code even after all compiler optimizations have been applied. Therefore, if you need ordering in this example, you need explicit @@ -1918,9 +1918,9 @@ memory barriers, for example, a release store: \vspace{5pt} The initial \co{READ_ONCE()} is still required to prevent the compiler from -proving the value of \co{x}. +proving the value of~\co{x}. -In addition, you need to be careful what you do with the local variable +In addition, you need to be careful what you do with the local variable~% \co{q}, otherwise the compiler might be able to guess the value and again remove the needed conditional. @@ -1942,7 +1942,7 @@ For example: \end{minipage} \vspace{5pt} -If \co{MAX} is defined to be 1, then the compiler knows that \co{(q\%MAX)} is +If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is equal to zero, in which case the compiler is within its rights to transform the above code into the following: @@ -1958,7 +1958,7 @@ transform the above code into the following: \vspace{5pt} Given this transformation, the CPU is not required to respect the ordering -between the load from variable \co{x} and the store to variable \co{y}. +between the load from variable~\co{x} and the store to variable~\co{y}. It is tempting to add a \co{barrier()} to constrain the compiler, but this does not help. The conditional is gone, and the barrier won't bring it back. @@ -1982,7 +1982,7 @@ that \co{MAX} is greater than one, perhaps as follows: \end{minipage} \vspace{5pt} -Please note once again that the stores to \co{y} differ. +Please note once again that the stores to~\co{y} differ. If they were identical, as noted earlier, the compiler could pull this store outside of the ``\co{if}'' statement. @@ -2042,9 +2042,9 @@ not necessarily apply to code following the if-statement: It is tempting to argue that there in fact is ordering because the compiler cannot reorder volatile accesses and also cannot reorder -the writes to \co{y} with the condition. +the writes to~\co{y} with the condition. Unfortunately for this line -of reasoning, the compiler might compile the two writes to \co{y} as +of reasoning, the compiler might compile the two writes to~\co{y} as conditional-move instructions, as in this fanciful pseudo-assembly language: @@ -2063,7 +2063,7 @@ language: \vspace{5pt} A weakly ordered CPU would have no dependency of any sort between the load -from \co{x} and the store to \co{z}. +from~\co{x} and the store to~\co{z}. The control dependencies would extend only to the pair of cmov instructions and the store depending on them. 
In short, control dependencies apply only to the stores in the ``\co{then}'' @@ -2071,8 +2071,8 @@ and ``\co{else}'' of the ``\co{if}'' in question (including functions invoked by those two clauses), not to code following that ``\co{if}''. Finally, control dependencies do \emph{not} provide transitivity. -This is demonstrated by two related examples, with the initial values of -\co{x} and \co{y} both being zero: +This is demonstrated by two related examples, with the initial values +of~\co{x} and~\co{y} both being zero: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2114,7 +2114,7 @@ not), then adding the following CPU would guarantee a related assertion: But because control dependencies do \emph{not} provide transitivity, the above assertion can fail after the combined three-CPU example completes. If you need the three-CPU example to provide ordering, you will need -\co{smp_mb()} between the loads and stores in the CPU 0 and CPU 1 code +\co{smp_mb()} between the loads and stores in the CPU~0 and CPU~1 code fragments, that is, just before or just after the ``\co{if}'' statements. Furthermore, the original two-CPU example is very fragile and should be avoided. @@ -2270,7 +2270,7 @@ Figure~\ref{fig:advsync:Write Barrier Ordering Semantics}. Secondly, data dependency barriers act as partial orderings on data-dependent loads. Consider the following sequence of events with initial values -{\tt \{B = 7, X = 9, Y = 8, C = \&Y\}}: +{\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2300,13 +2300,13 @@ shown in Figure~\ref{fig:advsync:Data Dependency Barrier Omitted}. \ContributedBy{Figure}{fig:advsync:Data Dependency Barrier Omitted}{David Howells} \end{figure*} -In the above example, CPU~2 perceives that \co{B} is 7, -despite the load of \co{*C} -(which would be \co{B}) coming after the \co{LOAD} of \co{C}. +In the above example, CPU~2 perceives that \co{B} is~7, +despite the load of~\co{*C} +(which would be~\co{B}) coming after the \co{LOAD} of~\co{C}. -If, however, a data dependency barrier were to be placed between the load of -\co{C} and the load of \co{*C} (i.e.: \co{B}) on CPU~2, again with initial -values of {\tt \{B = 7, X = 9, Y = 8, C = \&Y\}}: +If, however, a data dependency barrier were to be placed between the load +of~\co{C} and the load of~\co{*C} (i.e.:~\co{B}) on CPU~2, again with initial +values of {\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2338,7 +2338,7 @@ Figure~\ref{fig:advsync:Data Dependency Barrier Supplied}. And thirdly, a read barrier acts as a partial order on loads. Consider the following sequence of events, with initial values -{\tt \{A = 0, B = 9\}}: +{\tt \{A~=~0, B~=~9\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2367,9 +2367,9 @@ shown in Figure~\ref{fig:advsync:Read Barrier Needed}. \ContributedBy{Figure}{fig:advsync:Read Barrier Needed}{David Howells} \end{figure*} -If, however, a read barrier were to be placed between the load of \co{B} -and the load of \co{A} on CPU~2, again with initial values of -{\tt \{A = 0, B = 9\}}: +If, however, a read barrier were to be placed between the load of~\co{B} +and the load of~\co{A} on CPU~2, again with initial values of +{\tt \{A~=~0, B~=~9\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2400,9 +2400,9 @@ Figure~\ref{fig:advsync:Read Barrier Supplied}. 
\end{figure*} To illustrate this more completely, consider what could happen if the code -contained a load of \co{A} either side of the read barrier, once again +contained a load of~\co{A} either side of the read barrier, once again with the same initial values of -{\tt \{A = 0, B = 9\}}: +{\tt \{A~=~0, B~=~9\}}: \vspace{5pt} \begin{minipage}[t]{\columnwidth} @@ -2422,8 +2422,8 @@ with the same initial values of \end{minipage} \vspace{5pt} -Even though the two loads of \co{A} -both occur after the load of \co{B}, they may both +Even though the two loads of~\co{A} +both occur after the load of~\co{B}, they may both come up with different values, as shown in Figure~\ref{fig:advsync:Read Barrier Supplied, Double Load}. @@ -2434,7 +2434,7 @@ Figure~\ref{fig:advsync:Read Barrier Supplied, Double Load}. \ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Double Load}{David Howells} \end{figure*} -Of course, it may well be that CPU~1's update to \co{A} becomes perceptible +Of course, it may well be that CPU~1's update to~\co{A} becomes perceptible to CPU~2 before the read barrier completes, as shown in Figure~\ref{fig:advsync:Read Barrier Supplied, Take Two}. @@ -2445,11 +2445,11 @@ Figure~\ref{fig:advsync:Read Barrier Supplied, Take Two}. \ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Take Two}{David Howells} \end{figure*} -The guarantee is that the second load will always come up with \co{A == 1} +The guarantee is that the second load will always come up with \nbco{A == 1} if the -load of \co{B} came up with \co{B == 2}. -No such guarantee exists for the first load of -\co{A}; that may come up with either \co{A == 0} or \co{A == 1}. +load of~\co{B} came up with \nbco{B == 2}. +No such guarantee exists for the first load +of~\co{A}; that may come up with either \nbco{A == 0} or \nbco{A == 1}. \subsubsection{Read Memory Barriers vs. Load Speculation} \label{sec:advsync:Read Memory Barriers vs. Load Speculation} @@ -2484,7 +2484,7 @@ For example, consider the following: On some CPUs, divide instructions can take a long time to complete, which means that CPU~2's bus might go idle during that time. -CPU~2 might therefore speculatively load \co{A} before the divides +CPU~2 might therefore speculatively load~\co{A} before the divides complete. In the (hopefully) unlikely event of an exception from one of the dividees, this speculative load will have been wasted, but in the (again, hopefully) @@ -2523,9 +2523,9 @@ dependent on the type of barrier used. If there was no change made to the speculated memory location, then the speculated value will just be used, as shown in Figure~\ref{fig:advsync:Speculative Loads and Barrier}. -On the other hand, if there was an update or invalidation to \co{A} +On the other hand, if there was an update or invalidation to~\co{A} from some other CPU, then the speculation will be cancelled and the -value of \co{A} will be reloaded, +value of~\co{A} will be reloaded, as shown in Figure~\ref{fig:advsync:Speculative Loads Cancelled by Barrier}. \begin{figure*}[htbp] @@ -2899,8 +2899,8 @@ operations concurrently: Given that operations grouped in curly braces are executed concurrently, which of the rows of Table~\ref{tab:advsync:Lock-Based Critical Sections} - are legitimate reorderings of the assignments to variables - ``A'' through ``F'' and the ACQUIRE/RELEASE operations? + are legitimate reorderings of the assignments to variables~``A'' + through~`F'' and the ACQUIRE/RELEASE operations? 
(The order in the code is {\tt *A, *B, ACQUIRE, *C, *D, RELEASE, *E, *F}.) Why or why not? \QuickQuizAnswer{ @@ -2908,18 +2908,18 @@ operations concurrently: \item Legitimate, executed in order. \item Legitimate, the lock acquisition was executed concurrently with the last assignment preceding the critical section. - \item Illegitimate, the assignment to ``F'' must follow the ACQUIRE + \item Illegitimate, the assignment to~``F'' must follow the ACQUIRE operation. \item Illegitimate, the ACQUIRE must complete before any operation in the critical section. However, the RELEASE may legitimately be executed concurrently with subsequent operations. - \item Legitimate, the assignment to ``A'' precedes the RELEASE, + \item Legitimate, the assignment to~``A'' precedes the RELEASE, as required, and all other operations are in order. - \item Illegitimate, the assignment to ``C'' must follow the ACQUIRE. - \item Illegitimate, the assignment to ``D'' must precede the RELEASE. + \item Illegitimate, the assignment to~``C'' must follow the ACQUIRE. + \item Illegitimate, the assignment to~``D'' must precede the RELEASE. \item Legitimate, all assignments are ordered with respect to the ACQUIRE and RELEASE operations. - \item Illegitimate, the assignment to ``A'' must precede the RELEASE. + \item Illegitimate, the assignment to~``A'' must precede the RELEASE. \end{enumerate} } \QuickQuizEnd @@ -2928,7 +2928,7 @@ Code containing multiple locks still sees ordering constraints from those locks, but one must be careful to keep track of which lock is which. For example, consider the code shown in Table~\ref{tab:advsync:Ordering With Multiple Locks}, which uses -a pair of locks named ``M'' and ``Q''. +a pair of locks named~``M'' and~``Q''. \begin{table}[htbp] \scriptsize\centering{\tt @@ -2947,7 +2947,7 @@ a pair of locks named ``M'' and ``Q''. \end{table} In this example, there are no guarantees as to what order the -assignments to variables ``A'' through ``H'' will appear in, other +assignments to variables~``A'' through~``H'' will appear in, other than the constraints imposed by the locks themselves, as described in the previous section. @@ -2986,11 +2986,11 @@ Table~\ref{tab:advsync:Ordering With Multiple CPUs on One Lock}? \label{tab:advsync:Ordering With Multiple CPUs on One Lock} \end{table} -In this case, either CPU~1 acquires M before CPU~2 does, or vice versa. -In the first case, the assignments to A, B, and C must precede -those to F, G, and H. +In this case, either CPU~1 acquires~M before CPU~2 does, or vice versa. +In the first case, the assignments to~A, B, and~C must precede +those to~F, G, and~H. On the other hand, if CPU~2 acquires the lock first, then the -assignments to E, F, and G must precede those to B, C, and D. +assignments to~E, F, and~G must precede those to~B, C, and~D. \subsection{The Effects of the CPU Cache} \label{sec:advsync:The Effects of the CPU Cache} @@ -3046,9 +3046,9 @@ Figure~\ref{fig:advsync:Split Caches}, in which each CPU has a split cache. This system has the following properties: \begin{enumerate} -\item An odd-numbered cache line may be in cache A, cache C, +\item An odd-numbered cache line may be in cache~A, cache~C, in memory, or some combination of the above. -\item An even-numbered cache line may be in cache B, cache D, +\item An even-numbered cache line may be in cache~B, cache~D, in memory, or some combination of the above. 
\item While the CPU core is interrogating one of its caches,\footnote{ But note that in ``superscalar'' systems, the CPU @@ -3067,7 +3067,7 @@ This system has the following properties: stores to cache lines affected by entries in those queues. \end{enumerate} -In short, if cache A is busy, but cache B is idle, then CPU~1's +In short, if cache~A is busy, but cache~B is idle, then CPU~1's stores to odd-numbered cache lines may be delayed compared to CPU~2's stores to even-numbered cache lines. In not-so-extreme cases, CPU~2 may see CPU~1's operations out -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe perfbook" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
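A note on the markup used throughout this patch: the tie character (~) is
LaTeX's non-breaking space, and \nbco{} is a helper that sets a short code
snippet in an unbreakable box, so that conditionals such as "r1 != 1" cannot
be split across a line or column break. The snippet below is a minimal,
self-contained sketch of how such macros might be defined and used; the
\co{} and \nbco{} definitions shown here are illustrative assumptions, not
necessarily perfbook's actual definitions.

    \documentclass{article}
    % Assumed stand-ins for perfbook's macros, for illustration only.
    \newcommand{\co}[1]{\texttt{#1}}      % inline code in typewriter font
    \newcommand{\nbco}[1]{\mbox{\co{#1}}} % like \co{}, but \mbox forbids breaks inside

    \begin{document}
    % The tie (~) glues a single-letter name to the preceding word, so
    % "A" can never be stranded at the start of a line or page:
    Suppose that variable~A is initially zero.

    % \nbco{} keeps the whole conditional on one line:
    After all, if \nbco{r1 != 1}, the assertion cannot fire.
    \end{document}

With markup like this, TeX may still break lines between ordinary words, but
never between "variable" and "A", and never inside the short conditional.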